Bug 139 - Swizzle needs to be high priority capability in ISA
Summary: Swizzle needs to be high priority capability in ISA
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 142
Reported: 2019-10-03 08:17 BST by Luke Kenneth Casson Leighton
Modified: 2020-11-19 03:56 GMT

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2019-10-03 08:17:48 BST
We have MV.X / MV.swizzle, and the vector overloads on LD and ST (particularly on the address base) give an indirect LD/ST. However, we are missing stride-multiplied and swizzled LD/ST.

Should they be added?
Comment 1 Jacob Lifshay 2019-10-03 10:08:36 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> We have MV.X / MV.swizzle and the vector overloads on LD and ST particularly
> on the address base gives an indirect LD/ST however we are missing
> stride-multiplied, and swizzled LD/ST.
> 
> Should they be added?

strided LD/ST is definitely needed for reading/writing shader inputs/outputs.

swizzled LD/ST with an immediate swizzle is likely to not be necessary since we can just macro-op fuse a swizzle op with the memory op, but Vulkan does require support for dynamically selected swizzled images, so we will need a dynamic swizzle op of some sort (would be nice to not require it be only in memory ops, macro op fusion can take care of opcode proliferation). Because a swizzle only changes the order of subvector elements which are all nearby, we should not fall back to gather/scatter since gather/scatter is likely to be slower.
Comment 2 Luke Kenneth Casson Leighton 2019-10-03 10:57:03 BST
For dynamic LD/ST-Swizzle, R-Type would give 3 operands, src2 could be used as a dynamic swizzle specifier, particularly if vectorised and set to 8bit it will pack very efficiently.  The rest would be just like standard LD/ST where funct7 could be the (truncated) immediate.
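
As a rough sketch of why an 8-bit specifier is enough for SUBVL up to 4: four 2-bit source indices fit in one byte. The bit order (element i in bits 2i+1:2i) and the name `unpack_swizzle8` are assumptions made for illustration only; nothing in this thread fixes them.

```rust
/// Unpack an 8-bit swizzle specifier into four 2-bit source indices.
/// Assumed layout (hypothetical): destination element i takes its
/// source index from bits [2*i + 1 : 2*i].
fn unpack_swizzle8(spec: u8) -> [usize; 4] {
    let mut out = [0usize; 4];
    for (i, slot) in out.iter_mut().enumerate() {
        // Shift the wanted pair down to the bottom and mask it off.
        *slot = usize::from((spec >> (2 * i)) & 0b11);
    }
    out
}
```

Under that assumed layout, `0b11_00_10_01` unpacks to the indices `[1, 2, 0, 3]`, i.e. the pattern "yzxw".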
Comment 3 Luke Kenneth Casson Leighton 2019-10-03 10:59:58 BST
Ok what format for strided LD/ST?

Not fallback to shuffle sounds good. Likewise the macro op fusion.

Also how or does SUBVL interact here?
Comment 4 Jacob Lifshay 2019-10-03 11:20:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> Ok what format for strided LD/ST?

I/S-type? (base, src/dest, and 1 or more stride bitfields, can convert to gather/scatter for strides that don't fit, suggest strides of n << s where n is unsigned 7 bits and s is unsigned 5 bits though larger formats with more immediate bits would be nice)
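
A minimal sketch of the suggested n << s stride, assuming the 7-bit n and 5-bit s are packed into one 12-bit immediate. The packing order (n in imm[11:5], s in imm[4:0]) and both function names are hypothetical; only the field widths come from the suggestion above.

```rust
/// Decode a 12-bit stride immediate as n << s.
/// Assumed packing (hypothetical): imm[11:5] = n (7 bits), imm[4:0] = s (5 bits).
fn decode_stride(imm12: u16) -> u64 {
    let n = u64::from((imm12 >> 5) & 0x7f);
    let s = u32::from(imm12 & 0x1f);
    n << s
}

/// Byte address of each of `vl` elements for a strided LD/ST starting at `base`.
fn strided_addresses(base: u64, stride: u64, vl: usize) -> Vec<u64> {
    (0..vl as u64)
        .map(|i| base.wrapping_add(i.wrapping_mul(stride)))
        .collect()
}
```

With n = 3 and s = 4 the stride is 48 bytes, so three elements from 0x1000 land at 0x1000, 0x1030 and 0x1060. Strides that cannot be written as n << s would need the gather/scatter fallback mentioned above.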

> 
> Not fallback to shuffle sounds good. Likewise the macro op fusion.
> 
> Also how or does SUBVL interact here?

Swizzles (different than shuffle) are usually at the intra-subvector level (same VL index over all input/output vectors), inter-subvector swizzle would be a separate instruction if needed (much less common, can fall back to mv.x).
Comment 5 Jacob Lifshay 2019-10-03 11:23:17 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> For dynamic LD/ST-Swizzle, R-Type would give 3 operands, src2 could be used
> as a dynamic swizzle specifier, particularly if vectorised and set to 8bit
> it will pack very efficiently.  The rest would be just like standard LD/ST
> where funct7 could be the (truncated) immediate.

would also like dynamic-swizzle without LD/ST.
Comment 6 Jacob Lifshay 2019-10-03 11:24:51 BST
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > For dynamic LD/ST-Swizzle, R-Type would give 3 operands, src2 could be used
> > as a dynamic swizzle specifier, particularly if vectorised and set to 8bit
> > it will pack very efficiently.  The rest would be just like standard LD/ST
> > where funct7 could be the (truncated) immediate.
> 
> would also like dynamic-swizzle without LD/ST.

swizzle is usually uniform, so we should have both scalar and vector swizzle specifiers.
Comment 7 Luke Kenneth Casson Leighton 2019-10-03 14:04:40 BST
(In reply to Jacob Lifshay from comment #4)

> > Also how or does SUBVL interact here?
> 
> Swizzles (different than shuffle) are usually at the intra-subvector level
> (same VL index over all input/output vectors), 

so SUBVL=4, then LD.swizzle-immediate="yzxw" would do reordering:

* SUBVL-element 0 goes into position/strided-offset y (1)
* SUBVL-element 1 goes into position/strided-offset z (2)
* SUBVL-element 2 goes into position/strided-offset x (0)
* SUBVL-element 3 goes into position/strided-offset w (3)

something like that?

> inter-subvector swizzle would
> be a separate instruction if needed (much less common, can fall back to
> mv.x).

i still can't see how vector-based (SUBVL=1) swizzle would even work,
with swizzles being only in the range 0-3.

for sub-vectors we get away with an 8-bit immediate or a register where
only the first 8 bits are used.

for the case where SUBVL=1, where would you even _start_ the swizzling
from?  the extreme case it actually turns into a VINDEXed gather/scatter
(so, yes, MV.X)


half-way in between, for when SUBVL=1, doesn't really seem to work.
Comment 8 Jacob Lifshay 2019-10-03 14:15:20 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #4)
> 
> > > Also how or does SUBVL interact here?
> > 
> > Swizzles (different than shuffle) are usually at the intra-subvector level
> > (same VL index over all input/output vectors), 
> 
> so SUBVL=4, then LD.swizzle-immediate="yzxw" would do reordering:
> 
> * SUBVL-element 0 goes into position/strided-offset y (1)
> * SUBVL-element 1 goes into position/strided-offset z (2)
> * SUBVL-element 2 goes into position/strided-offset x (0)
> * SUBVL-element 3 goes into position/strided-offset w (3)
> 
> something like that?

yeah, though swizzles specify which source element each dest. element comes from rather than the other way around, because that allows a swizzle xxwz where one source element (x) is duplicated and another (y) is ignored completely.
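
To illustrate the "dest selects source" direction, here is a small sketch using letter patterns. The function name `swizzle4` and the use of `i32` elements are illustrative assumptions, not part of any proposed instruction.

```rust
/// Build each destination element by looking up the source element named
/// by the corresponding pattern letter. Returns None for a malformed pattern.
fn swizzle4(src: [i32; 4], pattern: &str) -> Option<[i32; 4]> {
    if pattern.len() != 4 {
        return None;
    }
    let mut dest = [0i32; 4];
    for (slot, letter) in dest.iter_mut().zip(pattern.chars()) {
        let index = match letter {
            'x' => 0,
            'y' => 1,
            'z' => 2,
            'w' => 3,
            _ => return None,
        };
        // The same source index may be used by several destinations,
        // and some source elements may never be read at all.
        *slot = src[index];
    }
    Some(dest)
}
```

With src = [10, 20, 30, 40], "xxwz" gives [10, 10, 40, 30]: x is duplicated and y is dropped, which a "source selects dest" mapping could not express.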

> 
> > inter-subvector swizzle would
> > be a separate instruction if needed (much less common, can fall back to
> > mv.x).
> 
> i still can't see how vector-based (SUBVL=1) swizzle would even work,
> with swizzles being only in the range 0-3.

for a single input swizzle with SUBVL=1, identity or constant are the only allowed swizzles.

Do note that Vulkan allows swizzles to set elements to constant 0, 1, min/max int/uint, and 1.0 as well as x, y, z, or w.
Comment 9 Luke Kenneth Casson Leighton 2019-10-03 15:05:39 BST
https://libre-riscv.org/simple_v_extension/specification/ld.x/

Started this page.
Comment 10 Luke Kenneth Casson Leighton 2019-10-03 15:06:57 BST
https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkComponentSwizzle.html

Is that up to date? It suggests there is no min/max, and that max is just the enum max.
Comment 11 Luke Kenneth Casson Leighton 2019-10-03 16:09:01 BST
Hmm ok so we need:

* LD unit strided (an immed)
* LD element strided (covered by std LD)
* LD index strided (covered by std LD)

and

* Vulkan swizzle, 0, 1, x, y, w, z per SUBVL-Group
* Swizzle by element can kinda be done with MV.X

Do we need:

* strided swizzle
* element strided swizzle
* index strided swizzle
Comment 12 Jacob Lifshay 2019-10-03 17:59:24 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/
> VkComponentSwizzle.html
> 
> Is that uptodate? It suggests no minmax, that max is the enum max.

yes, that's up to date (everything under /registry/vulkan/specs/ is where they publish new versions).

the max comes about from using 1 as the constant on normalized integers where 1.0 maps to uint/int max.

I thought min would be handy, though we don't need it.
Comment 13 Jacob Lifshay 2019-10-03 18:38:42 BST
(In reply to Luke Kenneth Casson Leighton from comment #11)
> Hmm ok so we need:
> 
> * LD unit strided (an immed)
> * LD element strided (covered by std LD)
> * LD index strided (covered by std LD)
> 
> and
> 
> * Vulkan swizzle, 0, 1, x, y, w, z per SUBVL-Group
> * Swizzle by element can kinda be done with MV.X
> 
> Do we need:
> 
> * strided swizzle
> * element strided swizzle
> * index strided swizzle

probably?

I think it would be better to separate the swizzle op from the ld/st op, allowing the swizzle instruction to be reused for reg -> reg swizzle (requiring only one swizzle ALU on small implementations) and larger implementations can macro-op fuse the swizzle with ld/st. this will also reduce the opcode proliferation to 12 or so (ld*, st*, swizzle) instead of the 20-30 that would otherwise be needed due to n^2 proliferation.

so, for swizzle ops, I think we should have:
[f]swizzlei: i-type; swizzle is immediate
[f]swizzle: r-type; swizzle is rs2

also have (less of a requirement):
[f]swizzle2: r4-type; swizzle is rs2
[f]swizzle2i: 3 register with at least 12 immediate bits; swizzle is immediate
8 instruction matrix transpose algorithm depends on fswizzle2[i]

the swizzle is specified with 3*SUBVL bits (trap on nonzero unused bits, allowing additional swizzle modes to be added later) encoded as:
000: rs1.x
001: rs1.y
010: rs1.z
011: rs1.w
100: rs3.x or 0 or 0.0
101: rs3.y or 1 or 1.0
110: rs3.z or int_max or -1.0
111: rs3.w or uint_max or 0.5 (or maybe pi or something else for fswizzle[i])
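
A sketch of how a 3*SUBVL-bit selector could be split into fields under the encoding above, with the trap on nonzero unused bits modelled as an `Err`. The `Source` enum and `decode_selector` are hypothetical names, and limiting SUBVL to 1..=4 is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    /// Fields 0b000..=0b011: element x/y/z/w (index 0..=3) of rs1.
    Rs1(usize),
    /// Fields 0b100..=0b111: element of rs3 (swizzle2) or a constant slot (swizzle).
    Rs3OrConst(usize),
}

/// Split a selector into `subvl` 3-bit fields, least significant field first.
/// Returns Err(()) where the proposal would trap on nonzero unused bits.
fn decode_selector(selector: u64, subvl: usize) -> Result<Vec<Source>, ()> {
    assert!((1..=4).contains(&subvl), "sketch only covers SUBVL 1..=4");
    if selector >> (3 * subvl) != 0 {
        return Err(());
    }
    Ok((0..subvl)
        .map(|i| {
            let field = ((selector >> (3 * i)) & 0b111) as usize;
            if (field & 0b100) == 0 {
                Source::Rs1(field)
            } else {
                Source::Rs3OrConst(field & 0b11)
            }
        })
        .collect())
}
```

For example, the octal selector 0o7620 with SUBVL=4 decodes to rs1.x, rs1.z, then slots 2 and 3 of the second source.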
Comment 14 Luke Kenneth Casson Leighton 2019-10-03 20:39:25 BST
Is that MV.swizzle you mean, for macro op fusion? and FMV.swizzle

R4 types are a pig, they take an entire major opcode (or, funct3 so 8 of them, only). Even I type is big. Need to be really really certain there's benefit to using them.

FP/int times reg/imm swizzle, where each op is either I-type or R-type: that's 4 funct3 minor ops. If using the RVV Major opcode we already have SV.setvl, so that leaves only 3 funct3 minor opcodes.

I type does 12 bit immed so yes could be used.

Like the pi idea.

New 3 reg format doable by using RVC encoding, or 4 3 3 for rd rs1 rs2.
Comment 15 Jacob Lifshay 2019-10-04 00:50:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> Is that MV.swizzle you mean, for macro op fusion? and FMV.swizzle
> 
> R4 types are a pig, they take an entire major opcode (or, funct3 so 8 of
> them, only). Even I type is big. Need to be really really certain there's
> benefit to using them.
> 
> F int times reg imm swizzle where both ops are either I type or R type
> that's 4 funct3 minor ops, and if using the RVV Major opcode we already have
> SV.setvl so that leaves only 3 funct3 minor opcodes left.
> 
> I type does 12 bit immed so yes could be used.
> 
> Like the pi idea.
> 
> New 3 reg format doable by using RVC encoding, or 4 3 3 for rd rs1 rs2.

So, how about this:

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+

Note it's the equivalent of 3 I-type instructions (with a bit of spare space next to [f]swizzle).

the element size would be specified using the SVPrefix encodings or a similar mechanism. Since swizzle is close to useless as a scalar instruction, the opcode space can be shared with instructions that can't/shouldn't be vectorized.

[f]swizzle2i is left for longer instruction formats if needed.
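
A sketch of packing the [f]swizzle2 rows of the table into a 32-bit word. The table does not give bits 6:0, so the `opcode` parameter is a caller-supplied placeholder, and `encode_swizzle2` is a hypothetical helper name.

```rust
/// Pack the register fields of the proposed (f)swizzle2 layout:
/// 31:27 rs3, 26:25 funct2, 24:20 rs2, 19:15 rs1, 14:12 funct3, 11:7 rd.
/// Bits 6:0 (the major opcode) are not specified in the table, so
/// `opcode` is a placeholder chosen by the caller.
fn encode_swizzle2(opcode: u32, rd: u32, rs1: u32, rs2: u32, rs3: u32, float: bool) -> u32 {
    for &reg in &[rd, rs1, rs2, rs3] {
        assert!(reg < 32, "register numbers are 5 bits");
    }
    assert!(opcode < 128, "opcode is 7 bits");
    let funct2 = u32::from(float); // 00 = swizzle2, 01 = fswizzle2
    let funct3 = 0b000u32;
    (rs3 << 27) | (funct2 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode
}
```

The only difference between swizzle2 and fswizzle2 in this layout is bit 25, which is why the pair costs no more than a single R4-style slot under funct3 = 000.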
Comment 16 Luke Kenneth Casson Leighton 2019-10-04 06:16:44 BST
https://libre-riscv.org/simple_v_extension/specification/mv.x/
Dropped table in here as it is a MV effectively.

May move to separate swizzle instruction (and separate page)

Good call on being able to overload (no SUBVL, scalar only).

Remind me, why 4 regs on swizzle?
Comment 17 Jacob Lifshay 2019-10-04 06:33:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #16)
> https://libre-riscv.org/simple_v_extension/specification/mv.x/
> Dropped table in here as it is a MV effectively.
> 
> May move to separate swizzle instruction (and separate page)
> 
> Good call on being able to overload (no SUBVL, scalar only).
> 
> Remind me, why 4 regs on swizzle?

2-input swizzle ([f]swizzle2):
first input vector: rs1
dynamic swizzle selector: rs2
second input vector: rs3
output: rd
Comment 18 Luke Kenneth Casson Leighton 2019-10-04 07:34:07 BST
(In reply to Jacob Lifshay from comment #17)

> > Remind me, why 4 regs on swizzle?
> 
> 2-input swizzle ([f]swizzle2):

ohhh yehyehyeh, i remember now - argh where did i see a link
to a spec on this?  this one isn't it:
https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Swizzling

we need to make sure, have some pseudocode of the operations.
swizzle i understand, swizzle2 i don't.
Comment 19 Luke Kenneth Casson Leighton 2019-10-04 08:47:02 BST
i just managed to visualise the example that i saw.  it was of two
vectors where the "swizzler" (selector) was twice the length of
each of the individual vectors.

the example contained two vec4 args plus a swizzle array of *eight*
selectors, where the first four were applied to src1 and the last
four were applied to src3.

that means we need to use arguments for the selector (src2) of up
to 24 bits.

it also means that the destination vector is *twice* the length of the
src registers, which is not something that's been taken into consideration.

this is sufficiently weird that i'm concerned about the complexity
in hardware: it would be literally *the* only hardware-level instruction
where rules on VL length had to be violated (made non-uniform)

panfrost only has 8-bit swizzle specifiers:
https://gitlab.freedesktop.org/panfrost/mali-isa-docs/blob/master/Midgard.md

i'm wondering if this is just too complicated and what assembler would
look like, with and without the full swizzle/swizzle2 (3*SUBVL) capabilities.
Comment 20 Jacob Lifshay 2019-10-04 10:51:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #19)
> i just managed to visualise the example that i saw.  it was of two
> vectors where the "swizzler" (selector) was twice the length of
> each of the individual vectors.

That's not what I was thinking of for [f]swizzle2.

[f]swizzle2 has all 3 inputs with the same VL or scalar where the swizzle selector has SUBVL=1.

So, as an example:
SUBVL = 4
VL = 2
rs1 (vector) = [[11, 12, 13, 14], [15, 16, 17, 18]]
rs2 (scalar) = 0o7620 (start at LSB digit and count up)
rs3 (vector) = [[19, 20, 21, 22], [23, 24, 25, 26]]

swizzle2 will produce
rd = [[11, 13, 21, 22], [15, 17, 25, 26]]

if rs2 was instead:
rs2 (vector) = [0o4400, 0o3647]

swizzle2 will produce
rd = [[11, 11, 19, 19], [26, 23, 25, 18]]

If it's still confusing, I can try to make a diagram.
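
In lieu of a diagram, the example can be checked with a short sketch. It assumes the 3-bit encoding proposed earlier in the thread (digits 0 to 3 pick rs1.x/y/z/w, digits 4 to 7 pick rs3.x/y/z/w), equal source and destination SUBVL, and a flat-slice model of the registers; the function itself is illustrative only and omits the trap checks.

```rust
/// Two-input swizzle over `vl` groups of `subvl` elements each.
/// Selector digits are 3 bits, least significant first:
/// 0..=3 pick rs1.x/y/z/w, 4..=7 pick rs3.x/y/z/w.
/// `rs2` holds either one selector (scalar, shared by all groups)
/// or one selector per group (vector).
fn swizzle2(rs1: &[i32], rs2: &[u64], rs3: &[i32], vl: usize, subvl: usize) -> Vec<i32> {
    let mut rd = Vec::with_capacity(vl * subvl);
    for vindex in 0..vl {
        let selector = if rs2.len() == 1 { rs2[0] } else { rs2[vindex] };
        for i in 0..subvl {
            let field = ((selector >> (3 * i)) & 0b111) as usize;
            let src = if field < 4 { rs1 } else { rs3 };
            // Only ever index within the same group (same vindex).
            rd.push(src[vindex * subvl + (field & 0b11)]);
        }
    }
    rd
}
```

Feeding in the rs1 and rs3 values above with the scalar selector 0o7620 reproduces the first rd, and the vector selector [0o4400, 0o3647] reproduces the second.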

Back to instruction encodings:
I did just realize that we will probably need the extra encoding space anyway since DESTSUBVL doesn't necessarily equal SRCSUBVL.

Note that rs2 always has SUBVL=1, so we don't need to encode that.

One alternative is to only support swizzle2 with DESTSUBVL == SRCSUBVL and to support swizzle[i] with separate DESTSUBVL and SRCSUBVL.

to generate a swizzle2 with non-matching SUBVL values, use a swizzlei right after to remove the unneeded elements.
Comment 21 Luke Kenneth Casson Leighton 2019-10-04 11:02:33 BST
worked it out - i got it confused with shuffle2, here:
https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/shuffle.html

more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
be applied globally.  the CSRs would need a total redesign to cope.
Comment 22 Jacob Lifshay 2019-10-04 11:10:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #21)
> worked it out - i got it confused with shuffle2, here:
> https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/shuffle.html
> 
> more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
> be applied globally.  the CSRs would need a total redesign to cope.

Well, we should redesign them then, since we need multiple SUBVLs to handle code like the following (from SuperTuxKart):

https://github.com/supertuxkart/stk-code/blob/adc6b2a1764863fce17b63fec1f5bb0453ae275a/data/shaders/sp_grass.frag#L35

    vec4 layer_2 = sampleTextureLayer2(uv);
    o_diffuse_color = vec4(col.xyz, layer_2.z);

    o_normal_color.xy = 0.5 * EncodeNormal(normalize(normal)) + 0.5;
    o_normal_color.zw = layer_2.xy;

Note how only 2 elements of layer_2 are split off even though layer_2 has SUBVL=4 (vec4)

swizzle2 can be used to calculate o_normal_color (which is a vec4) by combining the result of EncodeNormal (vec2) with layer_2.xy (vec2 split off from vec4)
Comment 23 Jacob Lifshay 2019-10-04 11:12:30 BST
(In reply to Luke Kenneth Casson Leighton from comment #21)
> worked it out - i got it confused with shuffle2, here:
> https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/shuffle.html
> 
> more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
> be applied globally.  the CSRs would need a total redesign to cope.

I had always intended SUBVL to vary from value to value, even faster than VL would vary.
Comment 24 Luke Kenneth Casson Leighton 2019-10-04 12:41:29 BST
(In reply to Jacob Lifshay from comment #23)

> > more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
> > be applied globally.  the CSRs would need a total redesign to cope.
> 
> I had always intended SUBVL to vary from value to value, even faster than VL
> would vary.

Varying is not a problem at all. Having *two* SUBVLs (or even worse three), one for src1, one for src2 and another for rd, we're into major redesign territory.

The example there, all of the src and dest vectors are all vec4 ie SUBVL=4.

However what is missing is a per SUBVL-element predicate mask, and if you recall we specifically designed the SUBVL predication to apply the predicate bit to the whole group.

Without going into redesigns, the solution would be to ensure a full vector is copied.

In the example given, it turns out that the first two parts of the colour come from line 34, and the last two from line 35.

If that is not done, then by way of various passes I would expect that the elements be copied by non-SUBVL methods (using VL and predicate masking) followed by a swizzle copy that placed the one, two, or three unaltered elements into the dest.

OR...

This is perhaps what "identity" is for.

If identity is intended to mean that the indexed subelement is unaltered, we have a way to leave xy alone:

Col.zw = srccol.xy

Becomes

Swizzle Col, srccol, {identity, identity, x, y}

Meaning:

* leave col.x untouched
* leave col.y untouched
* set col.z to srccol.x
* set col.w to srccol.y

A separate pass would notice the identity  overlap with line 34 and combine them, but that is a different story.

Bottom line, think it through from the normal SIMD perspective that other GPUs use, they just simply don't have the capability to do mixed vec2,vec3,vec4 operations, they are all vec2 only, vec3 only or vec4 only.

(edit: at least, i haven't seen any)
Comment 25 Jacob Lifshay 2019-10-04 21:09:27 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> (In reply to Jacob Lifshay from comment #23)
> 
> > > more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
> > > be applied globally.  the CSRs would need a total redesign to cope.
> > 
> > I had always intended SUBVL to vary from value to value, even faster than VL
> > would vary.
> 
> Varying is not a problem at all. Having *two* SUBVLs (or even worse three),
> one for src1, one for src2 and another for rd, we're into major redesign
> territory.
> 
> The example there, all of the src and dest vectors are all vec4 ie SUBVL=4.
> 
> However what is missing is a per SUBVL-element predicate mask, and if you
> recall we specifically designed the SUBVL predication to apply the predicate
> bit to the whole group.
> 
> Without going into redesigns, the solution would be to ensure a full vector
> is copied.
> 
> In the example given, it turns out that the first two parts of the colour
> come from line 34, and the last two from line 35.
> 
> If that is not done, then by way of various passes I would expect that the
> elements be copied by non-SUBVL methods (using VL and predicate masking)
> followed by a swizzle copy that placed the one, two, or three unaltered
> elements into the dest.
> 
> OR...
> 
> This is perhaps what "identity" is for.

In the Vulkan API, identity is syntactic sugar for x, y, z, or w, matching the element written to.

> 
> If identity is intended to mean that the indexed subelement is unaltered, we
> have a way to leave xy alone:

even if we had an instruction like that, we would still need different destsubvl and srcsubvl for the swizzle on line 34, since EncodeNormal returns vec2 and it writes to o_normal_color which is a vec4.

> 
> Col.zw = srccol.xy
> 
> Becomes
> 
> Swizzle Col, srccol, {identity, identity, x, y}
> 
> Meaning:
> 
> * leave col.x untouched
> * leave col.y untouched
> * set col.z to srccol.x
> * set col.w to srccol.y
> 
> A separate pass would notice the identity  overlap with line 34 and combine
> them, but that is a different story.
> 
> Bottom line, think it through from the normal SIMD perspective that other
> GPUs use, they just simply don't have the capability to do mixed
> vec2,vec3,vec4 operations, they are all vec2 only, vec3 only or vec4 only.

AMDGPU just converts everything to individual 32-bit units and does operations element by element, which is an alternate path to needing SUBVL, though I would be worried about increased instruction count and ALU op packing inefficiencies, because larger shaders are likely to have very low VL values (less than 4).
Comment 26 Jacob Lifshay 2019-10-04 21:14:29 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> (In reply to Jacob Lifshay from comment #23)
> 
> > > more in a mo - twin-SUBVL doesn't sound right.  SUBVL is intended to
> > > be applied globally.  the CSRs would need a total redesign to cope.
> > 
> > I had always intended SUBVL to vary from value to value, even faster than VL
> > would vary.
> 
> Varying is not a problem at all. Having *two* SUBVLs (or even worse three),
> one for src1, one for src2 and another for rd, we're into major redesign
> territory.

not really, swizzles would be the only ops with different SUBVL values.

note that in the suggested encoding, rs2's subvl is always 1, and we can assume that rs1 and rs3 have matching SUBVLs (can use an additional swizzle when they don't match), so we only need to encode the new (dest) SUBVL (which would be written to the csr) and can use the old (src) subvl from the csr.
Comment 27 Luke Kenneth Casson Leighton 2019-10-04 21:40:50 BST
rs2 subvl=1 makes sense, i was just looking at CORDIC and SLERP and they need the same trick.

An extra SUBVL just is not that simple.

It would be better to have swizzle2, swizzle3 and swizzle4 to truncate the subvector.

Noted about identity.

The alternative is to have a "ignore" swizzle enum value.

Ignore, identity, 0, 1, x y z w

Ignore basically brings back bit level predication on the SUBVL elements.

It then becomes possible to combine ignore at either the start or the end in order to copy data from vec4 to vec2 or vec2 to vec4, by treating *both* as a vec4 and "ignoring" 2 of the subelements.
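
A minimal sketch of what an "ignore" selector would mean, treating both operands as vec4. The `Sel` enum and `swizzle_with_ignore` are hypothetical names. Note that the destination has to be read as well as written, since ignored elements keep their old value.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sel {
    /// Leave the destination element as it was.
    Ignore,
    /// Copy source element 0..=3 (x, y, z, w).
    Elem(usize),
}

/// Per-element swizzle where `Ignore` acts like a cleared predicate bit.
fn swizzle_with_ignore(dest: &mut [f32; 4], src: &[f32; 4], sel: [Sel; 4]) {
    for (slot, s) in dest.iter_mut().zip(sel.iter()) {
        if let Sel::Elem(index) = *s {
            *slot = src[index];
        }
    }
}
```

For `Col.zw = srccol.xy` the selector would be {Ignore, Ignore, x, y}: col.x and col.y are left alone, col.z and col.w receive srccol.x and srccol.y.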

Adding special SUBVL CSRs just for this one case is really not a good idea. Every extra CSR risks increasing beyond the 32/64 bits, resulting in greater context switch latency.

If however more than one op turns out to need it and there is no "workaround" we can reevaluate that.
Comment 28 Luke Kenneth Casson Leighton 2019-10-04 21:44:45 BST
Remember also we are quite lucky in SV in that vec4 can be constructed from scalar registers. So we can cheat slightly in certain cases by mapping one FP32 vec2 to x8 and another to x9, then treat x8 as a vec4.

:)
Comment 29 Jacob Lifshay 2019-10-05 00:00:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> rs2 subvl=1 makes sense, i was just looking at CORDIC and SLERP and they
> need the same trick.
> 
> An extra SUBVL just is not that simple.
> 
> It would be better to have swizzle2, swizzle3 and swizzle4 to truncate the
> subvector.

that's identical to adding a destsubvl field to the instruction, which is mostly what I was suggesting. the existing subvl mechanisms would give srcsubvl.
Comment 30 Jacob Lifshay 2019-10-05 00:02:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> Remember also we are quite lucky in SV in that vec4 can be constructed from
> scalar registers. So we can cheat slightly in certain cases by mapping one
> FP32 vec2 to x8 and another to x9, then treat x8 as a vec4.

that only works when VL=1, since the elements are packed tightly.
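
A sketch of why the trick breaks for VL > 1, modelling two adjacent registers as two flat slices laid back to back. This flat model and the helper name `reinterpret_as_vec4` are simplifying assumptions, not a description of the real register file.

```rust
/// Lay two tightly packed vec2 vectors back to back (a stand-in for
/// adjacent registers) and reinterpret the result as vec4 groups.
fn reinterpret_as_vec4(a: &[f32], b: &[f32]) -> Vec<[f32; 4]> {
    let flat: Vec<f32> = a.iter().chain(b.iter()).copied().collect();
    flat.chunks_exact(4)
        .map(|c| [c[0], c[1], c[2], c[3]])
        .collect()
}
```

With VL=1, a = [1, 2] and b = [3, 4] give the intended vec4 [1, 2, 3, 4]. With VL=2, a = [1, 2, 5, 6] and b = [3, 4, 7, 8] give [1, 2, 5, 6] as the first group: both subvectors of a end up together, instead of a's first subvector being paired with b's.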
Comment 31 Luke Kenneth Casson Leighton 2019-10-05 05:47:58 BST
(In reply to Jacob Lifshay from comment #30)


> that only works when VL=1, since the elements are packed tightly.

sshh, don't tell noone :)
Comment 32 Luke Kenneth Casson Leighton 2019-10-05 09:28:35 BST
(In reply to Jacob Lifshay from comment #29)

> > It would be better to have swizzle2, swizzle3 and swizzle4 to truncate the
> > subvector.
> 
> that's identical to adding a destsubvl field to the instruction, which is
> mostly what I was suggesting. the existing subvl mechanisms would give
> srcsubvl.

It's using the opcode to "store" the state, which, fortunately, means no CSR changes (whew)
Comment 33 Luke Kenneth Casson Leighton 2019-10-05 09:40:25 BST
> Ignore, identity, 0, 1, x y z w

So identity I think is a red herring, it is a "lazy" way of leaving out the swizzle specifier, ie by the time it gets to an actual opcode, "identity" has to have been removed and replaced with x->x or y->y etc.

Does that sound about right?
Comment 34 Jacob Lifshay 2019-10-06 04:42:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #33)
> > Ignore, identity, 0, 1, x y z w
> 
> So identity I think is a red herring, it is a "lazy" way of leaving out the
> swizzle specifier, ie by the time it gets to an actual opcode, "identity"
> has to have been removed and replaced with x->x or y->y etc.
> 
> Does that sound about right?

yes.


I do think we shouldn't have ignore, because that increases the number of registers that need to be read for both swizzle and swizzle2 (need to read dest reg to combine values except in special circumstances) and swizzle2 (2 source swizzle, not swizzle with destsubvl=2) provides the functionality we need.

additionally, the extra encoding space is needed to allow producing all of 0, 1, int_max (1.0 for snorm), and uint_max (1.0 for unorm) for signed/unsigned integers and signed/unsigned normalized integers (basically fixed-point with all non-sign bits after the radix point).

note that normalized integers are required by the Vulkan spec.
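
For reference, a sketch of why the constant "1" becomes uint_max or int_max on normalized integers, shown for 8-bit elements. The function names are illustrative; the conversion rules follow the usual UNORM/SNORM definitions.

```rust
/// UNORM8 to float: 0..=255 maps linearly onto 0.0..=1.0.
fn unorm8_to_f32(v: u8) -> f32 {
    f32::from(v) / 255.0
}

/// SNORM8 to float: -127..=127 maps onto -1.0..=1.0, and -128 clamps to -1.0.
fn snorm8_to_f32(v: i8) -> f32 {
    (f32::from(v) / 127.0).max(-1.0)
}
```

So 1.0 is stored as 0xFF (uint_max) for unorm and 0x7F (int_max) for snorm, which is why both maxima need to be reachable as swizzle constants alongside plain 0 and 1.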
Comment 35 Luke Kenneth Casson Leighton 2019-10-06 07:20:14 BST
(In reply to Jacob Lifshay from comment #34)

> I do think we shouldn't have ignore, because that increases the number of
> registers that need to be read for both swizzle and swizzle2 (need to read
> dest reg to combine values except in special circumstances) and swizzle2 (2
> source swizzle, not swizzle with destsubvl=2) provides the functionality we
> need.

i'm not able to say because i really do not understand how swizzle2 works.
i've spent several days researching it and simply cannot find anything,
not even based on the hints of swizzle2 being in LLVM IR.

can you please describe it in pseudocode in a simple loop?
Comment 36 Jacob Lifshay 2019-10-06 08:06:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> i'm not able to say because i really do not understand how swizzle2 works.
> i've spent several days researching it and simply cannot find anything,
> not even based on the hints of swizzle2 being in LLVM IR.
> 
> can you please describe it in pseudocode in a simple loop?

sure.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fa181fd305c70bde0372e0d99d873685

swizzle2 rd, rs1, rs2, rs3

fn swizzle2<Elm, Selector>(
    rd: &mut [Elm],
    rs1: &[Elm],
    rs2: &[Selector],
    rs3: &[Elm],
    vl: usize,
    destsubvl: usize,
    srcsubvl: usize)
where
    // Elm is a copyable type
    Elm: Copy,
    // Selector is a copyable type that can be converted into u64
    Selector: Copy + Into<u64>,
{
    const FIELD_SIZE: usize = 3;
    const FIELD_MASK: u64 = 0b111;
    for vindex in 0..vl {
        let selector = rs2[vindex].into();
        // selector's type is u64
        if selector >> (FIELD_SIZE * destsubvl) != 0 {
            // handle illegal instruction trap
        }
        for i in 0..destsubvl {
            let mut sel_field = selector >> (FIELD_SIZE * i);
            sel_field &= FIELD_MASK;
            // fields 0b000..=0b011 select from rs1, 0b100..=0b111 from rs3
            let src = if (sel_field & 0b100) == 0 {
                rs1
            } else {
                rs3
            };
            sel_field &= 0b11;
            if sel_field as usize >= srcsubvl {
                // handle illegal instruction trap
            }
            let value = src[vindex * srcsubvl + (sel_field as usize)];
            rd[vindex * destsubvl + i] = value;
        }
    }
}
Comment 37 Luke Kenneth Casson Leighton 2019-10-06 08:53:26 BST
(In reply to Jacob Lifshay from comment #36)

> > can you please describe it in pseudocode in a simple loop?
> 
> sure.

ah superb. now i get it.

>
>             let mut sel_field = selector >> (FIELD_SIZE * i);
>             sel_field &= FIELD_MASK;
>             let src = if (sel_field & 0b100) == 0 {

This is the bit where I was talking about "if sel_field != 0b111" (to represent "masked out" / "ignore").

>                 rs1
>             } else {
>                 rs3
>             };

After this if you have "if rs1 == x0 continue" then swizzle may be implemented as simply setting rs3 to x0

destsubvl must *not* be a CSR, it can be a fmt3 subencoding.

This does however leave out being able to do setting to 0, 1 or other constants. I wonder why other ISAs do not have the constants as part of the ISA.

OH hang on. If SUBVL is to be ignored on rs2 (special treatment) we could also hypothetically use (set) the predicate on rs2 as the rs1/rs3 sub-element selector mask.

The only downside of that being that it is limited to 64 bits, so 64/SUBVL is 16 when SUBVL is 4, you can only cover up to 16 long VL.

More special treatment, limited range. Not looking attractive. 3 bits for the rs2 selector is much better.

L.
Comment 38 Jacob Lifshay 2019-10-06 09:17:08 BST
(In reply to Jacob Lifshay from comment #36)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > i'm not able to say because i really do not understand how swizzle2 works.
> > i've spent several days researching it and simply cannot find anything,
> > not even based on the hints of swizzle2 being in LLVM IR.
> > 
> > can you please describe it in pseudocode in a simple loop?
> 
> sure.
> 
> https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fa181fd305c70bde0372e0d99d873685
> 
> swizzle2 rd, rs1, rs2, rs3

There's also the swizzle instruction:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=17d9891a5ceae49402a948f2cbf29ba6

swizzle rd, rs1, rs2

pub trait SwizzleConstants: Copy + 'static {
    const CONSTANTS: &'static [Self; 4];
}

impl SwizzleConstants for u8 {
    const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
}

impl SwizzleConstants for u16 {
    const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
}

impl SwizzleConstants for f32 {
    const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
}

// impl for other types too...

pub fn swizzle<Elm, Selector>(
    rd: &mut [Elm],
    rs1: &[Elm],
    rs2: &[Selector],
    vl: usize,
    destsubvl: usize,
    srcsubvl: usize)
where
    Elm: SwizzleConstants,
    // Selector is a copyable type that can be converted into u64
    Selector: Copy + Into<u64>,
{
    const FIELD_SIZE: usize = 3;
    const FIELD_MASK: u64 = 0b111;
    for vindex in 0..vl {
        let selector = rs2[vindex].into();
        // selector's type is u64
        if selector >> (FIELD_SIZE * destsubvl) != 0 {
            // handle illegal instruction trap
        }
        for i in 0..destsubvl {
            let mut sel_field = selector >> (FIELD_SIZE * i);
            sel_field &= FIELD_MASK;
            let src = if (sel_field & 0b100) == 0 {
                &rs1[(vindex * srcsubvl)..]
            } else {
                SwizzleConstants::CONSTANTS
            };
            sel_field &= 0b11;
            if sel_field as usize >= srcsubvl {
                // handle illegal instruction trap
            }
            let value = src[sel_field as usize];
            rd[vindex * destsubvl + i] = value;
        }
    }
}
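
A compilable variant of the swizzle loop above, offered as a sketch rather than a reference implementation. It assumes u16 selectors, returns a hypothetical `IllegalInstruction` error in place of the trap comments, applies the `srcsubvl` range check only when the field selects from rs1, and uses `Elm::CONSTANTS` so that both branches yield an element of the same type. Elements written before an error is detected stay written; whether a real trap should behave that way is not settled here.

```rust
/// Per-type constants selectable by a swizzle field (values as listed above).
pub trait SwizzleConstants: Copy + 'static {
    const CONSTANTS: &'static [Self; 4];
}

impl SwizzleConstants for u8 {
    const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
}

impl SwizzleConstants for f32 {
    const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
}

/// Hypothetical stand-in for the illegal-instruction trap.
#[derive(Debug, PartialEq)]
pub struct IllegalInstruction;

pub fn swizzle<Elm: SwizzleConstants>(
    rd: &mut [Elm],
    rs1: &[Elm],
    rs2: &[u16],
    vl: usize,
    destsubvl: usize,
    srcsubvl: usize,
) -> Result<(), IllegalInstruction> {
    const FIELD_SIZE: usize = 3;
    const FIELD_MASK: u64 = 0b111;
    for vindex in 0..vl {
        let selector = u64::from(rs2[vindex]);
        // Any bits set above the fields actually used are treated as illegal.
        if selector >> (FIELD_SIZE * destsubvl) != 0 {
            return Err(IllegalInstruction);
        }
        for i in 0..destsubvl {
            let sel_field = (selector >> (FIELD_SIZE * i)) & FIELD_MASK;
            let index = (sel_field & 0b11) as usize;
            let value = if sel_field & 0b100 == 0 {
                // Bit 2 clear: read from the source subvector, with a range check.
                if index >= srcsubvl {
                    return Err(IllegalInstruction);
                }
                rs1[vindex * srcsubvl + index]
            } else {
                // Bit 2 set: read one of the four per-type constants.
                Elm::CONSTANTS[index]
            };
            rd[vindex * destsubvl + i] = value;
        }
    }
    Ok(())
}
```

With `rs1 = [10, 20, 30, 40]`, `destsubvl = 3`, `srcsubvl = 4` and selector `0b000_101_010`, the three destination elements are rs1.z, the constant 1, and rs1.x.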
Comment 39 Jacob Lifshay 2019-10-06 09:23:09 BST
(In reply to Jacob Lifshay from comment #38)
> (In reply to Jacob Lifshay from comment #36)
> > (In reply to Luke Kenneth Casson Leighton from comment #35)
> > > i'm not able to say because i really do not understand how swizzle2 works.
> > > i've spent several days researching it and simply cannot find anything,
> > > not even based on the hints of swizzle2 being in LLVM IR.
> > > 
> > > can you please describe it in pseudocode in a simple loop?
> > 
> > sure.
> > 
> > https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fa181fd305c70bde0372e0d99d873685
> > 
> > swizzle2 rd, rs1, rs2, rs3
> 
> There's also the swizzle instruction:
> 
> > https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=17d9891a5ceae49402a948f2cbf29ba6
> 
> swizzle rd, rs1, rs2

well, won't bother to fix the bugs in both instructions' example code :) (inverted condition, if in wrong place, probably more)
Comment 40 Luke Kenneth Casson Leighton 2019-10-06 09:30:40 BST
(In reply to Jacob Lifshay from comment #38)

>             let src = if (sel_field & 0b100) == 0 {
>                 &rs1[(vindex * srcsubvl)..]
>             } else {
>                 SwizzleConstants::CONSTANTS
>             };

ok yes i see how that works.

predication can be pseudo-added by:

if (sel_field == 0b111) continue.

thus skipping anything where the selector field requests it.


in swizzle2, i think "ignore" (with some thought / augmentation)
can be used to select either of rs1 or rs3, skipping *either* src
*or* dest.  twin-predication in other words.

this would get rid of the need to have separate dest-sub-vl and
src-sub-vl.  "skip / ignore" would simply *automatically* not
overwrite (or read) src or dest based on the skip/ignore-parts
of rs2.

we discussed "reconstructing" twin-predication state restoration,
a few months back.

the reason why i thought it would be an extremely bad idea is because
of the length of VL.  looping desperately through VL (up to 64 bits),
examining the masks to reconstruct the start-point: really bad idea.

however, here, because it's applied to SUBVL, and SUBVL is only 2,3 or 4,
it's fine.  that can be done as a (slightly complex) single combinatorial
block, possibly even just a lookup table.
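
A minimal sketch of the "skip on 0b111" idea for a single subvector, assuming 3-bit selector fields packed LSB-first into a u16 and SUBVL of at most 4. The `CONSTANTS` table for u32 and the function name are hypothetical. Note that 0b111 takes the slot that would otherwise select a fourth constant.

```rust
/// Hypothetical constants for selector fields 0b100..=0b110.
const CONSTANTS: [u32; 3] = [0, 1, u32::MAX];

/// Single-subvector swizzle where a field of 0b111 leaves rd[i] unchanged.
pub fn swizzle_with_skip(rd: &mut [u32; 4], rs1: &[u32; 4], selector: u16, subvl: usize) {
    assert!(subvl <= 4);
    for i in 0..subvl {
        let sel_field = (selector >> (3 * i)) & 0b111;
        if sel_field == 0b111 {
            // "ignore": neither read a source nor overwrite the destination.
            continue;
        }
        let index = (sel_field & 0b11) as usize;
        rd[i] = if sel_field & 0b100 == 0 {
            rs1[index]
        } else {
            CONSTANTS[index]
        };
    }
}
```

For example, the selector `0b001_111_010_111` writes rs1.z into rd.y and rs1.y into rd.w while leaving rd.x and rd.z alone.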
Comment 41 Luke Kenneth Casson Leighton 2019-10-06 09:31:47 BST
(In reply to Jacob Lifshay from comment #39)

> well, won't bother to fix the bugs in both instructions' example code :)
> (inverted condition, if in wrong place, probably more)

yehyeh, it serves its purpose perfectly.  i'll do a (greatly-simplified)
version on the page, later, with just the core algorithm in as few lines
as possible.
Comment 42 Jacob Lifshay 2019-10-06 09:39:59 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> (In reply to Jacob Lifshay from comment #36)
> 
> > > can you please describe it in pseudocode in a simple loop?
> > 
> > sure.
> 
> ah superb. now i get it.
> 
> >
> >             let mut sel_field = selector >> (FIELD_SIZE * i);
> >             sel_field &= FIELD_MASK;
> >             let src = if (sel_field & 0b100) != 0 {

should have been == instead of !=

> 
> This is the bit where I was talking about "if sel_field != 0b111" (to
> represent "masked out" / "ignore").

yeah, it conflicts with rs3.w for swizzle2 and with constants needed for normalized integers for swizzle.

> 
> >                 rs1
> >             } else {
> >                 rs3
> >             };
> 
> After this if you have "if rs1 == x0 continue" then swizzle may be
> implemented as simply setting rs3 to x0

that only works for swizzle, not for fswizzle, unless you think having a special case for fp reg 0 is acceptable.

> destsubvl must *not* be a CSR, it can be a fmt3 subencoding.

yup, thinking of putting it in the bits left over in funct7 or funct3.

> 
> This does however leave out being able to do setting to 0, 1 or other
> constants. I wonder why other ISAs do not have the constants as part of the
> ISA.
> 
> OH hang on. If SUBVL is to be ignored on rs2 (special treatment) we could
> also hypothetically use (set) the predicate on rs2 as the rs1/rs3
> sub-element selector mask.

the cleaner way to do that is that rs2 in scalar mode is the same 12-bit selector for every subvector. if different selectors are needed for every subvector, then rs2 can be a u16 vector.

having it be a simple u16 vector is much better than trying to pack it in a u64, since the selector is likely to need to be calculated per-VL-index.

> 
> The only downside of that being that it is limited to 64 bits, so 64/SUBVL
> is 16 when SUBVL is 4, you can only cover up to 16 long VL.
> 
> More special treatment, limited range. Not looking attractive. 3 bits for
> the rs2 selector is much better.
> 
> L.
Comment 43 Jacob Lifshay 2019-10-06 09:51:18 BST
(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #38)
> 
> >             let src = if (sel_field & 0b100) == 0 {
> >                 &rs1[(vindex * srcsubvl)..]
> >             } else {
> >                 SwizzleConstants::CONSTANTS
> >             };
> 
> ok yes i see how that works.
> 
> predication can be pseudo-added by:
> 
> if (sel_field == 0b111) continue.

if we're going to do that, we really should increase the field size to 4 bits per element, since shuffle2 already uses them all (rs1 x, y, z, and w and rs3 x, y, z, and w)

though I am extremely disinclined to have something that sets the output subvl in a data-dependent way (basically the output type & complete layout), that seems like a giant mess of security vulnerabilities just waiting to happen.

also, what do you do when subvector 1 has 2 ignores, subvector 2 has 3 ignores, subvector 3 has 1 ignore, and so on?!

> 
> thus skipping anything where the selector field requests it.
> 
> 
> in swizzle2, i think "ignore" (with some thought / augmentation)
> can be used to select either of rs1 or rs3, skipping *either* src
> *or* dest.  twin-predication in other words.
> 
> this would get rid of the need to have separate dest-sub-vl and
> src-sub-vl.  "skip / ignore" would simply *automatically* not
> overwrite (or read) src or dest based on the skip/ignore-parts
> of rs2.
> 
> we discussed "reconstructing" twin-predication state restoration,
> a few months back.
> 
> the reason why i thought it would be an extremely bad idea is because
> of the length of VL.  looping desperately through VL (up to 64 bits),
> examining the masks to reconstruct the start-point: really bad idea.
> 
> however, here, because it's applied to SUBVL, and SUBVL is only 2,3 or 4,
> it's fine.  that can be done as a (slightly complex) single combinatorial
> block, possibly even just a lookup table.
Comment 44 Luke Kenneth Casson Leighton 2019-10-06 12:58:48 BST
(In reply to Jacob Lifshay from comment #43)
.
> > 
> > predication can be pseudo-added by:
> > 
> > if (sel_field == 0b111) continue.
> 
> if we're going to do that, we really should increase the field size to 4
> bits per element, since shuffle2 already uses them all (rs1 x, y, z, and w
> and rs3 x, y, z, and w)

Yes was just thinking that. Then shuffle could keep 3 bits for consts and xyzw and use the 4th bit for predication

> 
> though I am extremely disinclined to have something that sets the output
> subvl in a data-dependent way (basically the output type & complete layout),
> that seems like a giant mess of security vulnerabilities just waiting to
> happen.

Already sorted: the algorithm was designed and implemented successfully in spike, for twin predication, last year (albeit for VL, not SUBVL).

It is shown in the appendix pseudocode as well. The src idx and dest idx are incremented independently and BOTH will result in loop termination on reaching SUBVL.

> also, what do you do when subvector 1 has 2 ignores, subvector 2 has 3
> ignores, subvector 3 has 1 ignore, and so on?!

Stop the loop when either of the subindices reach SUBVL.

If the programmer fails to insert enough ignores to not "represent" differing SUBVLs, that is their lookout. They should have read the manual :)
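
A rough sketch of the loop being described, assuming one bit per element in each mask (bit 0 = element 0) and applied to a single subvector. It is not the spike code or the appendix pseudocode, only an illustration of independently advancing src and dest indices that stop when either reaches SUBVL. The function name is hypothetical.

```rust
/// Copy masked-in source elements to masked-in destination slots, in order.
/// Returns the number of elements copied.
pub fn twin_pred_copy(
    rd: &mut [u32],
    rs: &[u32],
    srcmask: u8,
    destmask: u8,
    subvl: usize,
) -> usize {
    assert!(subvl <= 8 && rd.len() >= subvl && rs.len() >= subvl);
    let (mut i, mut j) = (0usize, 0usize);
    let mut copied = 0;
    loop {
        // Advance the source index past masked-out elements.
        while i < subvl && (srcmask >> i) & 1 == 0 {
            i += 1;
        }
        // Advance the destination index past masked-out elements.
        while j < subvl && (destmask >> j) & 1 == 0 {
            j += 1;
        }
        // Either index reaching SUBVL terminates the loop.
        if i >= subvl || j >= subvl {
            break;
        }
        rd[j] = rs[i];
        i += 1;
        j += 1;
        copied += 1;
    }
    copied
}
```

With srcmask 0b0110 and destmask 0b1010 on a 4-element subvector, two elements are copied: rs.y to rd.y and rs.z to rd.w.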
Comment 45 Jacob Lifshay 2019-10-06 19:19:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #44)
> (In reply to Jacob Lifshay from comment #43)
> .
> > > 
> > > predication can be pseudo-added by:
> > > 
> > > if (sel_field == 0b111) continue.
> > 
> > if we're going to do that, we really should increase the field size to 4
> > bits per element, since shuffle2 already uses them all (rs1 x, y, z, and w
> > and rs3 x, y, z, and w)
> 
> Yes was just thinking that. Then shuffle could keep 3 bits for consts and
> xyzw and use the 4th bit for predication
> 
> > 
> > though I am extremely disinclined to have something that sets the output
> > subvl in a data-dependent way (basically the output type & complete layout),
> > that seems like a giant mess of security vulnerabilities just waiting to
> > happen.
> 
> Already sorted the algorithm was designed and implemented successfully in
> spike, for twin predication, last year (albeit for VL not SUBVL)
> 
> It is shown in the appendix pseudocode as well. The src idx and dest idx are
> incremented independently and BOTH will result in loop termination on
> reaching SUBVL.
> 
> > also, what do you do when subvector 1 has 2 ignores, subvector 2 has 3
> > ignores, subvector 3 has 1 ignore, and so on?!
> 
> Stop the loop when either of the subindices reach SUBVL.
> 
> If the programmer fails to insert enough ignores to not "represent"
> differing SUBVLs, that is their lookout. They should have read the manual :)

Do note that there isn't a counter on the src side, since swizzle allows random access to all src elements in a subvector, whereas twin predication depends on both src and dest elements being accessed in-order.

I think having it be "unchanged" would be a better name, since it isn't actually that similar to twin-predication, it's basically only predicating the write on each rd element.

We would still need a destsubvl field since srcsubvl is often a different value, so that can't be used.

swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 I-type funct3s for [f]swizzlei by having one funct3 shared between the int and fp versions for destsubvl 1 through 3, and a separate funct3 each for int and fp when destsubvl = 4:

+--------+-----------+----+-----------+----------+-------+-------+------+
| int/fp | DESTSUBVL | 31 | 30:29     | 28:20    | 19:15 | 14:12 | 11:7 |
+========+===========+====+===========+==========+=======+=======+======+
| int    | 1 to 3    | 0  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| fp     | 1 to 3    | 1  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| int    | 4         | selector[11:0]            | rs    | 001   | rd   |
+--------+-----------+---------------------------+-------+-------+------+
| fp     | 4         | selector[11:0]            | rs    | 010   | rd   |
+--------+-----------+---------------------------+-------+-------+------+

the rest could be encoded as follows:

+-----------+-------+-----------+-------+-------+-------+------+
|           | 31:27 | 26:25     | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+===========+=======+=======+=======+======+
| swizzle2  | rs3   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| swizzle   | rs1   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle2 | rs3   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle  | rs1   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+

note how for [f]swizzle, rs3 == rs1

so it uses 5 funct3 values overall, which is appropriate, since swizzle is probably right after muladd in usage in graphics shaders.
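
A small sketch of packing the register-form fields at the bit positions given in the second table. The major opcode (bits 6:0) and the mapping from DESTSUBVL 1..4 onto the 2-bit field are not specified above, so both are passed in as raw values here; the function names are hypothetical.

```rust
/// Pack [f]swizzle2 fields: rs3 31:27, DESTSUBVL 26:25, rs2 24:20,
/// rs1 19:15, funct3 14:12, rd 11:7. `opcode` (6:0) is an assumption
/// supplied by the caller.
pub fn encode_swizzle2(
    opcode: u32,
    funct3: u32,
    rd: u32,
    rs1: u32,
    rs2: u32,
    rs3: u32,
    destsubvl_field: u32,
) -> u32 {
    assert!(opcode < 128 && funct3 < 8 && destsubvl_field < 4);
    assert!(rd < 32 && rs1 < 32 && rs2 < 32 && rs3 < 32);
    (rs3 << 27)
        | (destsubvl_field << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
}

/// [f]swizzle is the same encoding with the rs3 field equal to rs1.
pub fn encode_swizzle(
    opcode: u32,
    funct3: u32,
    rd: u32,
    rs1: u32,
    rs2: u32,
    destsubvl_field: u32,
) -> u32 {
    encode_swizzle2(opcode, funct3, rd, rs1, rs2, rs1, destsubvl_field)
}
```

A decoder can then tell the two forms apart by comparing bits 31:27 with bits 19:15, with funct3 0b100 for the integer forms and 0b101 for the fp forms.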
Comment 46 Luke Kenneth Casson Leighton 2019-10-06 20:43:27 BST
Hi Jacob, do remember to trim context, especially in the bugtracker, pay extra attention.

Will write offlist then go back to sleep, leave it to you to decide whether to reply here.

Didn't realise quite how significant swizzle is, in retrospect from some GLSL files I glanced at, of course it is.

We need to adjust accordingly and give it a higher priority.

Even to the extent of adjusting SVP 64 bit format.

Can you check VBLOCK spec on swizzle table format to see if it is sufficient? Perhaps it needs to have an immediate form and a register form.

Darn. Relieved you got through to me how important this is.
Comment 47 Luke Kenneth Casson Leighton 2019-10-06 23:24:56 BST
(In reply to Jacob Lifshay from comment #45)
> (In reply to Luke Kenneth Casson Leighton from comment #44)
> > (In reply to Jacob Lifshay from comment #43)
> > .
> > > > 
> > > > predication can be pseudo-added by:
> > > > 
> > > > if (sel_field == 0b111) continue.
> > > 
> > > if we're going to do that, we really should increase the field size to 4
> > > bits per element, since shuffle2 already uses them all (rs1 x, y, z, and w
> > > and rs3 x, y, z, and w)
> > 
> > Yes was just thinking that. Then shuffle could keep 3 bits for consts and
> > xyzw and use the 4th bit for predication
> > 
> > > 
> > > though I am extremely disinclined to have something that sets the output
> > > subvl in a data-dependent way (basically the output type & complete layout),
> > > that seems like a giant mess of security vulnerabilities just waiting to
> > > happen.
> > 
> > Already sorted the algorithm was designed and implemented successfully in
> > spike, for twin predication, last year (albeit for VL not SUBVL)
> > 
> > It is shown in the appendix pseudocode as well. The src idx and dest idx are
> > incremented independently and BOTH will result in loop termination on
> > reaching SUBVL.
> > 
> > > also, what do you do when subvector 1 has 2 ignores, subvector 2 has 3
> > > ignores, subvector 3 has 1 ignore, and so on?!
> > 
> > Stop the loop when either of the subindices reach SUBVL.
> > 
> > If the programmer fails to insert enough ignores to not "represent"
> > differing SUBVLs, that is their lookout. They should have read the manual :)
> 
> Do note that there isn't a counter on the src side, since swizzle allows
> random access to all src elements in a subvector, whereas twin predication
> depends on both src and dest elements being accessed in-order.
> 

I need to write out the pseudocode to explain it. The random accessing comes *after* the inorder selection (including advancing the destcounter over "unchanged" dest items).

It's a little weird and obtuse.


> I think having it be "unchanged" would be a better name, since it isn't
> actually that similar to twin-predication, it's basically only predicating
> the write on each rd element.

Both predication on src and predication-on-dest-before-the-selection are needed.

For swizzle, one bit (4th bit) can be one of those, the other can be 0b111 in the other 3 bits.

For swizzle2 unfortunately and annoyingly if bit 3 is used as the rs1/rs3 selector we need *5* bits.

> 
> We would still need a destsubvl field since srcsubvl is often a different
> value, so that can't be used.

We'll work through why that isn't the case, in a different thread, either on or off list.

> 
> swizzlei would still need the 12-bit format due to not having enough
> immediate bits.

Yes. Very annoying. Don't have a good answer for that yet.


> 
> so it uses 5 funct3 values overall, which is appropriate, since swizzle is
> probably right after muladd in usage in graphics shaders.

These are 4 op, take up 50 to 80% of a major opcode. That's really, *really* high.

I wonder instead if we can fit some bits into VBLOCK or SVP64.

If these ops are that common and that important, trying to cram everything into OP32 is just asking for trouble.
Comment 48 Luke Kenneth Casson Leighton 2019-10-07 09:42:16 BST
https://salsa.debian.org/Kazan-team/pred-copy/blob/master/src/lib.rs

Ok reference lib for clarity.

The purpose of the exercise was to use the approach advocated for twin predication several months back: compute_index_from_other_index being that algorithm.  We can get away with using it even in 1 cycle after a context switch because SUBVL is limited to 4, whereas for VL the cost of a 64-bit-long analysis was just too high.

Mathematically the two indices may be symmetrically derived: if you know one, you know the other.

Additionally, the two VLs - only one of which is needed if you set the rule that it is the programmer's responsibility to correctly set the masks - are also connected through a two way mathematically symmetrical formula, involving POPCOUNT:

DESTVL = Max(Min(SRCVL, srcmask.count_ones()) - destmask.count_zeros(), 0)

The reason being that without the destmask skipping, the number of copied elements will be equal to the number of 1s in the srcmask.

However if any are requested to be skipped, that number drops by the number of zeros.

Ok, the formula above is not totally accurate, you actually have to go through the... oh! I know, it's real simple:

DESTVL = compute_index_from_other_index(SRCVL)

that's it. That's the correct formula, accurate under all circumstances.

Therefore we *do not* need all FOUR parameters and pieces of state: only TWO are needed: either DESTVL and DESTidx, or SRCVL and SRCidx, the other two can *always* be derived, including after a context switch.

This of course assumes that we have both a destmask and a srcmask, regardless of their names (or how we get the information)

So the question then becomes: why would we ignore this? Is the creation of the masks too costly? Do they have to be done at runtime? How many cycles would it cost to set them up?
Comment 49 Luke Kenneth Casson Leighton 2019-10-07 10:48:53 BST
Moved REMAP to bug #143
Comment 50 Luke Kenneth Casson Leighton 2019-10-08 13:29:37 BST
Yeah! Worked out how to use REMAP to do cross product!

It would be 2 instructions: one FMUL and one FMACSUB (just like in the PLX 3D paper) except that the REMAP Shapes are in-place!

SHAPE0: xdim=3, ydim=1, inv=on, offs=2
SHAPE1: xdim=3, ydim=1, offs=1

These will offset and invert the vectors, giving the sequence 021 and 120 which can be applied to the two vectors by redirection.  It should work fine either when SUBVL=3 or when VL=3*N.

If nothing else this can be microcoded to not need any extra special hardware.
Comment 51 Jacob Lifshay 2019-10-09 07:39:03 BST
I'll try to explain my reasoning behind why I think DESTSUBVL should not be calculated from the swizzle as well as why having special ignore/unchanged settings in the swizzle is a bad idea:

I think we should design the instruction set in such a way as to be fail-safe as much as possible (similar to safe code in Rust), such that whatever value gets passed into the instructions will not cause them to write outside the assigned destination registers or (unexpectedly to the compiler) not completely write the result registers.

Basically, the instructions won't silently cause undefined behavior (such as overwriting more registers than allocated). I consider causing a trap in error conditions as defined behavior, since the program can use that trap to call an error handler, whereas overwriting near-by registers is silent (in the sense that it doesn't trap) undefined behavior.

Therefore, since DESTSUBVL changes how many registers are written to, DESTSUBVL should come from a value that the compiler knows is constant.

the swizzle value is likely to (at some point) come from all the way across the program, where it's much more likely that there will be a bug/exploit causing the value to take on invalid values. Having all swizzle instructions check the swizzle value to ensure it's valid and has the right output size will greatly improve security/reduce crashes due to buggy code (because it either causes a trap and is found during development or just changes the result value, not the result shape)

VL doesn't really have the above problem since VL is the outermost array index (hence succeeding instructions won't read past the end) and VL will (almost) always be set using sv.setvl, which has a built-in bounds check to make sure VL doesn't get set to a value larger than the compiler-allocated size.

ignore/unchanged isn't needed since:

1. [f]swizzle2 suffices to construct values where different elements come from different input vectors, 3-input swizzle is overkill and, when needed, can be constructed from 2 [f]swizzle2 instructions in sequence (and so on for 4 or more inputs).

2. [f]swizzle2 is much more powerful than [f]swizzle with ignore/unchanged, since it can reorder/duplicate/ignore elements from both inputs, whereas [f]swizzle can only do that from one input.

3. having the same swizzle format between [f]swizzlei (which most likely only has 12 immediate bits) and [f]swizzle[2] makes the instruction set more orthogonal and maybe easier to implement

4. [f]swizzle fits neatly in the encoding space for [f]swizzle2 by using the case where rs3 == rs1, which wouldn't otherwise be a useful encoding anyway. (using rs3 == 0 has problems for fswizzle2 due to not allowing f0 as a register)

5. [f]swizzle[2] without ignore/unchanged can already produce all possible combinations of inputs with all possible (1 to 4) DESTSUBVL values, so ignore/unchanged is redundant.

6. Not needing a popcount (even though it's quite small) simplifies instruction decode
Comment 52 Jacob Lifshay 2019-10-09 08:23:17 BST
(In reply to Luke Kenneth Casson Leighton from comment #47)
> (In reply to Jacob Lifshay from comment #45)
> > so it uses 5 funct3 values overall, which is appropriate, since swizzle is
> > probably right after muladd in usage in graphics shaders.
> 
> These are 4 op, take up 50 to 80% of a major opcode. That's really, *really*
> high.

62.5% for the encodings proposed in comment #45,
25% if the immediate forms are left out.

Note that the above is ignoring the 71% unused space in the immediate field for [f]swizzlei due to the upper bits of the swizzle field always being zero for smaller DESTSUBVL values and DESTSUBVL == 4 not sharing funct3 values.

Taking that into account:
35.7% for the encodings proposed in comment #45

(I may have made a mistake in my calculations though)
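
One way to reproduce these figures, as a sketch. It assumes 8 funct3 values per major opcode, that the funct3 shared by DESTSUBVL 1 to 3 only uses 3, 6 or 9 selector bits for each of int and fp, and that the resulting unused fraction is then applied to all three immediate funct3 values. That last step is a guess at how 35.7% was reached, not something stated above.

```rust
/// Returns (with immediates, without immediates, unused immediate fraction,
/// adjusted total) as fractions of one major opcode.
pub fn funct3_usage() -> (f64, f64, f64, f64) {
    let with_imm = 5.0 / 8.0; // 5 of 8 funct3 values
    let without_imm = 2.0 / 8.0; // register forms only
    // Used 12-bit patterns in the shared funct3: int and fp, each with
    // 2^3 + 2^6 + 2^9 meaningful selectors for DESTSUBVL 1, 2 and 3.
    let used_patterns = 2 * ((1u32 << 3) + (1 << 6) + (1 << 9));
    let unused_imm = 1.0 - f64::from(used_patterns) / 4096.0;
    // Assumption: scale all three immediate funct3 values by the used fraction.
    let adjusted = without_imm + (3.0 / 8.0) * (1.0 - unused_imm);
    (with_imm, without_imm, unused_imm, adjusted)
}
```

Under those assumptions the four values come out at 0.625, 0.25, roughly 0.715 and roughly 0.357.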

> 
> I wonder instead if we can fit some bits into VBLOCK or SVP64.

Didn't check for VBLOCK, but I would be surprised if they won't fit in SVP64.

However, since swizzles are one of the few instruction types that require multiple non-trivial [1] SUBVL fields/values, I think having it encoded in the swizzle instruction itself [2] would be better.

1: I count instructions like load-gather/store-scatter as trivial, since all instruction arguments either use SUBVL (for src/dest reg) or SUBVL is always 1 (for address reg), so multiple SUBVL's don't need to be encoded

2: includes the prefix if the prefix is always required

> 
> If these ops are that common and that important, trying to cram everything
> into OP32 is just asking for trouble.

They can go in other opcode blocks. We could do something such as share the opcode with JAL (for example) and [f]swizzle[i/2] would only be valid inside VBLOCK or prefixed, where JAL isn't valid anyway.
Comment 53 Luke Kenneth Casson Leighton 2019-10-09 08:42:35 BST
read everything, still absorbing.  i see where you're going with the
"type-safety".

the immediate problem i can see with it is: predication (single or twin)
*already* makes the concept of bounds-checking runtime type-safety moot.
predication is *already* run-time-dependent, and developers (and
compilers) already have to "put up with" it, and make damn sure that they
get things right.

have to pack, ready for tomorrow, so will be a little distracted.

i do like [f]swizzle2 (rs1==rs3).  can you demonstrate *exact* equivalence
i.e. that src *and* dest may have elements arbitrarily "skipped" i.e. not
destroyed?

how is the following achieved?

   rd.yw = rs.zy

using twin-predication (twin "masks" - however they are called, i don't
care if they're named "unknown" or not) - it is dead-easy.  dest mask
equals 0b0011.

ah.

wait.

just listing that example, something just occurred to me.

you don't need twin-predication.  you need *single* predication, on the dest.

the reason is: the src-index-selection gives "appearance" of twin-predication.
with the dest having predication, all you do is: fill in *only two* src
swizzle indices.

*smacks-forehead*...


> 6. Not needing a popcount (even though it's quite small) simplifies
> instruction decode

it's not for decode, it's for exception-recovery after a context-switch.
it is absolutely the case that popcount would *not* be needed when
*starting* a newly-issued instruction because the offsets start at zero.

only when one of the offsets is not zero (returning from an exception)
does the other need to be recovered.

however this i believe is now moot as i realised twin-predication is
redundant.  can you confirm / check the logic/reasoning above?

will go over the rest (inline) again.
Comment 54 Luke Kenneth Casson Leighton 2019-10-09 08:54:43 BST
(In reply to Jacob Lifshay from comment #52)

> Taking that in to account:
> 35.7% for the encodings proposed in comment #45
> 
> (I may have made a mistake in my calculations though)

no problem.  we just have to keep an eye on it.

> > I wonder instead if we can fit some bits into VBLOCK or SVP64.
> 
> Didn't check for VBLOCK, but I would be surprised if they won't fit in SVP64.

because i didn't realise the importance of swizzle, the 16 bits are entirely
taken up with extending rd/rs1-3 to 7 bits, and with setting VL and VLMAX.

it'll need a redesign, to fit, where one bit indicates "swizzle-mode".
even there, it's only going to be possible to fit either 8-bit or 12-bit.
certainly not 16-bit (twin swizzle).

> However, since swizzles are one of the few instruction types that require
> multiple non-trivial [1] SUBVL fields/values, I think having it encoded in
> the swizzle instruction itself [2] would be better.

Midgard has *two* swizzle immediates for the ALU side.  hypothetically,
one could go in the instruction, another in the SVP64 block.

> 1: I count instructions like load-gather/store-scatter as trivial, since all
> instruction arguments either use SUBVL (for src/dest reg) or SUBVL is always
> 1 (for address reg), so multiple SUBVL's don't need to be encoded
> 
> 2: includes the prefix if the prefix is always required
> 
> > 
> > If these ops are that common and that important, trying to cram everything
> > into OP32 is just asking for trouble.
> 
> They can go in other opcode blocks. We could do something such as share the
> opcode with JAL (for example) and [f]swizzle[i/2] would only be valid inside
> VBLOCK or prefixed, where JAL isn't valid anyway.

are you considering the idea of adding special opcodes into the "reserved"
encodings of SVP48/64?  or...

oh, wait, yes, i see: if vectorisation/predication/something-else is triggered on a register being used by JAL (for example), it's taken to mean:

"JAL can't be vectorised, therefore you must mean swizzle".

where in SVP* that's easy to do (implicitly), but for VBLOCK we have to
be quite careful.  if the op does not have as many registers then the
"trigger" conditions might not activate, if the "shadowing" operation
needs a scalar reg where the "shadowed" op has a scalar.

JAL rs1 # scalar
SWIZZLE rs1 (scalar), rs2 (vector), rs3 (vector)

that's not going to work, because rs1 is a scalar, it won't trigger
the JAL to go "oh my rs1 is a vector they must have meant swizzle".
Comment 55 Jacob Lifshay 2019-10-09 09:09:59 BST
(In reply to Luke Kenneth Casson Leighton from comment #53)
> read everything, still absorbing.  i see where you're going with the
> "type-safety".
> 
> the immediate problem i can see with it is: predication (single or twin)
> *already* makes the concept of bounds-checking runtime type-safety moot.
> predication is *already* run-time-dependent, and developers (and
> compilers) already have to "put up with" it, and make damn sure that they
> get things right.

The difference with predication is that when the compiler enables it at all, it already knows it needs to have a valid value in dest before running the instruction. swizzle with ignore/unmodified didn't have that.

Also, predication doesn't allow an instruction to write past the end of the allocated registers, since the compiler always allocates for the full VL, whereas swizzle with ignore/unmodified could write past the end due to the compiler allocating for vec2 and the swizzle being vec4 (for example).

> 
> have to pack, ready for tomorrow, so will be a little distracted.

:)

> i do like [f]swizzle2 (rs1==rs3).  can you demonstrate *exact* equivalence
> i.e. that src *and* dest may have elements arbitrarily "skipped" i.e. not
> destroyed?
> 
> how is the following achieved?
> 
>    rd.yw = rs.zy

I'm assuming the types are both vec4.

That's achieved using fswizzle2:

// renamed variables to src and dest to not conflict
fswizzle2 rd=dest, rs1=src, rs2=swizzle, rs3=dest

where swizzle selects [rs3.x, rs1.z, rs3.z, rs1.y]

Having a separate instruction to load the swizzle constant is acceptable since writing to a swizzle is less common than reading from a swizzle.

in this case fswizzle2 can be combined with a move and/or a previous swizzled write assuming the previous write writes rd.xz

> 
> using twin-predication (twin "masks" - however they are called, i don't
> care if they're named "unknown" or not) - it is dead-easy.  dest mask
> equals 0b0011.

got it, though it would actually be 0b1010, since bits are counted from LSB: 0bwzyx
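
A sketch comparing the two approaches for `rd.yw = rs.zy` on vec4 values. It assumes 3-bit selector fields packed LSB-first, with bit 2 clear selecting rs1 and bit 2 set selecting rs3, and a destination mask in 0bwzyx order. Both function names are hypothetical.

```rust
/// swizzle2 on one vec4: each 3-bit field picks an element of rs1 (bit 2
/// clear) or rs3 (bit 2 set).
pub fn swizzle2_vec4(rs1: [f32; 4], rs3: [f32; 4], selector: u16) -> [f32; 4] {
    let mut rd = [0.0; 4];
    for i in 0..4 {
        let field = (selector >> (3 * i)) & 0b111;
        let src = if field & 0b100 == 0 { &rs1 } else { &rs3 };
        rd[i] = src[(field & 0b11) as usize];
    }
    rd
}

/// Single-source swizzle with a destination write mask (0bwzyx).
pub fn masked_swizzle(rd: &mut [f32; 4], rs: [f32; 4], sel: [usize; 4], mask: u8) {
    for i in 0..4 {
        if (mask >> i) & 1 != 0 {
            rd[i] = rs[sel[i]];
        }
    }
}
```

Passing the old destination as rs3 with selector `0b001_110_010_100`, which selects [rs3.x, rs1.z, rs3.z, rs1.y], gives the same result as `masked_swizzle` with mask 0b1010 and source indices z for y and y for w.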

> 
> ah.
> 
> wait.
> 
> just listing that example, something just occurred to me.
> 
> you don't need twin-predication.  you need *single* predication, on the dest.
> 
> the reason is: the src-index-selection gives "appearance" of
> twin-predication.
> with the dest having predication, all you do is: fill in *only two* src
> swizzle indices.
> 
> *smacks-forehead*...

it all falls into place... :P
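
A minimal sketch of the single-predication idea above, for the rd.yw = rs.zy case. It assumes (this is not settled anywhere in the thread) that the source indices are packed in x-to-w order, one per set bit of the dest mask; the function name is purely illustrative.

```python
def swizzle_dest_masked(dest, src, dest_mask, src_indices):
    """Copy src lanes into the dest lanes whose mask bit is set.

    dest_mask is counted from the LSB (0bwzyx). src_indices holds one
    source lane index per set mask bit, in x-to-w order (an assumption).
    Lanes whose mask bit is clear keep their old dest value.
    """
    result = list(dest)
    remaining = iter(src_indices)
    for lane in range(4):
        if (dest_mask >> lane) & 1:
            result[lane] = src[next(remaining)]
    return result


# rd.yw = rs.zy: mask 0b1010 enables y and w; the indices select z, then y.
rd = swizzle_dest_masked(["dx", "dy", "dz", "dw"],
                         ["sx", "sy", "sz", "sw"],
                         0b1010, [2, 1])
print(rd)  # ['dx', 'sz', 'dz', 'sy']
```

Only two source indices are supplied, and x and z of the destination are left untouched, which is the "skipped, not destroyed" behaviour asked about.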

> 
> 
> > 6. Not needing a popcount (even though it's quite small) simplifies
> > instruction decode
> 
> it's not for decode, it's for exception-recovery after a context-switch.
> it is absolutely the case that popcount would *not* be needed when
> *starting* a newly-issued instruction because the offsets start at zero.

Popcount would be needed at decode since we want to be able to generate more than 1 element operation per clock cycle (otherwise our nice 128-bit-wide ALU will be mostly unused)

> 
> only when one of the offsets is not zero (returning from an exception)
> does the other need to be recovered.
> 
> however this i believe is now moot as i realised twin-predication is
> redundant.  can you confirm / check the logic/reasoning above?

Seems good to me.

> will go over the rest (inline) again.
Comment 56 Luke Kenneth Casson Leighton 2019-10-09 09:16:10 BST
ahhh i GOT it:

https://gitlab.freedesktop.org/panfrost/mali-isa-docs/blob/master/Midgard.md

Load/store words

13-16: mask
17-24: swizzle

ALU ops:

15-22: input 1 swizzle
if "input 2 inline constant" set:
    28-35: inline const 0-7
else:
    28-35: input 2 swizzle
40-47: write mask
    2 bits for each output when 32-bit, 1 bit when 16-bit

so they have 2 possible swizzles for inputs, and, crucially, the write
mask will be what says which of the input bits get ignored.


therefore as long as we can add a dest-mask, SUBVL *is* the DESTSUBVL.
DESTSUBVL *is* SUBVL.

a src mask is indeed unnecessary.


SUBVL-predication mode idea
----------------------

one possible solution here is to (somehow) jam in a mode which says
"hey, you know we said SUBVL doesn't have predication, and that the
predicate bit applies to *all* of SUBVL? well, um, for this operation,
it does".

what that would do is limit VL to a maximum of around 16, but i'm fine
with that.


8-bit-swizzle idea
-------------

going back to 8-bit on swizzle rather than 12-bit, the remaining 4 bits
can be used as a predicate SUBVL immediate DESTMASK.

that solves the issue of whether it's run-time safe (in the immediate
case).

also, with DESTSUBVL being redundant (directly equivalent *to* DESTMASK)
that gives two bits back [useable for other opcodes]

i'd really like to know why other GPUs only have 8-bit swizzle,
rather than having constants.  is setting from constants that common
that they really *need* special treatment?


SVP for 1 swizzle, opcode for the other idea
---------------------------------------

we definitely do not have room to fit 2 swizzles (16 bit) and it seems
from Midgard that they're certainly needed.

VBLOCK covers the multi swizzle scenario fine, but SVP does not.
if one swizzle immediate can be jammed into the SVP64 prefix,
the other can be in the operation.  if the SVP64 swizzle prefix is not
used, the "rules" state that the same swizzle (in the opcode)
applies to *both* operands.
Comment 57 Jacob Lifshay 2019-10-09 09:21:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #54)
> (In reply to Jacob Lifshay from comment #52)
> > > I wonder instead if we can fit some bits into VBLOCK or SVP64.
> > 
> > Didn't check for VBLOCK, but I would be surprised if they won't fit in SVP64.
> 
> because i didn't realise the importance of swizzle, the 16 bits are entirely
> taken up with extending rd/rs1-3 to 7 bits, and with setting VL and VLMAX.
> 
> it'll need a redesign, to fit, where one bit indicates "swizzle-mode".
> even there, it's only going to be possible to fit either 8-bit or 12-bit.
> certainly not 16-bit (twin swizzle).

I was thinking that we'd only use separate swizzle instructions. If we wanted, they could be macro-op fused with preceding/succeeding ALU instructions.

> > However, since swizzles are one of the few instruction types that require
> > multiple non-trivial [1] SUBVL fields/values, I think having it encoded in
> > the swizzle instruction itself [2] would be better.
> 
> Midgard has *two* swizzle immediates for the ALU side.  hypothetically,
> one could go in the instruction, another in the SVP64 block.

Do note that we'd only need the 2 bits for another SUBVL field, not the 13/14 bits (12 for swizzle + SUBVL) for a whole swizzle immediate.

> > They can go in other opcode blocks. We could do something such as share the
> > opcode with JAL (for example) and [f]swizzle[i/2] would only be valid inside
> > VBLOCK or prefixed, where JAL isn't valid anyway.
> 
> are you considering the idea of adding special opcodes into the "reserved"
> encodings of SVP48/64?

Both: using reserved encodings in SVP48/64, and inside VBLOCK.

>  or...
> 
> oh, wait, yes, i see: if vectorisation/predication/something-else is
> triggered on a register being used by JAL (for example), it's taken to mean:
> 
> "JAL can't be vectorised, therefore you must mean swizzle".
> 
> where in SVP* that's easy to do (implicitly), but for VBLOCK we have to
> be quite careful.  if the op does not have as many registers then the
> "trigger" conditions might not activate, if the "shadowing" operation
> needs a scalar reg where the "shadowed" op has a scalar.
> 
> JAL rs1 # scalar
> SWIZZLE rs1 (scalar), rs2 (vector), rs3 (vector)
> 
> that's not going to work, because rs1 is a scalar, it won't trigger
> the JAL to go "oh my rs1 is a vector they must have meant swizzle".

it isn't triggered by which register is used, but by the fact that jumps aren't valid inside VBLOCK. This requires VBLOCK to encode exactly which instructions are inside it and which are not (which I recall it doing).
Comment 58 Luke Kenneth Casson Leighton 2019-10-09 09:31:30 BST
> it doesn't trigger because of being the right register, but because 
> jumps aren't valid inside VBLOCK. 

oh, duh :)  oh wait...

> This requires VBLOCK encoding exactly which instructions are inside 
> and not (which I recall it doing).

yes.

ok, JAL _is_ valid... as long as it's to *outside* the memory
area covered by the VBLOCK.  jumping to the *start* of the VBLOCK
is permitted.

likewise for BEQ (etc)
Comment 59 Jacob Lifshay 2019-10-09 10:19:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #56)
> SUBVL-predication mode idea
> ----------------------
> 
> one possible solution here is to (somehow) jam in a mode which says
> "hey, you know we said SUBVL doesn't have predication, and that the
> predicate bit applies to *all* of SUBVL? well, um, for this operation,
> it does".
> 
> what that would do is limit VL to a maximum of around 16, but i'm fine
> with that.

That would require calculating the expanded predicate a lot, since most performance-critical instructions will be predicated.

Not sure if the extra instructions are worth it.
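
A rough sketch of the expansion step in question, assuming predicates live in a 64-bit integer register (an assumption; that width is where a limit of about 16 would come from at SUBVL=4). The helper name is hypothetical.

```python
PRED_BITS = 64  # assumed predicate register width


def expand_predicate(pred, vl, subvl):
    """Replicate each per-subvector predicate bit across its SUBVL elements."""
    expanded = 0
    for i in range(vl):
        if (pred >> i) & 1:
            expanded |= ((1 << subvl) - 1) << (i * subvl)
    return expanded


# With one bit per element instead of one per subvector, VL is capped:
max_vl = PRED_BITS // 4  # SUBVL=4
print(max_vl)  # 16

# A normal predicate 0b101 over three vec4s becomes 12 element bits.
print(hex(expand_predicate(0b101, vl=3, subvl=4)))  # 0xf0f
```

Every ordinarily-predicated instruction would need this expansion done somewhere before the per-element mode could be used, which is the overhead being weighed here.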

Also, if we ever want to make a high-performance GPU, VL being limited to 16 would be a major drawback (AMDGPU uses 32 (only on Navi) or 64, I think NVIDIA uses 32 though could be wrong)

> 
> 8-bit-swizzle idea
> -------------
> 
> going back to 8-bit on swizzle rather than 12-bit, the remaining 4 bits
> can be used as a predicate SUBVL immediate DESTMASK.
> 
> that solves the issue of whether it's run-time safe (in the immediate
> case).
> 
> also, with DESTSUBVL being redundant (directly equivalent *to* DESTMASK)
> that gives two bits back [useable for other opcodes]

actually only 1, since SUBVL <= 3 have SUBVL in what would be the upper bits of the immediate

> 
> i'd really like to know why other GPUs only have 8-bit swizzle,
> rather than having constants.  is setting from constants that common
> that they really *need* special treatment?

setting from constants is quite common, though a large part of the reason why I picked 3 bits per element is to support [f]swizzle2.

> 
> 
> SVP for 1 swizzle, opcode for the other idea
> ---------------------------------------
> 
> we definitely do not have room to fit 2 swizzles (16 bit) and it seems
> from Midgard that they're certainly needed.

I'm not so sure 2 swizzles per instruction are needed... seems excessive. I'm guessing Midgard just did that because they had the room.

> VBLOCK covers the multi swizzle scenario fine, but SVP does not.
> if one swizzle immediate can be jammed into the SVP64 prefix,
> the other can be in the operation.  if the SVP64 swizzle prefix is not
> used, the "rules" state that the same swizzle (in the opcode)
> applies to *both* operands.

I still think swizzles should be separate operations and macro-op fusion can be used if merging them with an ALU op is needed.
Comment 60 Jacob Lifshay 2019-10-09 10:31:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #58)
> > it doesn't trigger because of being the right register, but because 
> > jumps aren't valid inside VBLOCK. 
> 
> oh, duh :)  oh wait...
> 
> > This requires VBLOCK encoding exactly which instructions are inside 
> > and not (which I recall it doing).
> 
> yes.
> 
> ok, JAL _is_ valid... as long as it's to *outside* the memory
> area covered by the VBLOCK.  jumping to the *start* of the VBLOCK
> is permitted.
> 
> likewise for BEQ (etc)

If BEQ is used as a vector compare-to-mask, it might be worthwhile to support branches as long as they branch to the next instruction. Different immediates could still be used to encode other instructions though.

On the other hand, I think having a separate compare-to-mask instruction (instead of repurposing BEQ) would be worthwhile, since we could use R-type instructions instead of B-type, allowing the compiler to greatly simplify register allocation. Also, that would avoid needlessly triggering the branch predictor, allowing for possible power savings.

We could additionally have a RMW on rd for compare, allowing simpler mask generation:

rd |= rs1 < rs2 where rs1 and rs2 are vectors

we could also support &=, ^=, and plain assignment.

Some of those ops are redundant with predicated compare, can weed those out later.
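
As a sketch of what the read-modify-write compare could look like (the function name and the mode strings are illustrative only, not a proposed encoding):

```python
import operator

# How the freshly computed mask is merged into the old value of rd.
COMBINE = {
    "=": lambda old, new: new,
    "|=": operator.or_,
    "&=": operator.and_,
    "^=": operator.xor,
}


def compare_lt_to_mask(rd, rs1, rs2, mode="|="):
    """Set bit i of the new mask when rs1[i] < rs2[i], then merge into rd."""
    new = 0
    for i, (a, b) in enumerate(zip(rs1, rs2)):
        if a < b:
            new |= 1 << i
    return COMBINE[mode](rd, new)


rs1 = [1, 5, 3, 7]
rs2 = [2, 4, 6, 6]
mask = compare_lt_to_mask(0b1000, rs1, rs2, "|=")
print(bin(mask))  # 0b1101
```

Elements 0 and 2 compare true, giving 0b0101, which is then ORed into the existing 0b1000.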
Comment 61 Luke Kenneth Casson Leighton 2019-10-09 11:43:35 BST
(In reply to Jacob Lifshay from comment #59)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > SUBVL-predication mode idea
> > ----------------------
> > 
> > one possible solution here is to (somehow) jam in a mode which says
> > "hey, you know we said SUBVL doesn't have predication, and that the
> > predicate bit applies to *all* of SUBVL? well, um, for this operation,
> > it does".
>
> That would require calculating the expanded predicate a lot, since most
> performance-critical instructions will be predicated.

predicate bits go directly into the "shadow" side of the element-issue
Dependency Matrices.  there's no computation involved... oh wait, you
mean when setting them up?  yes if it's in a register... hmmm ok scratch
that idea.

> 
> Not sure if the extra instructions are worth it.

agreed.

> Also, if we ever want to make a high-performance GPU, VL being limited to 16
> would be a major drawback (AMDGPU uses 32 (only on Navi) or 64, I think
> NVIDIA uses 32 though could be wrong)

yehyeh.  ok scratch that one.
 
> > 8-bit-swizzle idea
> > -------------
> > 
> > going back to 8-bit on swizzle rather than 12-bit, the remaining 4 bits
> > can be used as a predicate SUBVL immediate DESTMASK.
> > 
> > that solves the issue of whether it's run-time safe (in the immediate
> > case).
> > 
> > also, with DESTSUBVL being redundant (directly equivalent *to* DESTMASK)
> > that gives two bits back [useable for other opcodes]
> 
> actually only 1, since SUBVL <= 3 have SUBVL in what would be the upper bits
> of the immediate
> 
> > 
> > i'd really like to know why other GPUs only have 8-bit swizzle,
> > rather than having constants.  is setting from constants that common
> > that they really *need* special treatment?
> 
> setting from constants is quite common, though a large part of the reason
> why I picked 3 bits per element is to support [f]swizzle2.

that doesn't make any immediate/logical sense to me.

> > 
> > 
> > SVP for 1 swizzle, opcode for the other idea
> > ---------------------------------------
> > 
> > we definitely do not have room to fit 2 swizzles (16 bit) and it seems
> > from Midgard that they're certainly needed.
> 
> I not so sure 2 swizzles per instruction are needed... seems excessive. I'm
> guessing Midgard just did that because they had the room.

i'd like to work out why before we make any firm decisions.

> > VBLOCK covers the multi swizzle scenario fine, but SVP does not.
> > if one swizzle immediate can be jammed into the SVP64 prefix,
> > the other can be in the operation.  if the SVP64 swizzle prefix is not
> > used, the "rules" state that the same swizzle (in the opcode)
> > applies to *both* operands.
> 
> I still think swizzles should be separate operations and macro-op fusion can
> be used if merging them with an ALU op is needed.

VBLOCK has room reserved for swizzles, so they can be applied to
operations.  the entire principle on which VBLOCK is based is to use
VBLOCK to "compactify" SVP encodings which would otherwise take up
significantly more space.

thus - one very important thing - concepts added to SVP *must* be
mappable (without loss of functionality) *to* VBLOCK in an easily
translatable fashion.  no exceptions.

here however we're looking at adding OP32 *instructions*, so it is new
territory.

thinking about it a little: i'd really like it to be possible to replace
(translate) these [new] opcodes with a VBLOCK applied to a straight C.MV.
if that's even practical.  and if it actually saves space.

usually it does (for groups of larger than 3 SVP instructions).

argh this is complicated! no wonder nobody in software/hardware libre
has designed a GPU before! :)
Comment 62 Luke Kenneth Casson Leighton 2019-10-09 13:52:49 BST
ok i have another idea: instead of the immediates being 12 bits, they're
8, *but*, there is *one* separate bit (in funct3?) which indicates
whether the swizzles are indices xyzw or constants {0,1,1/2,pi}.

if destmask is fitted into the top 4 bits, that gives the ability of
swizzlei to reach the upper elements:

    fmv.swizzlei rd.mask{0bw0x0} = rs.zw

or:

    fmv.swizzlei rd.mask{0bwz00} = {pi, 1.0}

it does mean however that four funct3s are needed (where with what
you came up with, jacob, there are three for swizzlei).

however what is gained instead is the ability for swizzlei to set
upper elements (beyond the straight sequence).

that in turn saves having to use an extra register - rs2 - in swizzle.
which would still need to be set up (loaded with an immediate).
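
A sketch of how such an immediate might be packed and interpreted. The field layout (destmask in bits 8-11, 2-bit selectors packed from bit 0, one selector per set mask bit), the order of the constants, and treating the funct3 bit as a boolean argument are all assumptions made for illustration.

```python
import math

# The {0, 1, 1/2, pi} set mentioned above; the index order is assumed.
CONSTANTS = [0.0, 1.0, 0.5, math.pi]


def encode_swizzlei(dest_mask, selectors):
    """Pack a 4-bit destmask (top) and up to four 2-bit selectors (bottom)."""
    imm = (dest_mask & 0xF) << 8
    for pos, sel in enumerate(selectors):
        imm |= (sel & 0b11) << (2 * pos)
    return imm


def fmv_swizzlei(rd, rs, imm, use_constants):
    """Write only the lanes enabled by the destmask; other lanes are kept."""
    dest_mask = (imm >> 8) & 0xF
    result = list(rd)
    pos = 0
    for lane in range(4):
        if (dest_mask >> lane) & 1:
            sel = (imm >> (2 * pos)) & 0b11
            result[lane] = CONSTANTS[sel] if use_constants else rs[sel]
            pos += 1
    return result


old_rd = [10.0, 20.0, 30.0, 40.0]
rs = [1.0, 2.0, 3.0, 4.0]

# Upper two elements set from constants: rd.zw = {pi, 1.0}
imm_const = encode_swizzlei(0b1100, [3, 1])
print(fmv_swizzlei(old_rd, rs, imm_const, use_constants=True))

# Upper two elements set from the source: rd.zw = rs.zw
imm_index = encode_swizzlei(0b1100, [2, 3])
print(fmv_swizzlei(old_rd, rs, imm_index, use_constants=False))
```

In both cases x and y of rd are preserved, so no second register holding a swizzle pattern is needed to reach the upper elements.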
Comment 63 Luke Kenneth Casson Leighton 2019-10-09 14:13:20 BST
(In reply to Jacob Lifshay from comment #60)

> On the other hand, I think having a separate compare-to-mask instruction
> (instead of repurposing BEQ) would be worthwhile, 

remember, the rule is: no new opcodes.  the reasons why we're breaking that
for vector ops is due to the sub-vector-element interactions.  mv.x because
it doesn't exist (as a scalar op).  swizzle the case is also clear.

adding a compare-to-mask is not clear at all... unless...

> since we could use R-type
> instructions instead of B-type, allowing the compiler to greatly simplify
> register allocation. 

hmmm... ok the thing that's a mess is: it has to work well as a scalar
instruction, as well.  if the [proposed] int-compare is made identical
to the FP-compare (stores a 1 in the LSB), and the vectorised-versions
likewise store a predicate mask, that would be ok.  will raise a new
bugreport about the idea.

> Also, that would avoid needlessly triggering the branch
> predictor, allowing for possible power savings.

that can be done by [outside of VBLOCK] jumping to the instruction immediately
following the branch.  so if the last instruction inside a VBLOCK is the BEQ
(etc) and the branch... you get the idea.  if the branch location *is* where
the PC will be on the very next cycle anyway, the branch predictor can be
disabled.
Comment 64 Luke Kenneth Casson Leighton 2019-10-09 17:38:16 BST
I worked out why Midgard has a dest mask and allows swizzle on both src operands: it allows FULL arbitrary ALU selection, covering all possible permutations of dest xyzw and src1 src2 xyzw.

The reason is that the dest needs only to know which elements are to be skipped ie the sequence is maintained, and *both* src element permutations may be reordered to match that sequence.

Also, again, no DESTSUBVL, that is implicit in the number of bits in the destmask and defines how many elements are copied from the two srces.

The swizzles in Midgard and the destmask are all immediates, not in regs (solving compiler error concerns), and the total number of bits is 8+8+4 = 20.
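
A minimal model of that arrangement, assuming a lane-by-lane reading (each dest lane has its own 2-bit selector in each swizzle, lane x in the low bits) and a 4-bit write mask as counted above. It sketches the idea only and is not taken from the Midgard documentation.

```python
import operator


def decode_swizzle(sw):
    """Split an 8-bit swizzle into four 2-bit lane selectors, lane x first."""
    return [(sw >> (2 * lane)) & 0b11 for lane in range(4)]


def alu_vec4(op, dest, src1, src2, swizzle1, swizzle2, write_mask):
    """Apply op lane by lane; lanes with a clear mask bit keep the old dest."""
    sel1 = decode_swizzle(swizzle1)
    sel2 = decode_swizzle(swizzle2)
    result = list(dest)
    for lane in range(4):
        if (write_mask >> lane) & 1:
            result[lane] = op(src1[sel1[lane]], src2[sel2[lane]])
    return result


# dest.xz = src1.xz + src2.wy
out = alu_vec4(operator.add,
               [0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40],
               swizzle1=0xE4,   # xyzw (identity)
               swizzle2=0x1B,   # wzyx (reversed)
               write_mask=0b0101)
print(out)  # [41, 0, 23, 0]
```

The three immediates together account for the 8 + 8 + 4 = 20 bits.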

No wonder GPU ISAs are enormous.

Well, apart from destmask, VBLOCK can handle this case. Or, destmask has to be specified as a full 8 bit swizzle, not a 4 bit predicate-like bitmask.

That deals with the ALUs, although it leaves SVP with a "hole" (no equivalent in SVP to VBLOCK swizzle capability, at the moment).

If we go with the MV.swizzle OP32 format that you designed, that gives a way to cut down on opcodes used, by reordering data once and be done with it. It does mean having copies of data, which uses up regs.

However if registers are ever under pressure, VBLOCK can be used to refer to the data in-place, just like in Midgard.
Comment 65 Luke Kenneth Casson Leighton 2019-10-10 08:04:38 BST
Added bug #144 to cover the int CMP idea.
Comment 66 Luke Kenneth Casson Leighton 2019-10-10 08:09:48 BST
https://libre-riscv.org/simple_v_extension/specification/mv.x/ 

Added 4 funct3 version of swizzlei to see what it looks like.

Would prefer we went with that due to the reduced instr count advantage.

What was the connection between swizzle2, 3 bits per element, etc. in comment #61? Confused there :)
Comment 67 Jacob Lifshay 2019-10-10 08:47:47 BST
(In reply to Luke Kenneth Casson Leighton from comment #61)
> (In reply to Jacob Lifshay from comment #59)
> > > 
> > > i'd really like to know why other GPUs only have 8-bit swizzle,
> > > rather than having constants.  is setting from constants that common
> > > that they really *need* special treatment?
> > 
> > setting from constants is quite common, though a large part of the reason
> > why I picked 3 bits per element is to support [f]swizzle2.
> 
> that doesn't make any immediate/logical sense to me.

3 bits per element are needed in order to address all source elements for [f]swizzle2:

000: rs1.x
001: rs1.y
010: rs1.z
011: rs1.w
100: rs3.x
101: rs3.y
110: rs3.z
111: rs3.w

Since we're already using 3 bits per element, and since Vulkan allows swizzles like [x, y, 0.0, 1.0], why not use the same 3-bits-per-element format for [f]swizzle[i] by using constants instead of rs3's elements?

It additionally provides a single instruction to load some commonly used FP constants, which is useful.
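
A sketch of the 3-bit selector table above applied to the earlier rd.yw = rs.zy case. Packing lane x into the low bits of the 12-bit value is an assumption, and the helper names are illustrative.

```python
def encode_swizzle2(selectors):
    """Pack four 3-bit selectors into a 12-bit value, lane x in the low bits."""
    imm = 0
    for lane, sel in enumerate(selectors):
        imm |= (sel & 0b111) << (3 * lane)
    return imm


def fswizzle2(rs1, rs3, swizzle):
    """Selectors 0-3 pick rs1.xyzw and 4-7 pick rs3.xyzw, as in the table."""
    combined = list(rs1) + list(rs3)
    return [combined[(swizzle >> (3 * lane)) & 0b111] for lane in range(4)]


src = ["sx", "sy", "sz", "sw"]
dest = ["dx", "dy", "dz", "dw"]

# rd.yw = rs.zy with rs3 == rd: [rs3.x, rs1.z, rs3.z, rs1.y]
swizzle = encode_swizzle2([0b100, 0b010, 0b110, 0b001])
print(hex(swizzle))                   # 0x394
print(fswizzle2(src, dest, swizzle))  # ['dx', 'sz', 'dz', 'sy']
```

Passing the old destination as rs3 is what lets the unselected lanes survive without a separate mask.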

> 
> > > 
> > > 
> > > SVP for 1 swizzle, opcode for the other idea
> > > ---------------------------------------
> > > 
> > > we definitely do not have room to fit 2 swizzles (16 bit) and it seems
> > > from Midgard that they're certainly needed.
> > 
> > I not so sure 2 swizzles per instruction are needed... seems excessive. I'm
> > guessing Midgard just did that because they had the room.
> 
> i'd like to work out why before we make any firm decisions.
> 
> > > VBLOCK covers the multi swizzle scenario fine, but SVP does not.
> > > if one swizzle immediate can be jammed into the SVP64 prefix,
> > > the other can be in the operation.  if the SVP64 swizzle prefix is not
> > > used, the "rules" state that the same swizzle (in the opcode)
> > > applies to *both* operands.
> > 
> > I still think swizzles should be separate operations and macro-op fusion can
> > be used if merging them with an ALU op is needed.
> 
> VBLOCK has room reserved for swizzles, so they can be applied to
> operations.  the entire principle on which VBLOCK is based is to use
> VBLOCK to "compactify" SVP encodings which would otherwise take up
> significantly more space.

I still think having separate swizzle opcodes is the way to go; macro-op fusion will definitely work for combining swizzles with nearby operations. Additionally, this allows many swizzles in a single VBLOCK without the need to restart it in order to specify the swizzle again.

> 
> thus - one very important thing - concepts added to SVP *must* be
> mappable (without loss of functionality) *to* VBLOCK in an easily
> translatable fashion.  no exceptions.

I would actually argue that we should do it the other way around: all operations possible using VBLOCK must be able to be translated into an equivalent sequence using SVP and/or normal instructions.

That way, similar to RVC, the compiler can generate SVP instructions and then the assembler/final-instruction-selection can combine them into VBLOCK where practical in order to save space.

Note that this doesn't exclude SVP being completely mappable to VBLOCK, it just changes the canonical instruction form to SVP, which is honestly less complicated to parse and generate, due to not needing to group the instructions.

> 
> here however we're looking at adding OP32 *instructions*, so it is new
> territory.
> 
> thinking about it a little: i'd really like it to be possible to replace
> (translate) these [new] opcodes with a VBLOCK applied to a straight C.MV.
> if that's even practical.  and if it actually saves space.
> 
> usually it does (for groups of larger than 3 SVP instructions).

Swizzles require enough immediate data that they won't really fit in 16 bits anyway; using a new VBLOCK prefix just shifts the extra bits to another place rather than reducing them, making everything more complicated.

swizzles can be done by allocating new 32-bit instructions (which both VBLOCK and SVP support).

reusing JAL (and other similar opcodes) works when inside SVP or VBLOCK, due to JAL otherwise being unused/invalid in that context. This gives us much more space to work with.

> 
> argh this is complicated! no wonder nobody in software/hardware libre
> has designed a GPU before! :)

Well, there have been a few designs... :P
Comment 68 Jacob Lifshay 2019-10-10 09:15:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #62)
> ok i have another idea: instead of the immediates being 12 bits, they're
> 8, *but*, there is *one* separate bit (in funct3?) which indicates
> whether the swizzles are indices xyzw or constants {0,1,1/2,pi}.
> 
> if destmask is fitted into the top 4 bits, that gives the ability of
> swizzlei to reach the upper elements:
> 
>     fmv.swizzlei rd.mask{0bw0x0} = rs.zw
> 
> or:
> 
>     fmv.swizzlei rd.mask{0bwz00} = {pi, 1.0}

the swizzles that use constants (other than just for a load-immediate) would usually be of the form [x, z, 0.0, 1.0] where both input elements and constants are specified at the same time.

> 
> it does mean however that four funct3s are needed (where with what
> you came up with, jacob, there are three for swizzlei).
> 
> however what is gained instead is the ability for swizzlei to set
> upper elements (beyond the straight sequence).
> 
> that in turn saves having to use an extra register - rs2 - in swizzle.
> which would still need to be set up (loaded with an immediate).

That's true.
alternatively [f]swizzlei could be left alone, and separate [f]swizzle2i instructions be added:

swizzle2i rd, rs1, swizzle
is equivalent to
li rtemp, swizzle
swizzle2 rd=rd, rs1=rs1, rs2=rtemp, rs3=rd

Swizzle would be defined to read all of each subvector before writing the corresponding subvector to rd. Tiny implementations can implement that by using a temporary register to store the read elements, still allowing operating on one element at a time. Swizzle can detect all traps at the beginning, before writing anything, allowing subvector swizzles to not have to worry about the temporary register needing to be accessible for context-switching.
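
A small illustration of why the read-everything-first rule matters when rd overlaps a source. This is a simplified single-source model with illustrative names, not the proposed instruction itself.

```python
def swizzle_in_place_naive(regs, base, selectors):
    """Element at a time with no buffering: later reads see earlier writes."""
    for lane, sel in enumerate(selectors):
        regs[base + lane] = regs[base + sel]


def swizzle_in_place_buffered(regs, base, selectors):
    """Read the whole subvector into a temporary, then write it back."""
    temp = [regs[base + sel] for sel in selectors]
    for lane, value in enumerate(temp):
        regs[base + lane] = value


yxwz = [1, 0, 3, 2]  # swap x with y and z with w

naive = ["a", "b", "c", "d"]
swizzle_in_place_naive(naive, 0, yxwz)
print(naive)     # ['b', 'b', 'd', 'd']

buffered = ["a", "b", "c", "d"]
swizzle_in_place_buffered(buffered, 0, yxwz)
print(buffered)  # ['b', 'a', 'd', 'c']
```

The buffered version still touches one element per step, so it suits a tiny implementation; it just needs four elements of temporary storage per subvector.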

If there's enough free space in the [f]swizzlei encoding, some of it could be shared with [f]swizzle2i to save opcode space.

If we reuse JAL and other similar opcodes (SYSTEM? JALR?), we should have enough spare opcode space to be able to allocate that much space to swizzle due to its importance.
Comment 69 Jacob Lifshay 2019-10-10 09:25:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #64)
> I worked out why Midgard has a dest mask and allows swizzle on both src
> operands: it allows FULL arbitrary ALU selection, covering all possible
> permutations of dest xyzw and src1 src2 xyzw.
> 
> The reason is that the dest needs only to know which elements are to be
> skipped ie the sequence is maintained, and *both* src element permutations
> may be reordered to match that sequence.
> 
> Also, again, no DESTSUBVL, that is implicit in the number of bits in the
> destmask and defines how many elements are copied from the two srces.
> 
> The swizzles in Midgard and the destmask are all immediates, not in regs
> (solving compiler error concerns), and the total number of bits is 8+8+4 =
> 20.
> 
> No wonder GPU ISAs are enormous.

They generally don't care about code size...

> 
> Well, apart from destmask, VBLOCK can handle this case. Or, destmask has to
> be specified as a full 8 bit swizzle, not a 4 bit predicate-like bitmask.
> 
> That deals with the ALUs, although it leaves SVP with a "hole" (no
> equivalent in SVP to VBLOCK swizzle capability, at the moment).

I would be surprised if we couldn't cram a simplified swizzle field (8 bits for swizzle, 1-2 bits to allow specifying which src reg is swizzled) into SVP64, if not we can always define a SVP80 format, though at that point it may be better to just use macro-op fusion. :)

> 
> If we go with the MV.swizzle OP32 format that you designed, that gives a way
> to cut down on opcodes used, by reordering data once and be done with it. It
> does mean having copies of data, which uses up regs.
> 
> However if registers are ever under pressure, VBLOCK can be used to refer to
> the data in-place, just like in Midgard.
Comment 70 Luke Kenneth Casson Leighton 2019-10-10 09:43:28 BST
(In reply to Jacob Lifshay from comment #69)

> > No wonder GPU ISAs are enormous.
> 
> They generally don't care about code size...

1k executables, handling 4x or 8x data workloads, I am not surprised.


> > That deals with the ALUs, although it leaves SVP with a "hole" (no
> > equivalent in SVP to VBLOCK swizzle capability, at the moment).
> 
> I would be surprised if we couldn't cram a simplified swizzle field (8 bits
> for swizzle, 1-2 bits to allow specifying which src reg is swizzled) into
> SVP64, if not we can always define a SVP80 format, though at that point it
> may be better to just use macro-op fusion. :)

:) and I use the ennntiire 80+ opcode space for VBLOCK. yes, i knoooow... 

Would you like to take a look at redesigning SVP64 to add 1 bit so that it can flip between swizzle mode and VLEN mode? The simplest option is to lose the extension of reg nums to 7 bits, which frees up as much as 4 bits.

Or, if only some bits (not all) are discarded (particularly in FR4) then one clear bit can be freed up in the column marked "reserved" and that can be turned into a swizzle-or-VL selector.

Again to cram in as much as possible I am kiinda liking the "4 x 2bit swizzle plus 1 bit to say if it is consts or xyzw" thing. It would also allow space for a destmask.

No room for 2x swizzles sadly, unless a mode for specifying regs is also added. Can I leave that with you to think through?

Oh i wrote the swizzle consts up in VBLOCK, arbitrarily picked some consts, to see what it looks like, do let me know what you think.
Comment 71 Luke Kenneth Casson Leighton 2019-10-10 09:52:41 BST
(In reply to Jacob Lifshay from comment #68)

> the swizzles that use constants (other than just for a load-immediate) would
> usually be of the form [x, z, 0.0, 1.0] where both input elements and
> constants are specified at the same time.

With the destmask (as an immed) allowing selection of which to write as consts and which as values, would it be acceptable to have that as 2 instructions?

Don't know the answer, there.

> > that in turn saves having to use an extra register - rs2 - in swizzle.
> > which would still need to be set up (loaded with an immediate).
> 
> That's true.
> alternatively [f]swizzlei could be left alone, and separate [f]swizzle2i
> instructions be added:
> 
> swizzle2i rd, rs1, swizzle
> is equivalent to
> li rtemp, swizzle
> swizzle2 rd=rd, rs1=rs1, rs2=rtemp, rs3=rd

Which would save on a li.

> Swizzle would be defined to read all of each subvector before writing the
> corresponding subvector to rd, tiny implementations can implement that by
> using a temporary register to store the read elements, still allowing
> operating on one element at a time. swizzle can detect all traps at the
> beginning, before writing anything, allowing subvector swizzles to not have
> to worry about the temporary register needing to be accessible for
> context-switching.
> 
> If there's enough free space in the [f]swizzlei encoding, some of it could
> be shared with [f]swizzle2i to save opcode space.
> 
> If we reuse JAL and other similar opcodes (SYSTEM? JALR?), we should have
> enough spare opcode space to be able to allocate that much space to swizzle
> due to its importance.

SYSTEM.

We have *one* major opcode left (CUSTOM-0) and I'd like it to be for the vector ops.

So it is ok to use either the entirety of CUSTOM-0 and use RVV major for vector ops, or the other way round.

If we need 8x fmt3 then CUSTOM-0 should be for swizzles.

Even mv.x could remain in the RVV major op.
Comment 72 Luke Kenneth Casson Leighton 2019-10-10 12:14:22 BST
I cannot think why it took me 18 months to come up with the following idea:

Instead of having separate entries per register in the VBLOCK CAM formats, where each register is 5 precious bits wasted, have 1 reg, 2 reg, 3 reg and 4 reg variants that permit the detection to be triggered by *one* reg (eg rd) and apply to *multiple* regs, rd, rs1, rs2, rs3.

This should save a huge amount of space.

In the case of swizzle it means being able to put the entire swizzle group into under 24 bits, 29 including the reg it is to be "activated" on.

Lots to think about on the flight.