Bug 558 - gcc SV intrinsics concept
Summary: gcc SV intrinsics concept
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Alexandre Oliva
URL:
Depends on:
Blocks: 213
Reported: 2020-12-27 04:37 GMT by Luke Kenneth Casson Leighton
Modified: 2021-01-20 23:40 GMT
CC: 2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-12-27 04:37:34 GMT
https://www.reddit.com/r/RISCV/comments/ieedmd/gnu_toolchain_with_riscv_vector_intrinsic_support/

after seeing the above the idea is to consider a push/pop context that is associated with variables.  this pretty much creates verbatim svp64 prefixes.
Comment 1 Luke Kenneth Casson Leighton 2020-12-28 14:54:04 GMT
alexandre, ok so i have an idea here: if i put in an NLnet grant request for adding bitmanip and cryptographic intrinsics to gcc would you be interested in adding SV intrinsic support at "just the level above svp64"?

the idea would be to not actually add vector datatypes as was done for RVV (except vec2/3/4), instead to "mark" variables as "SV Vectorised".  and also to mark them as "saturated".

when these get used, svp64 prefixes are automatically added.

an array of uint64_t for example would automatically result in a "Vector" EXTRA2/3 mark.

an intrinsic "Push Vectorisation Context" would contain the MAXVL. Pop removes it.

the only actual intrinsic functions needed would be to support things like the crypto primitives and anything unusual that doesn't cleanly map to C (e.g. the set-before-first opcode)


the rest of the grant request would be for linux kernel module support.

a *separate* grant would be for hardware.

what do you think?
Comment 2 Jacob Lifshay 2020-12-28 20:41:28 GMT
I think the intrinsics should be designed differently and share their syntax with the clang version:

I designed the following based on the LLVM IR llvm.vp.* intrinsics, which will most likely be used for some of SV in LLVM:
https://llvm.org/docs/LangRef.html#vector-predication-intrinsics

add vector types using whatever __attribute__ magic you prefer (can reuse the vector_size attribute if preferred):
template<typename Base, std::size_t MAXVL, std::size_t SUBVL>
using Vec = Base[MAXVL][SUBVL] __attribute__((magic...));

Base is limited to [u]int8/16/32/64_t, pointers, __fp16, float, double, and __bf16.

Through compiler magic (or being instead defined as a struct), Vec acts like a struct in that you can assign it, pass it by value, it doesn't decay to pointer types in function arguments, etc. It has the same in-memory layout as the above array with no padding and alignof(Base) alignment.

Now you can create a max-4 element vector with floatx3 subvectors by writing Vec<float, 4, 3>

Like the currently existing vector_size attribute, you can just use the attribute to convert the array type to an SV vector type; you don't need it to be a template or match the above definition.

E.g. for C, you could write:
typedef float floatx3xmax10[10][3] __attribute__((magic...));

and now floatx3xmax10 is an SV vector with maxvl=10 and subvl=3 and an element type of float.

there are also typedefs:
typedef size_t vl_t;
typedef uint64_t mask_t;

it is legal to convert between size_t and vl_t; setvl instructions will be inserted by __sv_add and friends if needed.

All operations other than assignment, parameter passing, and indexing are taken care of by built-in functions:

// returns the computed VL
vl_t __sv_setvl(size_t vl, size_t maxvl); // maxvl must be a compile-time constant

all computation functions take vl as a parameter. it is Undefined Behavior if vl > MAXVL; vl == 0 is legal. all computed vectors have uninitialized contents for elements > vl unless otherwise specified.

Vec<Base, MAXVL, SUBVL> __sv_add(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_sub(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_mul(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_muladd(Vec<Base, MAXVL, SUBVL> factor1, Vec<Base, MAXVL, SUBVL> factor2, Vec<Base, MAXVL, SUBVL> term, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_div(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_mod(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_and(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_or(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_xor(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_shl(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
Vec<Base, MAXVL, SUBVL> __sv_shr(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
...fill more alu ops in
Vec<Base, MAXVL, SUBVL> __sv_saturating_add(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
...fill more alu ops in
mask_t __sv_compare_eq(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_ne(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_gt(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_lt(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_ge(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_le(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);

// floating-point compares with opposite results on NaNs
mask_t __sv_compare_eq_unordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_ne_ordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_gt_unordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_lt_unordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_ge_unordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);
mask_t __sv_compare_le_unordered(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);

// __sv_merge_tail returns a vector with elements <= vl and mask bit set
// copied from head, and elements > vl and/or mask bit clear copied from
// tail. this is done by copying/generating head into the registers holding
// tail. This will usually compile to zero additional instructions because
// it can be merged with the instruction computing head.
Vec<Base, MAXVL, SUBVL> __sv_merge_tail(Vec<Base, MAXVL, SUBVL> head, Vec<Base, MAXVL, SUBVL> tail, vl_t vl, mask_t mask);

// this can usually be merged with following instructions by using scalar instruction arguments
Vec<Base, MAXVL, SUBVL> __sv_splat(Vec<Base, 1, SUBVL> subvec, vl_t vl, mask_t mask);
Vec<Base, MAXVL, 1> __sv_splat(Base element, vl_t vl, mask_t mask);

// can usually be merged with other instructions
Vec<Base, MAXVL, SUBVL> __sv_twin_pred(Vec<Base, MAXVL, SUBVL> src, vl_t vl, mask_t src_mask, mask_t dest_mask);

// swizzle0-3 are compile-time constants
Vec<Base, MAXVL, 1> __sv_swizzle(Vec<Base, MAXVL, SRC_SUBVL> src, vl_t vl, mask_t mask, int swizzle0);
Vec<Base, MAXVL, 2> __sv_swizzle(Vec<Base, MAXVL, SRC_SUBVL> src, vl_t vl, mask_t mask, int swizzle0, int swizzle1);
Vec<Base, MAXVL, 3> __sv_swizzle(Vec<Base, MAXVL, SRC_SUBVL> src, vl_t vl, mask_t mask, int swizzle0, int swizzle1, int swizzle2);
Vec<Base, MAXVL, 4> __sv_swizzle(Vec<Base, MAXVL, SRC_SUBVL> src, vl_t vl, mask_t mask, int swizzle0, int swizzle1, int swizzle2, int swizzle3);

todo: add load/stores/mv.x/etc.
Comment 3 Jacob Lifshay 2020-12-28 20:46:15 GMT
(In reply to Jacob Lifshay from comment #2)
> All operations other than assignment, parameter passing, and indexing are
> taken care of by built-in functions:

I meant the above to apply to Vec<...>, not to mask_t or vl_t; mask_t and vl_t are just plain typedefs.
Comment 5 Luke Kenneth Casson Leighton 2020-12-28 21:07:32 GMT
(In reply to Jacob Lifshay from comment #2)
> I think the intrinsics should be designed differently and share their syntax
> with the clang version:

what you propose is a massive amount of work, and is based on the (equally-as-harmful-as-SIMD) principle of "one vector instruction, one intrinsic".

given the huge amounts of time involved it also makes it impractical to apply for a grant.

also whilst that's fine for llvm, which is going to have a huge amount of work done by other teams (robin and simon amongst them), and be well-suited to SPIR-V, the intrinsics added for gcc rvv take a different approach: barely above assembler-level.

the sheer overwhelming quantity of intrinsics that you propose is precisely and exactly what i do not see as being a good idea: it's the "one vector instruction, one intrinsic" harmful principle, which spirals out of control.

additionally, gcc works on a completely different design principle from llvm, which makes the llvm intrinsics concept harder to apply... because the infrastructure isn't there.

SV is an abstraction layer that is supposed to be "easy" to insert into hardware, and simulators.  binutils also turns out to be (reasonably) straightforward as well.

up until i saw the rvv gcc work i was thinking that compilers would be excluded from that list.

what i'd like to see is a "just-above-assembler" level that does *not* add massive numbers of intrinsics: instead takes context from associated variables and overloads scalar "add" or "mul" to simply be... vector "add" and vector "mul", because it reflects how SV works.

done.

and the stunning thing is: there's not even any need to change the rest of the compiler (except to add the svp64 prefix).  the *standard scalar add* is inherently vectorised by virtue of the context.

no intrinsic vector mul.

no intrinsic vector add

no intrinsic scalar-vector mul

no intrinsic scalar-vector add

no intrinsic vector-scalar mul

no intrinsic vector-scalar add

you see where that's going?  total nightmare.  literally hundreds, getting on for *thousands* of intrinsics.

no.

have the *variables* be marked as scalar or vector.  assume that the *developer* is intelligent enough to mark them correctly.

where this gets hair-raising is when vec2/3/4 and swizzle is introduced.  there i believe it may be solved with a union / struct.  not sure yet.
Comment 6 Luke Kenneth Casson Leighton 2020-12-28 21:15:31 GMT
example:

mask_t __sv_compare_ge(Vec<Base, MAXVL, SUBVL> lhs, Vec<Base, MAXVL, SUBVL> rhs, vl_t vl, mask_t mask);

can be replaced with

* mark lhs variable as vectorised / scalar
* mark rhs variable as vectorised / scalar
* PUSH context MAXVL, mask
* <<<< STANDARD GCC COMPARE OPERATION (lhs, rhs) >>>>
* POP context

absolutely zero modification of the standard gcc compare ge operation.

dead-simple creation of an svp64 prefix wrapper around the compare_ge operation.

the idea is to do *minimal* work at the strategic point to aid and assist people to get away from having to write just above the assembly level, yet still be able to take advantage of SV "Vector Abstraction"

it becomes the developer's responsibility to keep track of which variables have been marked as vectorised... *without* creating sv_vector_uint64_t, sv_vector_uint8_t etc. etc. etc. etc. etc. etc.


the scope and timescale is expected to be of the order of 8-10 weeks to proof-of-concept completion, not to put in a proposal that requires thousands of intrinsics that takes 6-8 months to write.
Comment 7 Jacob Lifshay 2020-12-28 21:40:13 GMT
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #2)
> > I think the intrinsics should be designed differently and share their syntax
> > with the clang version:
> 
> what you propose is a massive amount of work, and is based on the
> (equally-as-harmful-as-SIMD) principle of "one vector instruction, one
> intrinsic".

No matter how you express it in C/C++, it has to eventually translate to something very similar to the intrinsics I proposed. That's just how compilers work.

> given the huge amounts of time involved it also makes it impractical to
> apply for a grant.

I'd assume gcc allows intrinsics to be dynamically created (or by Python/other compile-time code generation):
sprintf(name, "__sv_%s%s", sat ? "saturating_" : "", op_name);
gcc_define_intrinsic(name, ...);
which means that all the intrinsics can be generated in a 100-200 line nested loop.

Assuming GCC is similar to every optimizing compiler I've worked on (mostly LLVM), I'm pretty sure that the IR would need instructions like those intrinsics no matter how it's represented in C/C++, so, obviously, it'll be less work to not have to do an additional translation step in the frontend.

The hard part that would be required for either your or my proposal, luke, is the instruction selection stage, and there it is done using pattern matching. the new patterns are mostly copy/paste, and/or can be generated by a few-hundred-to-thousand-line Python loop when gcc is compiled.

> have the *variables* be marked as scalar or vector.  assume that the
> *developer* is intelligent enough to mark them correctly.

The problem is that your approach is highly likely to be waay more work in the compiler, since the whole frontend has to be modified to add the new variable kinds, the implicit SV state, etc. The approach I proposed maps almost directly to the IR.

Also, having the *types* be vector/scalar instead of *variables* enables building much nicer abstractions over them using C++ and templates. If the *variables* are marked as vector/scalar, then they can't be easily abstracted over, preventing many use cases at design-time.
Comment 8 Jacob Lifshay 2020-12-28 22:00:27 GMT
(In reply to Luke Kenneth Casson Leighton from comment #5)
> and the stunning thing is: there's not even any need to change the rest of
> the compiler (except to add the svp64 prefix).  the *standard scalar add* is
> inherently vectorised by virtue of the context.

I'm pretty sure it is waay more complex than that, since the compiler has to accurately represent all code, and the easiest way to accurately represent new operations (which is what all SV ops are) is by using intrinsics, since they look just like opaque function calls to most of the compiler, so it won't try to optimize them in a way that isn't valid with SV.

> no intrinsic vector mul.
> 
> no intrinsic vector add
> 
> no intrinsic scalar-vector mul

I wasn't proposing having scalar-vector ops since those can be represented by __sv_mul(__sv_splat(lhs, ...), rhs, ...) and almost trivially pattern-matched at instruction selection time into the scalar-vector instructions.
Comment 9 Alexandre Oliva 2020-12-28 22:18:55 GMT
There's nothing "just" about getting GCC to turn scalar insns into
vector ones, I'm afraid.

Register allocation is largely driven by insn definitions and their
requirements, and the definitions carry machine modes requirements that
establish data size and what kind of data it is, which establish whether
the data fits at all in a register file, and how many registers of that
file are needed to hold an object of that machine mode.

The prefix also alters insn length, that influences computations about
branch distances and constant pool placement, and scheduling (units,
latency) is very significantly affected by the fact that vector-prefixed
insns are actually multiple insns issued in sequence.


Though GCC vectorizers can turn certain loops and sequences of
instructions into vectors, they largely rely on the available vector
modes.

It doesn't seem unreasonable to define all of the available combinations
of vector lengths and component modes as vector machine modes, and to
define the vector versions of scalar insns as template insns,
parameterized with the lengths and modes over each operand.

That's quite some work, though not particularly challenging.

What I'm not sure about how to model is the vector length: though maxvl
is a compile time constant, vl is dynamic, and it may vary; it needs a
register of its own, for us to be able to represent setting it up, and
modifying it as a side effect, but it has to somehow be constrained to
the compiler's notion of what the maxvl is for the insn, since that is
what guides register allocation.


There is a possibility of introducing vector-prefixed variants
mechanically, as modified versions of existing scalar insns, using
machinery similar to the way conditional insns are introduced on ARM,
but I'd have to look into that to see whether it's really viable.


Intrinsics can give direct access to features that the vectorizer can't
(yet?) introduce on its own; I suppose masks and twin predicates might
be missing in general, though I'd have to look at how conditionals are
dealt with in the vectorizer to tell for sure.

Introducing intrinsics that map more or less directly to the
corresponding versions of the vector insn templates, including
the extra prefix-only operands, is probably not unreasonable.  It
is quite some work but, again, not particularly challenging, and
quite possibly automatable.


My suggested approach is to start out with one scalar insn and a handful
of vector lengths, say 32-bit integer add with wrap-around, and try to
get them used on vectors, as intrinsics and through the vectorizer, and
from that try to estimate the amount of effort to cover more
possibilities.
Comment 10 Luke Kenneth Casson Leighton 2020-12-28 22:21:04 GMT
remember, jacob: SIMD, RVV, SVE2 and AVX512 compiler intrinsics fundamentally reflect the underlying concepts of the ISA they map to.

with 188 instructions in RVV the RVV compilers will require 188x{elwidth permutations} minimum intrinsics to support the ISA assembly code.

the "tagging" of SV requires the *tag* concept to be recognised, and consequently a first alpha level prototype can get away with:

* vector tagging
* setvl intrinsic
* svp64 prefix context

err... that's it.

how can we get away with that? because there *are* no vector instructions: only prefixing.
Comment 11 Jacob Lifshay 2020-12-28 22:32:56 GMT
(In reply to Luke Kenneth Casson Leighton from comment #10)
> remenber, jacob: SIMD, RVV, SVE2 and AVX512 compiler intrinsics
> fundamentally reflect the underlying concepts of the ISA they map to.
> 
> with 188 instructions in RVV the RVV compilers will require 180x{elwidth
> permutations} minimum intrinsics to support the ISA assembly code.
> 
> the "tagging" of SV requires the *tag* concept to be recognised, and
> consequently a first alpha level prototype can get away with:
> 
> * vector tagging
> * setvl intrinsic
> * svp64 prefix context
> 
> err... that's it.
> 
> how can we get away with that? because there *are* no vector instructions:
> only prefixing.

that's all fine and good at the assembly/machine-code level, but that totally won't work throughout the rest of the compiler, as explained by Alexandre and me previously.
Comment 12 Jacob Lifshay 2020-12-28 22:40:24 GMT
Note that I proposed a different C/C++ attribute than vector_size since it needs to support subvectors, and because I'm not sure how much work it is to relax the current gcc requirement that vectors must be power-of-2 sizes. If that's a front-end-only requirement, then we could probably just reuse the vector_size attribute. If that's required in the GCC IR, then we should probably create our own attribute since that'll be less work.

Alex, I (and probably Luke too) don't need the automatic loop vectorizer to work for the initial implementation, so that can be left for later if desired.
Comment 13 Alexandre Oliva 2020-12-28 23:03:50 GMT
I don't really know whether power-of-two sizes are a requirement for the general infrastructure, rather than something that follows from the hardware architectures that define their vector machine modes that way.  It will take some experimentation.

As for the built-in vectorizer, it's not like it requires effort to set things up so that it can do its job, AFAIK it just looks at what's available and attempts to use it, so I don't expect there will be effort involved in making it work.
Comment 14 Jacob Lifshay 2020-12-28 23:11:58 GMT
(In reply to Alexandre Oliva from comment #13)
> I don't really know whether power-of-two sizes are a requirement for the
> general infrastructure, rather than something that follows from the hardware
> architectures that define their vector machine modes that way.  It will take
> some experimentation.

I'd guess someone is already working on removing power-of-2 requirements in order to support ARM SVE and RISC-V V.

> As for the built-in vectorizer, it's not like it requires effort to set
> things up so that it can do its job, AFAIK it just looks at what's available
> and attempts to use it, so I don't expect there will be effort involved in
> making it work.

ok, sounds good!
Comment 15 Luke Kenneth Casson Leighton 2020-12-28 23:18:01 GMT
(In reply to Alexandre Oliva from comment #9)
> There's nothing "just" about getting GCC to turn scalar insns into
> vector ones, I'm afraid.

ok.  so to explain: that is not what i am proposing.

not in an explicit way.

bear in mind that SV is effectively a built-in macro for-loop.  thus one potential approach (not one i am recommending) is to literally chuck out a batch of add r3, add r4, add r5, add r6 instructions from gcc *as scalars* and have a post-analysis phase spot the patterns and vectorise them.

i do not recommend doing this: the only reason i mention it is so as to grok the concept of the hardware macro for-loop better.


alexandre can you take a look at the example given in the reddit page? let me find it.

> #include <riscv_vector.h>
> #include <stdio.h>
>
> void vec_add_rvv(int *a, int *b, int *c, size_t n) {
>   size_t vl;
>   vint32m2_t va, vb, vc;
>   for (;vl = vsetvl_e32m2 (n);n -= vl) {
>     vb = vle32_v_i32m2 (b);
>     vc = vle32_v_i32m2 (c);
>     va = vadd_vv_i32m2 (vb, vc);
>     vse32_v_i32m2 (a, va);
>     a += vl;
>     b += vl;
>     c += vl;
>   }
> }

this is the level i'd like to see supported.  bruce describes it as "barely above machine code" level.

the compiler *does not* do actual vectorisation.

the compiler *does not* know anything about VL.

vsetvl_xxx are intrinsics that literally get converted, verbatim, to a single assembly instruction.  gcc has been taught to recognise that vl is a size_t returned from this "function"



except, due to the nature of SV it is instead:

#include <sv_vector.h>
#include <stdio.h>

void vec_add_svv(int *a, int *b, int *c, size_t n) {
   size_t vl;
   __attribute__((sv_vector)) uint32_t *va, *vb, *vc;
   PUSH_SV_CONTEXT(MAXVL=8)
   for (;vl = vsetvl_e32m2 (n);n -= vl) {
     vb = b;
     vc = c;
     // this issues an svp64 prefixed add
     *va = *vb + *vc; // looks scalar: isn't
     // this issues an svp64 prefixed mv
     *a = *va; // again: looks scalar.
     // these really are scalar
     // because they are not vector intrinsics
     a += vl;
     b += vl;
     c += vl;
   }
   POP_SV_CONTEXT()
 }

> Register allocation is largely driven by insn definitions and their
> requirements, 

in the above example it is MAXVL which tells gcc, when the add and the mv are performed, that va, vb and vc have each had four 64-bit registers allocated to them.

why four? because MAXVL=8 and va-vc have been declared as uint32_t.  that means elwidth=32 and consequently two elements fit into 64 bits, so MAXVL=8 takes up 4 regs.

> and the definitions carry machine modes requirements that
> establish data size and what kind of data it is, which establish whether
> the data fits at all in a register file, and how many registers of that
> file are needed to hold an object of that machine mode.
> 
> The prefix also alters insn length, that influences computations about
> branch distances and constant pool placement, and scheduling (units,
> latency) is very significantly affected by the fact that vector-prefixed
> insns are actually multiple insns issued in sequence.

ok this, the instruction length alteration and the register allocation, yes, this i get needs doing.

> 
> Though GCC vectorizers can turn certain loops and sequences of
> instructions into vectors, they largely rely on the available vector
> modes.

right.  i do *not* recommend going down the autovectorisation route or "teaching" gcc about SV vectorisation at this point.  this will be months of work.

take a look at how the RVV gcc intrinsics work: i'm pretty certain that you'll find that they're a type of cheat, based effectively on function calls and data types (hence the #include) that have little actual "depth"
 
> It doesn't seem unreasonable to define all of the available combinations
> of vector lengths and component modes as vector machine modes, and to
> define the vector versions of scalar insns as template insns,
> parameterized with the lengths and modes over each operand.

later.  not now.  that's phase 2 (requiring many man-months)

what i would like to see happen here is a similar "cheat", where the register allocation is multiplied up by MAXVL when taken from the current Vector Stack Context, yet aside from that the actual "modification" to gcc is absolute bare minimum.
 
> That's quite some work, though not particularly challenging.
> 
> What I'm not sure about how to model is the vector length: though maxvl
> is a compile time constant, vl is dynamic, and it may vary; it needs a
> register of its own, for us to be able to represent setting it up,

yes.  if you look at how it's done in the RVV gcc patches, we need pretty much exactly the same thing, and i do mean exactly.

if the rvv gcc setvl code cannot be near-verbatim copied there is something wrong.


> and
> modifying it as a side effect, but it has to somehow be constrained to
> the compiler's notion of what the maxvl is for the insn, since that is
> what guides register allocation.

no, it is definitively MAXVL that specifies the register *allocation*, whilst it is VL that determines precisely how many of those registers actually got modified.

that number is most emphatically *not* - at this point, at this phase - gcc's "problem"

*phase 2* which will involve autovectorisation is where VL *will* be gcc's problem, because VL (and MAXVL) will be entirely hidden from the program writer.

this is *not* phase 2.

> 
> There is a possibility of introducing vector-prefixed variants
> mechanically, as modified versions of existing scalar insns, using
> machinery similar to the way conditional insns are introduced on ARM,
> but I'd have to look into that to see whether it's really viable.

a better lead would be the x86 "REX" prefix tagging that jacob mentioned: how x86 turned 32-bit regs into 64-bit regs with a "tag".

> Intrinsics can give direct access to features that the vectorizer can't
> (yet?) introduce on its own;

to be absolutely clear: i am *not* proposing modification of gcc's vectorizer or getting to that phase in *any* way right now.

this is *specifically* an absolute bare minimum modification to gcc that is just above assembly level.


> I suppose masks and twin predicates might
> be missing in general, though I'd have to look at how conditionals are
> dealt with in the vectorizer to tell for sure.

masks would be via the PUSH_SV_CONTEXT system, and via __attribute__((mask)).

a variable would be marked as being a predicate: the PUSH_SV_CONTEXT would name that variable as a src or dest mask.

register allocation would pick r3, r10 or r30 (for int predication) and would be responsible for pushing old contents into other regs.

other than that it would be the programmer's responsibility to think through the consequences of the SV_CONTEXT having had twin predication applied.

 
> Introducing intrinsics that map more or less directly to the
> corresponding versions of vector insns templates is probably, including
> the extra prefix-only operands, is not unreasonable.  It is quite some
> work but, again, not particularly challenging, and quite possibly
> automatable.

and something to do for a stage 2 grant proposal.

> 
> My suggested approach is to start out with one scalar insn and a handful
> of vector lengths, say 32-bit integer add with wrap-around, and try to
> get them used on vectors, as intrinsics and through the vectorizer, and
> from that try to estimate the amount of effort to cover more
> possibilities.

one instruction always works for me.  the hilarious thing, though, alexandre, is that in this case i think you'll find that with the abstraction approach i am advocating, once you have added even one instruction, absolutely every other instruction follows.

ok it may be the case that once one 2-operand instruction has been vectorised, every 2-op follows.

please understand though that i am *very specifically NOT*, in any way shape or form advocating alteration, augmentation or involvement of the gcc vectorizer or of autovectorization for this grant proposal.

this is *very specifically* a leveraging of gcc barely above machine code level to see how far that gets.

the only really significant modifications will be to the Register Allocation Table to get it to understand MAXVL multipliers.
Comment 16 Luke Kenneth Casson Leighton 2020-12-28 23:27:08 GMT
(In reply to Jacob Lifshay from comment #11)

> that's all fine and good at the assembly/machine-code level, but that
> totally won't work throughout the rest of the compiler, as explained by
> Alexandre and me previously.

i recognise that, and have no problem with it, at all.

we need something that can get us some funding that fits within NLnet's Assure Programme.

the PET Programme *ENDED* Dec 1st.  we can NOT resubmit the gcc proposal.


the bare-metal modifications i am proposing should take no more than 8 to 11 weeks, easily.

on top of that the crypto-primitive intrinsics can be added, in a way that makes it appear to be the "primary focus" of the proposal, taking up a large proportion of the budget.

these can double up as unit tests that test the bare metal sv intrinsics

if however the key focus of the proposal appears to be *gcc* then it is highly likely to be rejected because the Assure Programme is about *crypto primitives*, not gcc.

you see where i am going with this?

also the bare metal intrinsics will help Lauri because he will be able to write in c code rather than pure assembler.
Comment 17 Luke Kenneth Casson Leighton 2020-12-28 23:35:28 GMT
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > and the stunning thing is: there's not even any need to change the rest of
> > the compiler (except to add the svp64 prefix).  the *standard scalar add* is
> > inherently vectorised by virtue of the context.
> 
> I'm pretty sure it is waay more complex than that, since the compiler has to
> accurately represent all code, and the easiest way to accurately represent
> new operations (which is what all SV ops are) is by using intrinsics, since
> they look just like opaque function calls to most the compiler, so it won't
> try to optimize them in a way that isn't valid with SV.

an easy way to deal with that is to add a string prefix to the instructions inserted into the assembly stream.  svp64{something}.

this would stop them being treated as scalar instructions.  once through that phase a simple string-matching phase (in binutils) can remove that.


> > no intrinsic vector mul.
> > 
> > no intrinsic vector add
> > 
> > no intrinsic scalar-vector mul
> 
> I wasn't proposing having scalar-vector ops since those can be represented
> by _sv_mul(__sv_splat(lhs, ...), rhs, ...) and almost trivially
> pattern-matched at instruction selection time into the scalar-vector
> instructions.

this is a full-on compiler proposal.

if we put in a full-on gcc application for funding under the Assure Programme i can guarantee it will be rejected.

the trick i am proposing is borderline but doable, and it works *only* if cryptographic primitives are the *primary* focus of the Grant Application
Comment 18 Jacob Lifshay 2020-12-29 00:11:09 GMT
(In reply to Luke Kenneth Casson Leighton from comment #17)
> (In reply to Jacob Lifshay from comment #8)
> > (In reply to Luke Kenneth Casson Leighton from comment #5)
> > > and the stunning thing is: there's not even any need to change the rest of
> > > the compiler (except to add the svp64 prefix).  the *standard scalar add* is
> > > inherently vectorised by virtue of the context.
> > 
> > I'm pretty sure it is waay more complex than that, since the compiler has to
> > accurately represent all code, and the easiest way to accurately represent
> > new operations (which is what all SV ops are) is by using intrinsics, since
> > they look just like opaque function calls to most the compiler, so it won't
> > try to optimize them in a way that isn't valid with SV.
> 
> an easy way to deal with that is to add a string prefix to the instructions
> inserted into the assembly stream.  svp64{something}.
> 
> this would stop them being treated as scalar instructions.  once through
> that phase a simple string-matching phase (in binutils) can remove that.

the problem is *not* at the assembly level, the problem is at the GCC IR level, where everything is represented in a totally different way, using things like Static-Single-Assignment form, and is fundamentally different from just a munged assembly string. It's at the GCC IR level where all the optimizations are performed, and they need to have all operations represented, either in some opaque form (like I was proposing by using intrinsics *in the IR*) or in a form where all the optimizers are aware of the semantics and know that the instructions behave differently.
> 
> 
> > > no intrinsic vector mul.
> > > 
> > > no intrinsic vector add
> > > 
> > > no intrinsic scalar-vector mul
> > 
> > I wasn't proposing having scalar-vector ops since those can be represented
> > by _sv_mul(__sv_splat(lhs, ...), rhs, ...) and almost trivially
> > pattern-matched at instruction selection time into the scalar-vector
> > instructions.
> 
> this is a full-on compiler proposal.

Sorta yes, though having the compiler detect that particular pattern is quite easy.

remember, the compiler goes through that pattern matching layer (instruction selection) for all instructions. Every instruction (ignoring inline assembly) is generated as the result of matching some pattern -- so all we do is add patterns for vector-scalar ops instead of only having vector-vector patterns. It's really that trivial.
> 
> if we put in a full-on gcc application for funding under the Assure
> Programme i can guarantee it will be rejected.
> 
> the trick i am proposing is borderline but doable, and it works *only* if
> cryptographic primitives are the *primary* focus of the Grant Application

so, we have aes step be the first ALU op we implement :) the rest should be relatively easier once we have 1 working, since it's mostly copy/paste.
Comment 19 Luke Kenneth Casson Leighton 2020-12-29 00:43:58 GMT
(In reply to Jacob Lifshay from comment #18)

> the problem is *not* at the assembly level, the problem is in the GCC IR
> level, where everything is represented in a totally different way using
> things like Static-Single-Assignment form, and fundamentally different than
> just a munged assembly string. 

you get the general idea.  some sort of mark, that says "don't touch this"


> It's at the GCC IR level where all the
> optimizations are performed, and they need to have all operations
> represented, either in some opaque form (like I was proposing by using
> intrinsics *in the IR*) or in a form where all the optimizers are aware of
> the semantics and know that the instructions behave differently.

... but they don't act differently, do they?

the SV context applies externally.

you're still thinking in terms of "full compiler support", which will be rejected as a proposal.

therefore unless we find alternative sources of funding it's nice to discuss but not going to get done.

actually i realised that the MAXVL context
should in fact propagate through the standard gcc IR layer.

if one register has been marked as MAXVL=8 and it needs to be copied to an alternative location then the destination needs to inherit those exact properties: MAXVL=8 as well.

the RAT will then know it needs to grab a batch of 8 regs not one in order to perform the mv.

you are right about one thing: transferring from scalar to vector and back.  oh wait: 

    uint32_t a;
    __attribute__((sv_vector)) uint32_t *va;
    PUSH_SV_CONTEXT(MAXVL=8)
    va[1] = a;

not a problem after all.  in the IR this would go, "hmm va is marked as vector, a is scalar, va is allocated to r40-47 and a is allocated to r3: i need to issue a mv r41, r3 here"




> > 
> > 
> > > > no intrinsic vector mul.
> > > > 
> > > > no intrinsic vector add
> > > > 
> > > > no intrinsic scalar-vector mul
> > > 
> > > I wasn't proposing having scalar-vector ops since those can be represented
> > > by _sv_mul(__sv_splat(lhs, ...), rhs, ...) and almost trivially
> > > pattern-matched at instruction selection time into the scalar-vector
> > > instructions.
> > 
> > this is a full-on compiler proposal.
> 
> Sorta yes, though having the compiler detect that particular pattern is
> quite easy.
> 
> remember, the compiler goes through that pattern matching layer (instruction
> selection) for all instructions. Every instruction (ignoring inline
> assembly) is generated as the result of matching some pattern -- so all we
> do is add patterns for vector-scalar ops instead of only having
> vector-vector patterns. It's really that trivial.

with the trick i am proposing, even that is not done, necessary, or "taught to the compiler".

as long as the Register Allocation carries and respects the "batches" (propagates the MAXVL context) correctly my feeling is that the IR will cope perfectly fine.

the resource prioritisation (costings) for optimisation passes will increase by MAXVL registers, however i am not hugely concerned about optimisation phases initially.

> > 
> > if we put in a full-on gcc application for funding under the Assure
> > Programme i can guarantee it will be rejected.
> > 
> > the trick i am proposing is borderline but doable, and it works *only* if
> > cryptographic primitives are the *primary* focus of the Grant Application
> 
> so, we have aes step be the first ALU op we implement :) the rest should be
> relatively easier once we have 1 working, since it's mostly copy/paste.

Rijndael is more complex because the cycle mul phase involves a 64-bit x vec2 (or more likely 32-bit x vec4) as inputs in order to get up to 128 bit.

crc32 (see riscv bitmanip) would be a far easier starting point, moving to Rijndael as a second wave of primitives within the proposal, when it comes to adding vec2/3/4


let me think overnight about the "autogeneration of full intrinsics" idea.  if you can show me that there exists a machine-readable list of all openpower instructions in gcc or that it is VERY fast to create one (hours not days), such that the entire job can be done in eight weeks flat i'll be ok considering putting that forward to NLnet.

the majority of the proposal *has* to be crypto primitives not gcc focussed.
Comment 20 Luke Kenneth Casson Leighton 2020-12-29 18:37:40 GMT
https://gcc.gnu.org/git?p=gcc.git;a=blob;f=gcc/regs.h;h=11416c47f6f81c8c8f197bde3752422de6f2d9ae;hb=eeb145317b42d5203056851435457d9189a7303d#l203

looks like the concept of pseudo-registers and hard registers has already been abstracted out.


https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/tree-vect-loop.c;h=d6f1ffcd386c1e0b63cac21fabf8e5bad9be99ca;hb=eeb145317b42d5203056851435457d9189a7303d#l69

wow.  the vector-loop pass is pretty much exactly as i described in pseudocode. the syntax for the made-up attribute is remarkably similar except a typedef is used.
Comment 21 Alexandre Oliva 2020-12-29 20:21:04 GMT
> thus one potential approach (not one i am recommending) is to literally chuck out a batch of  add r3 add r4 add r5 add r6 instructions from gcc *as scalars* and have a post-analysis phase spot the patterns and vectorise them

this would actually avoid one problem that we're going to face, namely, that operations on vectors, as far as GCC is concerned, are parallel, rather than sequential.

I can't immediately see how to model the fact that a single insn actually takes multiple steps, and the output of one step may be an input for another.  The only support GCC has for this is the early-clobber marker, that one is supposed to use on an output when it must NOT overlap with inputs.  That's the recommended modelling when the insn may be split after register allocation, and the solution involves preventing the compiler from reusing input registers as so-marked outputs.

But when you've got vectors and you wish to exploit the fact that the iterations are sequential, preventing overlapping allocations won't cut it.  Though you probably wouldn't be doing such tricks through the compiler to begin with, so maybe it is fine, after all.


Regardless, I'm coming to think that we may be able to save ourselves a lot of headaches in the tooling side, out of this significant difference from other "vector" systems, by having our vectors modelled as parallel rather than sequential.  In most cases, it won't matter, because there won't be overlaps, but in those that do, the hardware could easily detect it and switch from counting upwards to downwards when needed to get the semantics of parallel access right.

Fail-first mode and twin-predication are ones that immediately raise flags of potential incompatibilities with this change, but it might still be worth at least considering it.
Comment 22 Luke Kenneth Casson Leighton 2020-12-30 03:19:54 GMT
(In reply to Alexandre Oliva from comment #21)

> this would actually avoid one problem that we're going to face, namely, that
> operations on vectors, as far as GCC is concerned, are parallel, rather than
> sequential.

SIMD-batches in other words.  this is a potential route (when looking at the autovectorisation stage, say appx 1+ years from now)

> register allocation, and the solution involves preventing the compiler from
> reusing input registers as so-marked outputs.

sensible.

> But when you've got vectors and you wish to exploit the fact that the
> iterations are sequential, preventing overlapping allocations won't cut it. 
> Though you probably wouldn't be doing such tricks through the compiler to
> begin with,

no, this is definitely advanced optimisation, that would need a lot of time.

> so maybe it is fine, after all.
> 
> 
> Regardless, I'm coming to think that we may be able to save ourselves a lot
> of headaches in the tooling side, out of this significant difference from
> other "vector" systems, by having our vectors modelled as parallel rather
> than sequential. 

yes.  this was the basis of the "allocate a batch of registers" idea, which it looks like gcc already has underlying support for this because of SIMD.

from the auto-vector loop case i found that the assumption there is that the SIMD width is fixed.  which is fine up to a point.  multi-issue in our case increases execution throughput.

in this case, in stage 2, the MAXVL would be set hardcoded to a value that, combined with elwidth, would always come out to an amount that took up exactly 4x 64 bit registers for example.

but to get to _that_ point we need stage 1 first 


> In most cases, it won't matter, because there won't be
> overlaps, but in those that do, the hardware could easily detect it and
> switch from counting upwards to downwards when needed to get the semantics
> of parallel access right.

this is quite a big change.  i will go over it in a separate (new) bugreport.  it has merit if we include an explicit mapreduce mode
   
> Fail-first mode and twin-predication are ones that immediately raise flags
> of potential incompatibilities with this change, but it might still be worth
> at least considering it.

this is why i wanted a first phase "just above bare metal" because once that is done and at least a SV_CONTEXT can be pushed on the stack there will be a better working knowledge and an incremental base.
Comment 23 Alexandre Oliva 2020-12-30 19:55:57 GMT
today's conversation about VL=0 got me thinking that VL is dynamic, which means the compiler may have to be able to mv from and to it.  a setvli insn that sets VL to MAXVL, or even to a different immediate, won't quite cut it.

I think the compiler has to be able to represent it as something it can set from another register, say one previously loaded from it, so that it can operate on vectors of different dynamic lengths, at least if we wish to be able to expose fail-on-first (which is where vl becomes visible as a distinct quantity from maxvl, IIUC).

This means it may not be enough for setvli and some privileged register-restore for interrupts to be able to set vl from an immediate or from memory: the compiler has to be able to save it and restore it elsewhere.

This even relates with calling conventions: are VLI and VL supposed to be preserved across function calls, so a function that uses them has to save the state and restore it before returning, or are the call-clobbered and it's up to the caller to preserve them?

Having the vector state represented as a call-preserved register (set) would even enable the (userland) exception handling machinery to restore the register while unwinding the stack.

Though having it preserved by the caller might also make a lot of sense, since the caller might not even have to save it, and instead just know how to set it back up.
Comment 24 Jacob Lifshay 2020-12-30 20:22:39 GMT
(In reply to Alexandre Oliva from comment #23)
> today's conversation about VL=0 got me thinking that VL is dynamic, which
> means the compiler may have to be able to mv from and to it.  a setvli insn
> that sets VL to MAXVL, or even to a different immediate, won't quite cut it.

there is a setvli which sets VL to an immediate.
there is also a setvl which sets VL to the minimum of a value read from a register and an immediate (used for limiting VL to the space allocated by the register allocator). both are unprivileged instructions and intended to be used in user code.

> I think the compiler has to be able to represent it as something it can set
> from another register, say one previously loaded from it, so that it can
> operate on vectors of different dynamic lengths, at least if we wish to be
> able to expose fail-on-first (which is where vl becomes visible as a
> distinct quantity from maxvl, IIUC).
> 
> This means it may not be enough for setvli and some privileged
> register-restore for interrupts to be able to set vl from an immediate or
> from memory: the compiler has to be able to save it and restore it elsewhere.

I expect there to be a VL SPR that can be read using a single mfspr instruction. setvl[i] is used for writing to VL, since it is VL = min(value, MAXVL) where MAXVL is the immediate containing the number of vector elements that the register allocator allocated space for.

> This even relates with calling conventions: are VLI and VL supposed to be
> preserved across function calls, so a function that uses them has to save
> the state and restore it before returning, or are the call-clobbered and
> it's up to the caller to preserve them?

I'm inclined to define two different ABIs where one has VL be clobbered by calls and the other has it be saved by calls, though I'm not sure which should be default.

There should also be a way to pass VL into a function as an argument through the VL register and optionally have the caller expect the same VL to be set on return, sorta like how structure return pointers are treated.

We may specify that VL must also be passed in r3 or something, but I don't know that that's necessary.

> Having the vector state represented as a call-preserved register (set) would
> even enable the (userland) exception handling machinery to restore the
> register while unwinding the stack.

I'd expect VL to usually need to be reloaded or just be unused after a stack-unwind anyway since that usually causes code to do completely different stuff like error handling or terminating, so either way would work here -- we just pick whichever is consistent with the default ABI.

> Though having it preserved by the caller might also make a lot of sense,
> since the caller might not even have to save it, and instead just know how
> to set it back up.
Comment 25 Alexandre Oliva 2020-12-30 20:48:48 GMT
ok, good, set from register was one of the pieces that seemed to be missing to me.  

how about some means to copy vl to a register?


hmm...  passing vectors in registers, now that's something I hadn't yet thought of.  it makes sense, though it's a little uncommon to have very wide types passed in (lots of) registers, let alone with a VL on the side.


as for exceptions...  though they're exceptional indeed, several languages use them for a lot more than just termination.  well-structured error handling and recovery could still benefit from preservation of state.

though the more I think about it, the more convinced I am that the caller is in a better position to tell whether vl needs saving, and how to restore it after a call if needed, whether in the regular or the exceptional exit paths.
Comment 26 Luke Kenneth Casson Leighton 2020-12-30 20:51:49 GMT
(In reply to Jacob Lifshay from comment #24)

> I'm inclined to define two different ABIs where one has VL be clobbered by
> calls and the other has it be saved by calls, though I'm not sure which
> should be default.

the generally-accepted industry-standard wisdom here is that making function calls *at all* from inside a vectorised loop (unless they are static inline) is simply too hard a problem to consider let alone write an ABI for.

if this wisdom is accepted then the problem entirely disappears.  VL does not need saving (except by traps / context-switching, which is not the program's responsibility)

remember that this is a Sub-Program-Counter we are talking about.

the idea of letting the compiler "loose" on VL is one that is effectively equivalent to allowing "gotos" within a sub-function.  it's a similar level of insanity.
Comment 27 Luke Kenneth Casson Leighton 2020-12-30 21:05:48 GMT
(In reply to Alexandre Oliva from comment #25)
> ok, good, set from register was one of the pieces that seemed to be missing
> to me.  

it's the fundamental core basis of Cray Vectorisation.  it is critically important to understand this.

i recommend reading and re-reading the DAXPY example in the sigarch "SIMD Considered Harmful" article again and again until it is clear.

setvl takes a register as the primary "wanted, total" number of "things" to be processed.   10,000 DAXPY loops.  50,000 DAXPY loops.

the hardware says, "actually MAXVL is set to only 8.   so although you requested 50,000 i am actually only going to set VL to 8".

a copy of that value of VL is also placed into RT.

the DAXPY operations get done: because VL=8 only 8 madds get done.

then the loop counter which was formerly 50,000, has RT subtracted from it.

it is now 49992.

not zero, so round the loop we go.

setvl receives a request to set VL equal to 49992.  it goes, "you requested 49992.  MAXVL is set to 8.  i can only give you 8.  therefore VL and RT will be set to 8.  have a nice day"

etc etc until finally the loop gets down to e.g. 6 or 3 or 2

at which point, setvl says, "you requested i set VL equal to 2.  this is less than 8.  you get what you asked for.  VL and RT are now 2.  have a nice day"

and finally, loop counter minus 2 equals zero, this zero is detected and the loop is exited.

this is how Cray Vectors have worked since the 70s, and it is awesome.  massive hardware could do 10,000 elements in one instruction.  you ask for VL=10,000 you damn well GOT 10000.  the hardware was however insane and needed special off-chip multi-ported SRAMs *just for the register file*, gallium arsenide and liquid gas cooling, it was mental.
Comment 28 Jacob Lifshay 2020-12-30 21:11:20 GMT
(In reply to Luke Kenneth Casson Leighton from comment #26)
> (In reply to Jacob Lifshay from comment #24)
> 
> > I'm inclined to define two different ABIs where one has VL be clobbered by
> > calls and the other has it be saved by calls, though I'm not sure which
> > should be default.
> 
> the generally-accepted industry-standard wisdom here is that making function
> calls *at all* from inside a vectorised loop (unless they are static inline)
> is simply too hard a problem to consider let alone write an ABI for.
> 
> if this wisdom is accepted then the problem entirely disappears.  VL does
> not need saving (except by traps / contextswitching which is not the
> program's responsibility)

that works for traditional loop vectorization but not really for full function vectorization, which is needed for graphics shaders. in graphics shaders, VL is set at the start of the shader and not changed; the whole shader program (potentially many functions and thousands of lines of code) is all inside the vectorized loop. that is part of why I'm planning on having the vectorization pass in the shader compiler, where the IR is kept in a form that is vectorizable by construction (no irreducible control-flow-graphs).
Comment 29 Luke Kenneth Casson Leighton 2020-12-30 21:23:08 GMT
(In reply to Alexandre Oliva from comment #23)
> today's conversation about VL=0 got me thinking that VL is dynamic, which
> means the compiler may have to be able to mv from and to it.  a setvli insn
> that sets VL to MAXVL, or even to a different immediate, won't quite cut it.

correct.  see
https://libre-soc.org/openpower/sv/setvl/

you will see that taking in a register RA is possible, as is getting a copy out of VL, in RT.
 
> This means it may not be enough for setvli and some privileged
> register-restore for interrupts to be able to set vl from an immediate or
> from memory: the compiler has to be able to save it and restore it elsewhere.

the complexity involved is so great that absolutely nobody has tried this.

they simply set the rule, "no function calls allowed except static inline (or scalar ones that do not *in any way* touch vector state)."

 
> This even relates with calling conventions: are VLI and VL supposed to be
> preserved across function calls, so a function that uses them has to save
> the state and restore it before returning, or are the call-clobbered and
> it's up to the caller to preserve them?

no.  generally the rule is: function calls are simply prohibited.

> Having the vector state represented as a call-preserved register (set) would
> even enable the (userland) exception handling machinery to restore the
> register while unwinding the stack.

...  interesting.  i mean, this is theoretically possible: SVSTATE is after all a user-accessible SPR.

however please understand that exceptions, even userland ones, *have* to switch to an alternative SPR for SVSTATE.

it should be clear that allowing any exception to have the Sub-PC set to some arbitrary value at the beginning of the trap is an absolute 100% guaranteed way to get data corruption.


> Though having it preserved by the caller might also make a lot of sense,
> since the caller might not even have to save it, and instead just know how
> to set it back up.

honestly this is an entire area of active research that has not been solved, merely avoided entirely by setting some strict rules (no extern function calls allowed)
Comment 30 Luke Kenneth Casson Leighton 2020-12-30 21:29:23 GMT
(In reply to Jacob Lifshay from comment #28)

> that works for traditional loop vectorization but not really for full
> function vectorization which is needed for graphics shaders. in graphics
> shaders, VL is set at the start of the shader and not changed, 

yes that's a completely different specialised area, which does not need to interact with general purpose code, where it cannot be determined whether the callee *or one of its sub-functions* is going to use (change) VL.

remember that MAXVL is also part of the SVSTATE SPR.  if we do this at all my feeling is it should be caller-saved.

i am reluctant to get into the topic of defining ABIs however, at this early stage.
Comment 31 Alexandre Oliva 2020-12-30 22:52:13 GMT
luke, knowing what was or wasn't available in the cray implementation wouldn't tell me something was part of our plans.  I had happened to come across setvli, but I had not seen any mention (that I recall) of setvl, that's all.

also, the output of setvl is not really what I was interested in, though the responses have made me realize that, if we're to model VL in GCC at all, we can't just assume it will hold whatever value we store in it.

what I was interested in was the dynamic setting of vl after a fail-on-first.  the compiler may have to save it and restore it even in the absence of function calls.  an intervening use of a different vector may require a setvl, and then, when we go back to the original vector, we'd better restore the vl that it had at the end, rather than assume it's the same as in the beginning.

now, you could say "don't do that", but I'm just trying to model things in the compiler, and once you model vl as a register in the compiler, and state the register has to be set before an instruction can operate on this object as a vector, there is a possibility it will vectorize stuff in a loop that calls a function that happens to use a different vector type, and even that the function is inlined.  as an ISA designer, you may not expect this sort of behavior, but as a compiler writer, I expect arbitrary code to be thrown at it, so I have to take these possibilities into account.


now, the mention of setting sub-pc is confusing when all I'm asking for is a means to load and restore the vl.  I suppose the mention was because they're all expected to be represented in the same special-purpose register, and an interrupt handler may very well have to preserve it, restore it, and even attempt to tinker with it; so might a signal handler.

but I was talking about userland exception handling.  think setjmp/longjmp, not iret.  you won't longjmp into the middle of the execution of a vector-prefixed instruction, indeed, but you might want to restore the maxvl and the vl to those of the enclosing context.  and, indeed, one could argue a signal handler *must* do that whether it returns to the point of execution, or it raises an exception.  some languages rely heavily on raising exceptions for certain kinds of error recovery (Ada comes to mind), and that's often implemented as raising exceptions from within signal handlers.

now, if the compiler were to assume that at an exception landing pad the vector state is preserved from what it set up before a loop (without any function calls), and we don't preserve it, that would be a problem.  consider:

  setvli
  for (int i = 0; i < e; i += VL) {
    try:
      __builtin_vector_intrinsic(...); // not a function call
      __builtin_vector_intrinsic(...); // not a function call
      __builtin_vector_intrinsic(...); // not a function call
    except:
      log the problem, recover or skip, then continue on to the next iteration
  }

do you see the problem of not restoring vector state there?

I assume you do.  now let's take this one step further.  consider any of these intrinsics actually modify VL (e.g. fail-on-first); assume VL is an asm register variable or a macro that reads from the VL part of the special-purpose vector status registers somehow.

after we continue, we add VL to the loop counter.  it may have been modified within the loop body, or even in an earlier loop iteration.  do you see the problem if the exception raised out of the signal handler doesn't restore the VL that was in effect, and instead leaks unrelated MAXVL and VL, set up within the signal handler, to the exception handler within your vector loop?
Comment 32 Jacob Lifshay 2020-12-30 22:58:44 GMT
the way I was imagining it would work is that VL is a register that can be register allocated like normal, copy from VL would be mfspr, copy to VL would be setvl with MAXVL=64, and there would be a separate setvl compiler op that translates to setvl with MAXVL set to the length of the vector types in the vectorized code (or just setvli to set VL to MAXVL).
Comment 33 Luke Kenneth Casson Leighton 2020-12-31 01:45:16 GMT
(In reply to Alexandre Oliva from comment #31)
> luke, knowing what was or wasn't available in the cray implementation
> wouldn't tell me something was part of our plans.  I had happened to come
> across setvli, but I had not seen any mention (that I recall) of setvl,
> that's all.
> 
> also, the output of setvl is not really what I was interested in, though the
> responses have made me realize that, if we're to model VL in GCC at all, we
> can't just assume it will hold whatever value we store in it.

correct.  it stores the value that is defined by what the setvl instruction does.  and that behaviour is very specifically defined, in such a way that yes, the compiler may model it.

however...

again i do have to stress that this is seriously, seriously out of scope for this bugreport.  yes it is part of the needs of a project that is approximately one year (minimum) into the future from now, however as far as this particular proposal is concerned, it is actually a major distraction, because it's only "autovectorisation" that needs to have gcc understand what VL is.

for "stage 1" we will need a *modest* understanding of VL - an intrinsic function (one of the very few) that can get RT stored into a size_t variable (exactly as is done in the rvv gcc patch).

but as far as "gcc auto-recognising the concept of VL such that it may perform loop auto-vectorisation", no that is *definitively* part of "stage 2" and is 100% out of scope.


> what I was interested in was the dynamic setting of vl after a
> fail-on-first. 

again: i stress that although i am happy to answer, it's critically important that you realise and accept that those answers are 100% out of scope for this bugreport and for the proposal involving crypto-primitives and "just above bare-metal"


>  the compiler may have to save it and restore it even in the
> absence of function calls.  an intervening use of a different vector may
> require a setvl, and then, when we go back to the original vector, we'd
> better restore the vl that it had at the end, rather than assume it's the
> same as in the beginning.

the general rule is that any given loop uses VL for the purposes of that loop, and nothing else.  if there is a separate loop, it is a separate VL and the two absolutely do not mix.

VL within the context of a given loop therefore absolutely and exclusively applies solely to the vectors within that loop.

*at no time* will there be a different (overlapping) VL applying to an unrelated vector, by which "return" to an "original" vector is proposed or considered.

so - no: save and restore of VL within a loop - with or without fail-first - is just something that "is not done".

now, that said: if you have *nested* vector loops, then yes, i would say that saving and restoring of VL (and MAXVL, remember) would be reasonable.

again, however, i stress that that is *strictly* "phase 2", and at the "phase 1 just above bare metal" level, it would be the responsibility of the *developer* to perform the necessary saving/restoring, explicitly and directly.


> now, you could say "don't do that", but I'm just trying to model things in
> the compiler, and once you model vl as a register in the compiler, and state
> the register has to be set before an instruction can operate on this object
> as a vector, there is a possibility it will vectorize stuff in a loop that
> calls a function that happens to use a different vector type, and even that
> the function is inlined. 

again: this is very much out of scope, an active area of cutting-edge research that could literally absorb any one of us for a minimum of 12-18 months, all on its own.

it is also specifically why many Vector Compiler ISA writers specifically prohibit the calling of external functions.

the inlining of functions, i don't even know where to begin, there, to answer, and that should in itself give you a "red flag" that, with my broad expertise, if i don't know an answer - or even where to look - then it's *going* to be a hard problem that involves weeks if not months of active research.




> as an ISA designer, you may not expect this sort
> of behavior, but as a compiler writer, I expect arbitrary code to be thrown
> at it, so I have to take these possibilities into account.

for stage 2 - some time at least a minimum of 1 year into the future: yes.

for stage 1 - "just above bare metal" - no.  the scope is limited to the register allocation table, __attribute__(vec), adding a setvl intrinsic, and push/pop of a MAXVL context.  oh, and adding the bitmanip/crypto-primitives as intrinsic functions.  that's pretty much it.

this "stage 1" which is startlingly similar to gcc SIMD auto-vectorisation will *provide* you with the understanding and experience *to* do the "stage 2"...

... but until stage 1 is completed the risk is that we won't even get to "stage 2" because we can only apply - just - for funding from NLnet for stage 1.


> 
> now, the mention of setting sub-pc is confusing when all I'm asking for is a
> means to load and restore the vl.

ok, i'm happy to answer - bear in mind that this is for stage 2 (which isn't going to get funded for at least a year).  it's very easy: get a copy of the SVSTATE SPR, push it on the stack.  this will save both VL and MAXVL at the same time.


>  I suppose the mention was because they're
> all expected to be represented in the same special-purpose register, and an
> interrupt handler may very well have to preserve it, restore it, and even
> attempt to tinker with it; so might a signal handler.  but I was talking
> about userland exception handling.  think setjmp/longjmp, not iret. 

again: this is stage 2 thinking.  we're a minimum 1 year away from that and have no available funding for stage 2.

it's good that you're thinking about it... but until stage 1 is done we can't get to it.

> [....]
> do you see the problem of not restoring vector state there?

yes... and it is something that should be solved for stage 2 with a minimum budget of EUR 50,000 to 100,000, which given that i made the mistake of cancelling the NLnet gcc Grant request, we need to find some other way.


> 
> I assume you do.  now let's take this one step further.  consider any of
> these intrinsics actually modify VL (e.g. fail-on-first); assume VL is an
> asm register variable or a macro that reads from the VL part of the
> special-purpose vector status registers somehow.

for "stage 1" this is not a problem at all because the code writer is assumed to have a full and complete understanding of SV, and knows to expect VL to be modified by ffirst.

for stage 2 - which is completely out of scope for the purposes of putting in a "stage 1" grant request, for supporting of crypto-primitives - it can be solved.

> 
> after we continue, we add VL to the loop counter.  it may have been modified
> within the loop body, or even in an earlier loop iteration.  do you see the
> problem if the exception raised out of the signal handler doesn't restore
> the VL that was in effect, and instead leaks unrelated MAXVL and VL, set up
> within the signal handler, to the exception handler within your vector loop?

for stage 2 - which, again, is completely out of scope for the purposes of this discussion, it would indeed be the responsibility of the compiler to understand VL and ffirst, because it would no longer be the developer's direct responsibility.

however, again: i emphasise: we do not have funding for stage 2, and we cannot get it from NLnet.  the PET Programme ended on Dec 1st 2020.  we can *only* apply for funding under the NLnet "Assure" Programme which specifically requires a cryptographic focus. (or, there is also the NGI0 Programme which we *might* be able to apply under "Internet Search".  i will have to investigate and think about it).

i have spoken to Michiel and he agrees 100% that writing cryptographic routines in pure assembler is flat-out madness and unacceptable.  they'll be unreadable, unreviewable, and consequently lead to disastrous security.

i was then able to explain to him that if we were able to write at least in c-code, still requiring an understanding of SV, "just above bare metal", this will at least result in readable code [no trying to transfer between ints and assembly regs using "asm" statements].

(also that Lauri's task would be a lot easier)

then we get to use the crypto-primitives as an excuse to write unit tests of the SV hardware (and simulator) and to "prove" SV... but *WITHOUT* needing to go the whole way of a full auto-vectorisation compiler that would normally cost USD 250,000 to 1,000,000 in funding to complete.
Comment 34 Alexandre Oliva 2020-12-31 02:14:43 GMT
so what you seem to be telling me is that I should disregard the knowledge I have of what the compiler *will* do, and of the least that *needs* to be modelled in order to get us the reliable functionality we want, and instead do what you think is the bare minimum that will get us a high-level assembler as long as you don't breathe too hard while compiling the code, and if the compiler still breaks stuff because we haven't modelled the bare minimum I know we should, then, well, screw the user...

can I do that?  sure.  is that what I think we should do?  definitely not, unless you want it to not really work, and then to throw it away when we get to the second stage.

now, one of us is a compiler engineer who wrote a report on all the GCC internal transformations and optimizations not very long ago.

the other is someone who has a lot of exceptionally important pieces of knowledge to make this project work, including of what we'd like to have for a first-step implementation to apply for a grant.

what we have is a mismatch between the assumptions made by one person and the reality of what needs to be done to make this first step a step in the right direction, rather than too-fragile-to-use throw-away code.

knowing the internals as I do, I want to figure out where we're aiming to get in the end, so that I can *then* figure out what we need to get the feature you wish to apply for a grant on that will take us closer to the end goal.

arguing we don't wish the compiler to do things that it will normally do doesn't make the job easier as you seem to think; it makes it harder, because then we'll have to do the job AND stop the compiler from doing what it normally does.  is that understood?
Comment 35 Luke Kenneth Casson Leighton 2020-12-31 03:57:53 GMT
(In reply to Alexandre Oliva from comment #34)

> knowing the internals as I do, I want to figure out where we're aiming to
> get in the end, so that I can *then* figure out what we need to get the
> feature you wish to apply for a grant on that will take us closer to the end
> goal.

yehyeh, i was just thinking (half asleep, 2am), "hmm i replied point-by-point rather than stepping back and thinking, i should followup and point out that i do understand that you're looking to plan ahead".

> arguing we don't wish the compiler to do things that it will normally do
> doesn't make the job easier as you seem to think; it makes it harder,
> because then we'll have to do the job AND stop the compiler from doing what
> it normally does.  is that understood?

yes, it is.  appreciated you pointing this out.

i need to do a code walkthrough of gcc.  it's been some time since i last did that.
Comment 36 Luke Kenneth Casson Leighton 2021-01-11 05:00:56 GMT
ok so to recap.

* VSX is a SIMD system with a fixed byte width and no support for predication
* SV is a Cray-style variable-length Vector system with dynamic variable register allocation and advanced predication
* we are NOT going to try to "map" SV onto VSX at the hardware level: VSX, being based on SIMD, is a known harmful concept.
* we ARE going to implement SIMD at the BACKEND:  this will NOT be visible to the user.  the user will ONLY interact with the SIMD ALUs via the Variable-Length SV ISA.

the mismatch between those two means that any efforts to try to examine the VSX OpenPOWER support in gcc and "adapt" it to SV are also pretty much 100% guaranteed to fail.  even examining the VSX support will lead to confusion and misleading ideas.

* VSX with its fixed 16 byte range may only be matched by loop patterns that match exactly the 16 bytes.  only 2 DWORDs fit into 16 bytes but 4 WORDs fit.
* SV has a variable vector length where MVL is specified in ELEMENTs.  MVL=8 takes 8 64bit registers for 8 64 bit computations, 4 64bit registers for 8 32 bit computations, down to only ONE 64bit register for 8x 8bit computations.

* just as in the SIMD Considered Harmful article, VSX is forced at the end of the loop to engage in insidious "cleanup", dealing with the leftover elements that do not fill the power-of-two width
* in SV, the loop itself deals with it.

thus, Alexandre, can you see, based on the understanding you've gained over the past week (we are not mapping to VSX internally, MVL sets an allocation of regs for use by VL) that there is virtually nothing within the gcc support for VSX that can be used?

i.e. that the expectation you had, that the direction i was describing, would be a waste of time and need throwing away, is flawed?

rather, it is the entire VSX SIMD code that needs to be disregarded, and other backends examined for clues:

* AVX512 (because it has predication)
* SVE2, if it has landed, because it
  likewise has predication, and it has
  fail-first on LD.
* RVV, because it is the first modern
  Cray Vector ISA in several decades.

now, whilst i very much wanted the following type of construct due to its startling similarity to the vector loop optimisation:

   __attribute__(SV.sat, etc) {
        statements
        ...
   }

it may instead be enough to carry the attributes on types, variables and functions.  place the saturation attribute on the variable, and on assignment from another variable saturation occurs.

also interestingly there already exists a vector_size attribute which LITERALLY directly maps to MVL:
 https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html#Vector-Extensions
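for reference, the existing extension really does behave this way today; a minimal (non-SV) example using only what gcc already provides, noting that the attribute argument is the total size in bytes:

```cpp
#include <cstdint>

// the existing gcc vector_size extension linked above: 16 bytes of
// uint32_t gives a 4-element vector, usable with ordinary operators
// and subscripting, no intrinsics needed
typedef uint32_t v4u32 __attribute__((vector_size(16)));

uint32_t third_of_doubled(v4u32 a) {
    v4u32 b = a + a;   // element-wise add
    return b[2];
}
```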

my feeling is that rather than add several tens of thousands of intrinsics autogenerated (the software equivalent of propagating "SIMD considered harmful"), __attribute__ could well provide the means to bring in SVP64 Context, with very little needed to be added with the exception of setvl.

of course, unusual intrinsics yes: the new set-before-first opcode, the integer-max opcode etc etc these will all need adding...  as *scalar* opcodes.  later the vector-based ones such as conflictd and so on.

attribute can add:

* subvl
* indicate which variable is to be a predicate source
* saturation
* even mapreduce mode

interestingly it would not introduce elwidth overrides, these would be derived from the types uint64_t interacting with uint32_t etc etc.
Comment 37 Jacob Lifshay 2021-01-11 05:37:11 GMT
(In reply to Luke Kenneth Casson Leighton from comment #36)
> my feeling is that rather than add several tens of thousands of intrinsics
> autogenerated (the software equivalent of propagating "SIMD considered
> harmful"), __attribute__ could well provide the means to bring in SVP64
> Context, with very little needed to be added with the exception of setvl.

I think we should have intrinsics for saturated/not add, sub, mul, etc., since gcc intrinsics can be generic over the type such that one intrinsic works on all the different combinations of scalar/vector types. That way, we'd only need something like 20-30 intrinsics to cover all the weird and wonderful alu/load/store ops, not hundreds or thousands.

Example:

typedef int16_t MyVec[45] __attribute__((sv_vector));

void mix_audio(int16_t a[], int16_t b[], int16_t dc_balance, size_t len) {
    while(len > 0) {
        size_t vl = __sv_setvl<MyVec>(len);
        MyVec v_a = __sv_load<MyVec>(a, vl);
        MyVec v_b = __sv_load<MyVec>(b, vl);
        MyVec temp = __sv_sat_sub(v_a, dc_balance, vl);
        temp = __sv_sat_add(temp, v_b, vl);
        __sv_store(a, temp, vl);
        a += vl;
        b += vl;
        len -= vl;
    }
}
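the saturating behaviour those hypothetical __sv_sat_* intrinsics assume can be pinned down with a plain scalar reference model (the function names here are illustrative only, not proposed intrinsics):

```cpp
#include <cstdint>
#include <algorithm>

// scalar reference model of 16-bit signed saturation: compute in a
// wider type, then clamp to the int16_t range
int16_t sat16(int32_t x) {
    return (int16_t)std::min<int32_t>(INT16_MAX,
                    std::max<int32_t>(INT16_MIN, x));
}
int16_t sat_add16(int16_t a, int16_t b) { return sat16((int32_t)a + b); }
int16_t sat_sub16(int16_t a, int16_t b) { return sat16((int32_t)a - b); }
```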
Comment 38 Luke Kenneth Casson Leighton 2021-01-11 14:42:02 GMT
(In reply to Jacob Lifshay from comment #37)
> (In reply to Luke Kenneth Casson Leighton from comment #36)
> > my feeling is that rather than add several tens of thousands of intrinsics
> > autogenerated (the software equivalent of propagating "SIMD considered
> > harmful"), __attribute__ could well provide the means to bring in SVP64
> > Context, with very little needed to be added with the exception of setvl.
> 
> I think we should have intrinsics for saturated/not add, sub, mul, etc.,
> since gcc intrinsics can be generic over the type such that one intrinsic
> works on all the different combinations of scalar/vector types. That way,
> we'd only need something like 20-30 intrinsics to cover all the weird and
> wonderful alu/load/store ops, not hundreds or thousands.

conceptually there is very little difference to this:

// use the existing gcc vector_size attribute
typedef int16_t MyVec[45] __attribute__((vector_size=45));

typedef int16_t MyVecSat[45] __attribute__((vector_size=45, sv_satsigned=8bit));
 
 void mix_audio(int16_t a[], int16_t b[], int16_t dc_balance, size_t len) {
     while(len > 0) {
         size_t vl = __sv_setvl<MyVec>(len);
         MyVec attribute(sv_vl=vl) v_a = a;
         MyVec attribute(sv_vl=vl) v_b = b;

         MyVecSat temp = v_a - dc_balance;
         temp += v_b;
         ...
         ...

can you see that those are functionally directly equivalent, one requires intrinsics, the other does not? (except setvl)

the creation of *any* explicit intrinsics effectively disregards the entire SV concept which is that it is context-based, losing a golden opportunity in the process.

the clue was in "how easy it would be to autogenerate" being literally a set of nested for-loops with very few exceptions of combinations that won't work.

wherever that is possible you have two choices:

1) create the mass of permutations
2) leave it up to a "context" which is intelligently applied and understood (this is how SE/Linux works btw)

i have made the mistake of (1) twice now.  the first time was in Samba TNG, the second time was with 30,000 autogenerated functions in python-webkit.

the result was a massive binary executable size increase, where microsoft, by having a dynamically runtime interpreted type library, managed to create tiny binaries that only needed one library to handle absolutely every single MSRPC interaction.

if we were doing a "standard" Vector ISA with large holes in the potential permutations (because the combinations spiral out of control just as they do for SIMD) i would agree with you immediately, jacob: an autogenerated by-the-numbers set of intrinsics would be the only sane way to do it.

the fact that RVV has added specialist this, specialist that, only one set of reduce operations for example (where we have abstracted it), means there is no other option for them.

we on the other hand have abstracted out even saturation as a general concept.  mapreduce likewise.  REMAP likewise.  REMAP can be applied even to vectors of fcmp or vectors of bpermd.

and here is the kicker: all these abstractions apply to *future* instructions that we can't yet envisage.  add even one new instruction and that entire autogenerated intrinsics table has to be redone.

by contrast the "patch" surface for adding any new scalar instruction if we use attributes is extremely small.  it might even be possible to do as macros wrapped around inline asm blocks, zero source modifications to gcc *at all*


additionally: having the frontend generate strings (or AST) such as "sv_satu_ew8_subvl3_add()" means they then have to be de-flattened and decoded when creating svp64 prefixes.

that alone is a hell of a lot of CPU power *and* a lot of work.

or

attribute(svsatu, subvl3, ew8)

this not only fits into existing gcc data structures it requires far, far less code to turn into an svp64 RM24 context.  i mean, it *is* the svp64 context.

nothing about the explicit creation of intrinsic datatypes and functions looks good to me.  by complete contrast using attributes carries type context that meshes pretty much directly with svp64.
Comment 39 Luke Kenneth Casson Leighton 2021-01-11 21:47:36 GMT
right.  ok.  one area which is not going to fit precisely into the __attribute__ concept is: CR-based predication.

the handling (transfer in and out between INTs and CRs) is one that may need explicit intrinsics.  the crweird opcodes for example will need a data template that is recognisably vectoriseable.  bear in mind that all the crweird instructions have been designed as *scalar* not vector, and that the vectorised versions will need special-casing.

https://libre-soc.org/openpower/sv/cr_int_predication/

a suitable data structure will need to be defined that can recognise CRs (if one does not already exist in gcc-ppc64), i would imagine something along the lines of a struct containing 4 bools: eq, lt, gt, so

when an array of such is marked with the standard gcc attribute vector_size this would give a variable that can be passed around as a cr predicate, and also passed in as a parameter to the crweird intrinsics.
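a sketch of such a data structure, and of the "Vector of eq bits" view, purely as a software model (the struct layout and function name are assumptions, not existing gcc types):

```cpp
#include <cstdint>
#include <cstddef>

// hypothetical model of one CR field as four condition bits
struct cr_field { bool eq, lt, gt, so; };

// view an array of CR fields as a single bitvector of eq bits:
// bit i of the returned mask is element i's eq bit, matching the
// uint64_t-with-a-different-type mask representation
uint64_t pack_eq_mask(const cr_field *crs, size_t n) {
    uint64_t mask = 0;
    for (size_t i = 0; i < n && i < 64; i++)
        if (crs[i].eq) mask |= (uint64_t)1 << i;
    return mask;
}
```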

how does gcc handle CRs in ppc64 right now?
Comment 40 Jacob Lifshay 2021-01-11 22:00:18 GMT
I think that CRs should be handled entirely by the gcc backend, they will have mask values allocated to them automatically. In the gcc frontend, masks are just bitvectors (basically a uint64_t with a different type).

masks are generated by using compare intrinsics and by using bit-wise logic ops on masks. masks can be cast to/from integers.

the cr logic ops and the cr from int and cr to int ops are generated automatically by the instruction selector and by the register allocator.
Comment 41 Jacob Lifshay 2021-01-11 22:04:12 GMT
(In reply to Jacob Lifshay from comment #40)
> I think that CRs should be handled entirely by the gcc backend, they will
> have mask values allocated to them automatically. In the gcc frontend, masks
> are just bitvectors (basically a uint64_t with a different type).
> 
> masks are generated by using compare intrinsics and by using bit-wise logic
> ops on masks. masks can be cast to/from integers.
> 
> the cr logic ops and the cr from int and cr to int ops are generated
> automatically by the instruction selector and by the register allocator.

this should be relatively easy since all you need to do is add the proper instructions for the generic register move ops generated by the register allocator:
move bitvector from int to cr
move bitvector from cr to int
Comment 42 Luke Kenneth Casson Leighton 2021-01-11 22:49:32 GMT
(In reply to Jacob Lifshay from comment #40)
> I think that CRs should be handled entirely by the gcc backend, they will
> have mask values allocated to them automatically.

in the "final" version (the final target) i could not agree more. results (all results) can have a CR associated with them: when Vectorised this becomes a Vector of associated CRs and i would anticipate there not being much in the way of changes to gcc-ppc64 scalar code to transform it to support "scalar is now vector yes even CRs"

we are however at a weird state where for the minimum amount of work we need an interim step that at least gives developers such as Lauri, plus the cryptoprimitives Grant proposal, a level just above assembler that does not require 2 years of gcc development before they can even begin work.

to expect gcc to automatically support vectors-of-CRs unfortunately does exactly that: places significant gcc work smack on the critical path.  

> In the gcc frontend, masks
> are just bitvectors (basically a uint64_t with a different type).

ah that might work.  ahh i know.  each CR Vector is considered to be a Vector of eq bits, a Vector of lt bits etc.

the uint64_t with different type you mention is:

   bit 0 is CR0.eq
   bit 1 is CR1.eq
   etc

the crweirder intrinsic then takes these parameters:

   * struct of 4 uint64_t named eqltgtso
   * same as dest   
   * mode
   * mask

(or, just 10 arguments, 4x uint64_t src, 4x dest)

this sounds very much like what Tom Forsyth was talking about.  design the ISA so that the compiler can understand it.


 
> masks are generated by using compare intrinsics and by using bit-wise logic
> ops on masks. masks can be cast to/from integers.
> 
> the cr logic ops and the cr from int and cr to int ops are generated
> automatically by the instruction selector and by the register allocator.

it would be spectacularly weird to be striping the CR eq/so/lt/gt bits across
Comment 43 Luke Kenneth Casson Leighton 2021-01-11 23:06:17 GMT
(In reply to Jacob Lifshay from comment #41)
> (In reply to Jacob Lifshay from comment #40)

> > the cr logic ops and the cr from int and cr to int ops are generated
> > automatically by the instruction selector and by the register allocator.
> 
> this should be relatively easy since all you need to do is add the proper
> instructions for the generic register move ops generated by the register
> allocator:
> move bitvector from int to cr
> move bitvector from cr to int

if the CR masks are striped down all the eq bits, all the lt bits etc. this meshes well.  the register allocator becomes a weird bitlevel allocator but hey.

i would very much prefer that interactions between the cr mask intrinsics *not* require tight integration and understanding in gcc other than in the regfile allocator (keep gcc off the critical path)

that implies keeping *away* from CR0-7 as much as possible, so that use of a crweird intrinsic doesn't completely destabilise scalar code.  once we have bootstrapped up, got VC funding, this can be revisited.
Comment 44 Jacob Lifshay 2021-01-11 23:38:50 GMT
(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jacob Lifshay from comment #40)
> > I think that CRs should be handled entirely by the gcc backend, they will
> > have mask values allocated to them automatically.
> 
> in the "final" version (the final target) i could not agree more. results
> (all results) can have a CR associated with them: when Vectorised this
> becomes a Vector of associated CRs and i would anticipate there not being
> much in the way of changes to gcc-ppc64 scalar code to transform it to
> support "scalar is now vector yes even CRs"

I meant that the concept of CRs doesn't show up at all in the compiler frontend/ir (except for inline assembly) and only appears when the generic ir is translated to PowerPC-specific code in the backend as part of instruction selection and register allocation. That's how CRs are currently handled in LLVM anyway.

Masks are just vectors of bits (conceptually a single bit per element), and the compiler frontend only ever sees the integer register representation.

The CRs are treated as a single vector, not a group of 4 vectors.

This matches the current compiler-level representation of a CR register, which I've deduced is just a single bit per CR field (not 4), along with an indication, statically calculated by the instruction selector, of which bit to use:

int a < b produces:
cmp dest_cr, a, b
with dest_cr.lt being the bit to use

int a <= b produces:
cmp dest_cr, a, b
with !dest_cr.gt being the bit to use

int a == b produces:
cmp dest_cr, a, b
with dest_cr.eq being the bit to use

int a != b produces:
cmp dest_cr, a, b
with !dest_cr.eq being the bit to use

int a > b produces:
cmp dest_cr, a, b
with dest_cr.gt being the bit to use

int a >= b produces:
cmp dest_cr, a, b
with !dest_cr.lt being the bit to use

float a < b produces:
fcmpu dest_cr, a, b
with dest_cr.lt being the bit to use

float a <= b produces:
fcmpu dest_cr, a, b
cror dest_cr.eq, dest_cr.lt, dest_cr.eq
with dest_cr.eq being the bit to use

float a == b produces:
fcmpu dest_cr, a, b
with dest_cr.eq being the bit to use

float a != b produces:
fcmpu dest_cr, a, b
with !dest_cr.eq being the bit to use

float a > b produces:
fcmpu dest_cr, a, b
with dest_cr.gt being the bit to use

float a >= b produces:
fcmpu dest_cr, a, b
cror dest_cr.eq, dest_cr.gt, dest_cr.eq
with dest_cr.eq being the bit to use


Collisions with other instructions that use CRs are easily handled by the register allocator; all we need to do is tell it which registers overlap with which other registers, and it handles all the rest without needing modification (except for supporting allocating ranges of registers instead of single registers).
Comment 45 Jacob Lifshay 2021-01-11 23:45:21 GMT
Adding support for CRs to the front end and mid-level IR would be more complex than just adding the required support to the backend -- the backend changes required to support automatically copying between CRs and integers would be needed anyway for the simple MVP.
Comment 46 Luke Kenneth Casson Leighton 2021-01-12 00:49:44 GMT
appreciated the info jacob.  as a reminder: the purpose of this bugreport is not the solicitation of USD 250,000 and above to support the full required feature set that will take 2 years to complete.

this is not happening: any efforts we make to focus on that, or depend on that, are guaranteed to jeopardise the entire project.

we specifically need something that can be used for the next 8-12 months, requiring the absolute minimum work, that at just above assembly level provides developers, unit-test and HDL writers and more with the absolute bare minimum needed to do their jobs.

if it gets thrown away in 8-12 months *we do not care*, it will have served its purpose.

on that basis a *temporary* intrinsic representation of CR-mask-vectors, one that is explicit and allows just-above-assembly to be explicitly written, fits the reality of what we can get funding for, where full gcc support cannot.

fitting in with the CR register as a straight linear bitvector, if the alternate CR mask intrinsic representation avoids the 1st 8 CRs it kinda effectively becomes a separate regfile.  we *call* it "the same regfile" but conceptually it is not.

in this way the fact that CR0-7 is seen as CR0.eq CR0.lt...  CR1.eq... where the Vector versions are CR0.eq CR1.eq CR2.eq is effectively irrelevant, resolved by the register allocator, which requires a very weird modulo-4 bitlevel view of CRs.

at the point where the USD 250k funding needed for "full" gcc support is needed including in the frontend all this can be reevaluated.

therefore the frontend is declared - aside from the explicit intrinsics - completely out of scope.

right now - with some considerable urgency - we need to focus on what can be achieved with 10-14 weeks work absolute max, around EUR 10-12k.

how can this be done?
Comment 47 Luke Kenneth Casson Leighton 2021-01-12 01:39:30 GMT
hmm. i have a thought.  bear with me.

* the goal is to get working vectorised assembly with the absolute bare minimum modifications to gcc
* a suite of scalar code, making use silently of CRs, should therefore also work once vectorised.... *without modifying gcc*
* therefore when any one line of scalar code is marked as "vectorised" the CR operations behind it must *also mirror the same behaviour without modification*

thus any code that creates a CR or moves a CR must have the EXACT same svp64 prefix and behave exactly the same as the scalar CR version.

this then *defines* how we must number and lay out the Vectorised CRs.

namely: the numbering - sigh - needs to be in columns, not rows.

  CR0 CR1 CR2  CR3  CR4  CR5 CR6 CR7
  CR8 CR9 CR10 CR11 CR12

when Vectorised the increments 0..VL-1 go CR0 CR8 CR16 CR24 **NOT** repeat **NOT** CR1 CR2 CR3 CR4

this was the "Matrix" idea that i outlined would be absolute hell to implement the DMs for.

sigh.

however it would ensure that, for scalar code created with scalar CRs (CR0 to CR7 being ANDed and ORed etc.), when the integer expression that generates CR0 gets Vectorised then, as long as all CR operations associated with CR0 are also Vectorised and simply propagate the s/v attribute, it is *not* necessary to do a massive redesign of gcc.

i hope i am making sense here.

basically we customise the hardware to suit gcc, not the other way round.  that's what Tom Forsyth was on about.

now, the only problem is: 64 CRs results in wrapping far too quickly.

CR8 CR16 CR24 CR32 CR40 CR48 CR56 whoops we have to go to CR0 next.

this places an artificial limit on the length of MAXVL that can be used without serious modifications to gcc.

if however we increase to 128 CRs then MAXVL can go up to 16 without wrapping when "nominally scalar" code, referring to CR1 and not knowing it's a Vector, actually operates on 16 CRs CR1 CR9 CR17 ... CR(1+15*8)

being able to do Vectors up to 16 in length with zero significant code modifications to gcc and yet still be able to write just above bare metal assembler *and* not need USD 250k VC funding is a pretty damn good deal.
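the column-wise numbering above can be modelled in a few lines (this is a sketch of the scheme being proposed, not of any existing gcc or hardware code):

```cpp
// hypothetical model of column-wise CR numbering: element i of a
// vector anchored at scalar CRn occupies CR(n + 8*i), wrapping at
// the total number of CRs (64 today, 128 in the proposal above)
int vec_cr_num(int base_cr, int element, int num_crs) {
    return (base_cr + 8 * element) % num_crs;
}
```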
Comment 48 Jacob Lifshay 2021-01-12 02:20:06 GMT
If we want something we can just throw away later, what about writing the bare minimum to support vectors in inline assembly with register allocation and just write a C++ class that wraps inline assembly? With inlining, it should be quite efficient. I could write the C++ class in a few days of work.

This would require compiling code that uses SV in C++ mode, which shouldn't be that difficult to achieve.

We could have the backend abort if it detects any functions left over that try to pass vectors as arguments or return values, making it so we wouldn't need to implement a function call ABI.
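a very rough sketch of what such a wrapper class could look like; here a plain scalar loop stands in where the real class would emit SVP64 inline assembly, so only the interface shape is meaningful (every name, and the setvl-style clamping, is an assumption):

```cpp
#include <cstdint>
#include <cstddef>
#include <algorithm>

// hypothetical throwaway wrapper: fixed MAXVL, active length vl set
// by a setvl analogue; the real class would replace the loop bodies
// with SVP64 inline asm
template <size_t MAXVL>
struct SvVecI16 {
    int16_t v[MAXVL] = {};
    size_t vl = 0;

    static SvVecI16 load(const int16_t *p, size_t n) {
        SvVecI16 r;
        r.vl = std::min(n, MAXVL);   // assumed setvl semantics
        for (size_t i = 0; i < r.vl; i++) r.v[i] = p[i];
        return r;
    }
    void store(int16_t *p) const {
        for (size_t i = 0; i < vl; i++) p[i] = v[i];
    }
    SvVecI16 sat_add(const SvVecI16 &o) const {
        SvVecI16 r; r.vl = vl;
        for (size_t i = 0; i < vl; i++) {
            int32_t s = (int32_t)v[i] + (int32_t)o.v[i];
            r.v[i] = (int16_t)std::min(32767, std::max(-32768, s));
        }
        return r;
    }
};
```

with inlining enabled, each method call would collapse to its (eventual) asm body, which is where the expected efficiency comes from.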
Comment 49 Luke Kenneth Casson Leighton 2021-01-12 05:13:34 GMT
(In reply to Jacob Lifshay from comment #48)
> If we want something we can just throw away later, what about writing the
> bare minimum to support vectors in inline assembly with register allocation
> and just write a C++ class that wraps inline assembly?

yyyeahhh i like it. it's not exactly what i had in mind for this bugreport, however i can see it has value above trying to write in bare .S files.

we still need something that is a half-way house (like the RVV intrinsic patch): not bare metal assembler, not full autovectorised gcc either.

> With inlining, it
> should be quite efficient. I could write the C++ class in a few days of work.
> 
> This would require compiling code that uses SV in C++ mode, which shouldn't
> be that difficult to achieve.

... it has some subtle implications.  c code compiled with g++ fails when -1 is placed into a uint, rather than just going, "oh you must have meant 0xffffffff i'll just take care of that for you".

have to ask Lauri if he's ok with it.

i can see it still being valuable though.
Comment 50 Jacob Lifshay 2021-01-12 07:12:07 GMT
(In reply to Luke Kenneth Casson Leighton from comment #49)
> (In reply to Jacob Lifshay from comment #48)
> > This would require compiling code that uses SV in C++ mode, which shouldn't
> > be that difficult to achieve.
> 
> ... it has some subtle implications.
yup.

> c code compiled with g++ fails when -1
> is placed into a uint, rather than just going, "oh you must have meant
> 0xffffffff i'll just take care of that for you".

That's not correct:
https://gcc.godbolt.org/z/n1vKEM

> have to ask Lauri if he's ok with it.

hope so, since it shouldn't be that hard to use... also, reading the source to see exactly which instructions are generated and/or adding new ones should be very easy compared to having to read through gcc's internals.

> i can see it still being valuable though.

+1
Comment 51 Luke Kenneth Casson Leighton 2021-01-12 13:50:03 GMT
(In reply to Jacob Lifshay from comment #50)

> > c code compiled with g++ fails when -1
> > is placed into a uint, rather than just going, "oh you must have meant
> > 0xffffffff i'll just take care of that for you".
> 
> That's not correct:
> https://gcc.godbolt.org/z/n1vKEM

it was 12 years ago.  i can't remember the details.
Comment 52 Luke Kenneth Casson Leighton 2021-01-13 18:07:38 GMT
sigh.  i updated the svp64 appendix to describe the CRs: it's a pig.  also rather than use an underscore it occurred to me that a decimal point "does the job" in an intuitive way.  the problem is that in a Vector of CRs some are *not accessible* as Scalars: a predicated mv is needed.  but if we want something that's doable for gcc without needing a year+ of work, this is how it goes.
Comment 53 Jacob Lifshay 2021-01-13 19:04:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #52)
> sigh.  i updated the svp64 appendix to describe the CRs: it's a pig.  also
> rather than use an underscore it occurred to me that a decimal point "does
> the job" in an intuitive way.  the problem is that in a Vector of CRs some
> are *not accessible* as Scalars, a predicated mv is needed.  but if we want
> something that's doable for gcc without needing a year+ of work, this is how
> it goes.

Congrats! Now, if you also apply something similar to that to int/fp registers, then you will have implemented basically what I proposed in bug #553.

One difference is int/fp registers (but not CR fields) go like (when counting in vector element order):
r0, r32, r64, r96,
r1, r33, r65, r97,
r2, r34, r66, r98 -- wrapping around after 4 instead of 8 registers.

for r<N> the element-order index = ((N & 0b11111) << 2) | ((N & 0b1100000) >> 5)

CR fields go like (matching what you described):
CR0, CR8, CR16, CR24, CR32, CR40, CR48, CR56,
CR1, CR9, CR17, CR25, CR33, CR41, CR49, CR57 -- wrapping around after 8 registers

for CR<N> the element-order index = ((N & 0b111) << 3) | ((N & 0b111000) >> 3)
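The two index formulas above can be sanity-checked with a short Python sketch (function names are mine, purely illustrative):

```python
def int_fp_element_index(n):
    # element-order index for GPR/FPR r<N>: the low 5 bits shift up,
    # the two high bits (which bank of 32) become the low bits
    return ((n & 0b11111) << 2) | ((n & 0b1100000) >> 5)

def cr_element_index(n):
    # element-order index for CR<N>: the low 3 bits shift up,
    # the three high bits (which bank of 8) become the low bits
    return ((n & 0b111) << 3) | ((n & 0b111000) >> 3)

# first int/fp vector row: r0, r32, r64, r96 -> elements 0..3
assert [int_fp_element_index(r) for r in (0, 32, 64, 96)] == [0, 1, 2, 3]
# second row: r1, r33, r65, r97 -> elements 4..7
assert [int_fp_element_index(r) for r in (1, 33, 65, 97)] == [4, 5, 6, 7]

# first CR row: CR0, CR8, ..., CR56 -> elements 0..7
assert [cr_element_index(c) for c in range(0, 64, 8)] == list(range(8))

# both mappings are permutations: every register appears exactly once
assert sorted(int_fp_element_index(r) for r in range(128)) == list(range(128))
assert sorted(cr_element_index(c) for c in range(64)) == list(range(64))
```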

Another difference is the special case for svp64 int/fp/cr extra2/extra3 decoding is removed -- they are changed to always decode like they currently do for scalars:
int/fp extra2:
| R\*\_EXTRA2 | Mode   | Range     | MSB down to LSB |
|-------------|--------|-----------|-----------------|
| 00          | Scalar | `r0-r31`  | `0b00 RA`       |
| 01          | Scalar | `r32-r63` | `0b01 RA`       |
| 10          | Vector | `r0-r31`  | `0b00 RA`       |
| 11          | Vector | `r32-r63` | `0b01 RA`       |
int/fp extra3:
| R\*\_EXTRA3 | Mode   | Range      | MSB down to LSB |
|-------------|--------|------------|----------------|
| 000         | Scalar | `r0-r31`   | `0b00 RA`      |
| 001         | Scalar | `r32-r63`  | `0b01 RA`      |
| 010         | Scalar | `r64-r95`  | `0b10 RA`      |
| 011         | Scalar | `r96-r127` | `0b11 RA`      |
| 100         | Vector | `r0-r31`   | `0b00 RA`      |
| 101         | Vector | `r32-r63`  | `0b01 RA`      |
| 110         | Vector | `r64-r95`  | `0b10 RA`      |
| 111         | Vector | `r96-r127` | `0b11 RA`      |
cr extra2:
| R\*\_EXTRA2 | Mode   | 7..5  | 4..2    | 1..0    |
|-------------|--------|-------|---------|---------|
| 00          | Scalar | 0b000 | BA[4:2] | BA[1:0] |
| 01          | Scalar | 0b001 | BA[4:2] | BA[1:0] |
| 10          | Vector | 0b000 | BA[4:2] | BA[1:0] |
| 11          | Vector | 0b001 | BA[4:2] | BA[1:0] |
cr extra3:
| R\*\_EXTRA3 | Mode   | 7..5  | 4..2    | 1..0    |
|-------------|--------|-------|---------|---------|
| 000         | Scalar | 0b000 | BA[4:2] | BA[1:0] |
| 001         | Scalar | 0b001 | BA[4:2] | BA[1:0] |
| 010         | Scalar | 0b010 | BA[4:2] | BA[1:0] |
| 011         | Scalar | 0b011 | BA[4:2] | BA[1:0] |
| 100         | Vector | 0b000 | BA[4:2] | BA[1:0] |
| 101         | Vector | 0b001 | BA[4:2] | BA[1:0] |
| 110         | Vector | 0b010 | BA[4:2] | BA[1:0] |
| 111         | Vector | 0b011 | BA[4:2] | BA[1:0] |
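The int/fp EXTRA3 rule in the tables above boils down to "MSB = scalar/vector flag, low two bits extend RA to 7 bits". A sketch of that decode (function name is mine, not a proposed API):

```python
def decode_int_fp_extra3(extra3, ra):
    """Decode the proposed always-scalar-style int/fp EXTRA3 field.

    extra3: 3-bit field from the svp64 prefix
    ra:     5-bit register field from the base instruction
    Returns (is_vector, register_number).
    """
    assert 0 <= extra3 <= 0b111 and 0 <= ra <= 0b11111
    is_vector = bool(extra3 & 0b100)    # MSB selects Scalar/Vector
    reg = ((extra3 & 0b11) << 5) | ra   # low 2 bits become bits 6:5 of reg
    return is_vector, reg

assert decode_int_fp_extra3(0b000, 5) == (False, 5)    # Scalar r0-r31
assert decode_int_fp_extra3(0b011, 0) == (False, 96)   # Scalar r96-r127
assert decode_int_fp_extra3(0b101, 1) == (True, 33)    # Vector r32-r63
```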

For LE cpu byte order:
the vsx/scalar-float registers 0-31 map the lower 64-bits to f0-31 and the upper 64-bits to f32-63.
the vsx registers 32-63 (altivec regs 0-31) map the lower 64-bits to f64-95 and the upper 64-bits to f96-127.
Comment 54 Jacob Lifshay 2021-01-13 19:12:50 GMT
(In reply to Jacob Lifshay from comment #53)
> Congrats! Now, if you also apply something similar to that to int/fp
> registers, then you will have implemented basically what I proposed in bug
> #553.
> 
> One difference is int/fp registers (but not CR fields) go like (when
> counting in vector element order):
> r0, r32, r64, r96,
> r1, r33, r65, r97,
> r2, r34, r66, r98 -- wrapping around after 4 instead of 8 registers.
> 
> for r<N> the element-order index = ((N & 0b11111) << 2) | ((N & 0b1100000) >> 5)

This still works fine with microarchitectural lanes, all that happens is lanes are instead:
lane 0: r0-r31   CR0-7   CR32-39
lane 1: r32-63   CR8-15  CR40-47
lane 2: r64-95   CR16-23 CR48-55
lane 3: r96-127  CR24-31 CR56-63
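The lane assignment listed above can be written as a quick check (this is my reading of the comment, not a hardware spec):

```python
def int_fp_lane(n):
    # r0-r31 -> lane 0, r32-63 -> lane 1, r64-95 -> lane 2, r96-127 -> lane 3
    return n >> 5

def cr_lane(n):
    # CR0-7 -> lane 0, CR8-15 -> lane 1, ..., wrapping so CR32-39 is
    # back in lane 0
    return (n >> 3) & 0b11

assert [int_fp_lane(r) for r in (0, 32, 64, 96)] == [0, 1, 2, 3]
assert cr_lane(0) == 0 and cr_lane(35) == 0   # CR0-7 and CR32-39: lane 0
assert cr_lane(8) == 1 and cr_lane(40) == 1   # CR8-15 and CR40-47: lane 1
```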
Comment 55 Luke Kenneth Casson Leighton 2021-01-13 19:32:32 GMT
(In reply to Jacob Lifshay from comment #53)

> Another difference is the special case for svp64 int/fp/cr extra2/extra3
> decoding is removed -- 

that's what i ended up doing, and as i explained, unfortunately the combination of removing the different scheme for scalar and turning the sequential iteration sideways (vertical) prohibits scalar SV entirely from accessing some of the Vector registers.

with the linear straight sequential numbering and the original sv-extension scheme at least a Vector operation can take place using a lower-numbered dest register (starting between r0 and r60 or so) then any Scalar-SV or just plain v3.0B scalar operations can get access to the *full* range of results.

with the scheme that you proposed *this is not possible* because the numbering is hardcoded to zeros in the lower bits.  SV scalar is forced to engage a predicated VSELECT mv or a mv.x to copy every other vector element result into alternative locations that the scalar numbering *can* get at.

or, the REMAP system has to be deployed, which is even more expensive than mv.x or predication.

OpenPOWER v3.0B scalar instructions cannot get at them *at all*.

for CRs this is... well, it's not really ok but the alternative is worse.  if it was ok to do that "turnaround" (support horizontal *and* vertical numbering i.e. support *both* schemes) it would solve this but the DMs become far too complex.

for INT and FP it is definitely not ok.

it took me a while to remember that this was why we came up with the odd system in the first place.
Comment 56 Jacob Lifshay 2021-01-13 19:44:39 GMT
(In reply to Luke Kenneth Casson Leighton from comment #55)
> (In reply to Jacob Lifshay from comment #53)
> 
> > Another difference is the special case for svp64 int/fp/cr extra2/extra3
> > decoding is removed -- 
> 
> that's what i ended up doing, and as i explained, unfortunately the
> combination of removing the different scheme for scalar and turning the
> sequential iteration sideways (vertical) prohibits scalar SV entirely from
> accessing some of the Vector registers.
> 
> with the linear straight sequential numbering and the original sv-extension
> scheme at least a Vector operation can take place using a lower-numbered
> dest register (starting between r0 and r60 or so) then any Scalar-SV or just
> plain v3.0B scalar operations can get access to the *full* range of results.
> 
> with the scheme that you proposed *this is not possible* because the
> numbering is hardcoded to zeros in the lower bits.  SV scalar is forced to
> engage a predicated VSELECT mv or a mv.x to copy every other vector element
> result into alternative locations that the scalar numbering *can* get at.

So, to be clear, you're advocating for not using the scheme I proposed just now, or not using the scheme I proposed 18 months ago as part of the SVP for RISC-V spec?
Comment 57 Luke Kenneth Casson Leighton 2021-01-13 19:55:04 GMT
(In reply to Jacob Lifshay from comment #56)

> So, to be clear, you're advocating for not using the scheme I proposed just
> now, or not using the scheme I proposed 18 months ago as part of the SVP for
> RISC-V spec?

i'd really like to use both (dynamically), that was what the CR8x8 matrix concept was.  there is room to overload elwidth to do it... however the implications for the DMs are so complex that it would be foolish to try as a first iteration.

given that, if we *don't* use vertical numbering on CRs, we are forced instead to add a 1-year delay on the critical path, it is clearly unacceptable to use the SVP scheme... for CRs.

and given that using vertical numbering on FP and INT operations would completely cut off entire swathes of the regfile from scalar operations, forcing the use of convoluted predicated mv operations, it is clearly unacceptable to use the vertical numbering scheme... for FP and INT.

conclusion: vertical numbering for CRs (reluctantly), horizontal numbering for INT and FP.
Comment 58 Alexandre Oliva 2021-01-14 14:22:07 GMT
I haven't been able to keep up with this in detail (sorry, my attention has been temporarily diverted), but I'm a little worried about how to represent a "shuffled" CR register file map, if I get the right idea of what's being proposed.

The key concepts that GCC deals with for purposes of register allocation are requirements of instructions (constraints, as in extended asms) and modes (closely related with types).

CRs and flags in general are dealt with without caring about their internal representation.  They're abstracted into different CCmodes; on machines in which different insns can output a different set of compare results, e.g. only EQ/NE, or LT/EQ/GT, LT/EQ/GT/UN, or even carry, overflow, underflow, exceptions and whatnot, those are represented as different CCmodes, when applicable, in an analogous way to integral of floating-point modes, in which wider ones can carry more information, be it precision/mantissa, be it exponent.

It's all modelled abstractly, as if the condition code register held the result of the compare rather than whatever bits the underlying hardware uses, and then, when the conditions get to be used, the mnemonics are selected based on the kind of compare result we're interested in, and the compiler remains blissfully unaware of the condition register internal representation.

So you won't see anything in GCC that cares that CRs are 4-bits wide and use one bit for EQ, one for LT, one for GT, and one for UN, in whatever order that is.  This solves some potential problems for us, because endianness of those bits is not an issue.  

There's nothing in the IR that enables reinterpretation of CR bits as an integral quantity, or vice-versa.  Indeed, CCmodes generally do not pass the TARGET_MODES_TIEABLE_P predicate with other modes, meaning you cannot reinterpret a CCmode "quantity" in a register as another mode, as you often can reinterpret a wide integral mode as a narrow one, and vice-versa, when the machine, the ABI and the compiler keep them extended under uniform conventions.

Now, the problem with "shuffled" register ordering is that the controls GCC uses to tell how modes and registers related are TARGET_HARD_REGNO_MODE_OK, that tells whether a quantity in a given machine mode can be held in a given register, and TARGET_HARD_REGNO_NREGS, that tells how many *consecutive* registers are needed to hold that mode, starting at a given register.

In order for wider-than-register modes to be held in a set of registers, those registers *have* to be contiguous in GCC's internal notion of the register file.  It is sometimes the case that the contiguity is not relevant for the architecture, e.g., if there isn't any opcode that operates on pairs of registers holding a double-word value, but these often appear when a pair of consecutive registers holds a double-precision floating-point value, or a widening multiply necessarily sets a pair of neighbor registers.  When this happens, the order of registers in the abstract register file in the compiler has to match the order and the grouping required by the machine, otherwise the allocation won't get things right.

When it comes to vectors of gprs and fprs, we didn't have the problem I'm concerned about: the vector modes can just require N contiguous registers, and since they appear as neighbors in the abstract register file, that works just fine.  Unlike other wide types, the WORDS_BIG_ENDIAN predicate doesn't affect the expected significance of partial values split across multiple registers in vector types, so we're fine in this regard.

However, if there are opcodes that require different groupings or orderings of CRs, there will be a representation problem.  E.g., if we need CR12 to be right next to CR4 because of some opcode that takes a pair of CRs by naming CR4 and affecting CR4 and CR12 as a V2CC quantity, they'd have to be neighbors for this V2CC allocation to be possible.  But if in other circumstances we use say a V8CC quantity starting at CR0 to refer to CR0..7's 32 bits, then those 8 CRs would have to be consecutive in the register file, without room for CR12 after CR4.

So please be careful with creative register ordering, to avoid creating configurations that may end up impossible to represent without major surgery in the compiler.

Also, keep in mind that, even if some configurations might be possible to represent with the knobs I mentioned above, the rs6000/powerpc port has a huge legacy of variants, so whatever we come up with sort of has to fit in with *all* that legacy.  E.g., IIRC 32-bit ppc variants have long used consecutive 32-bit FPRs for (float+float) double-precision-ish values, and consecutive 32-bit GPRs to hold 64-bit values.  There were ABI requirements to that effect, that required the abstract register file in the compiler, and also that in debug information, to use the register ordering implied by the architecture.  If we were to require the introduction of intervening registers, for purposes of vectorization, between registers that such old arches need as neighbors, insurmountable conflicts will arise.
Comment 59 Jacob Lifshay 2021-01-14 18:53:27 GMT
As mentioned in bug #553 comment #3, I think Alexandre gave a good enough reason to avoid the reordered registers.
Comment 60 Luke Kenneth Casson Leighton 2021-01-14 19:25:52 GMT
preamble 1: this is so long that i am going to need to do a followup summary.  this alone may take me half an hour.  patience appreciated.

preamble 2

found this:
https://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html

which is not hugely informative but at least gives hints

and this:
https://gcc.gnu.org/onlinedocs/gccint/Machine-Independent-Predicates.html#Machine-Independent-Predicates

which shows that lt/gt/le... etc (BO Branch style CR tests) are representable in CRs.

preamble 3: the following is very relevant, and important to note that where RVV gcc work leads, we will have an easier (less expensive) time following.

https://www.embecosm.com/2018/09/09/supporting-the-risc-v-vector-extension-in-gcc-and-llvm/
https://gcc.gnu.org/legacy-ml/gcc/2018-09/msg00037.html

BUT... it may be the case that gcc is simply "left behind" with the primary focus being on LLVM.  instead, we may have to be the ones that take the initiative.



(In reply to Alexandre Oliva from comment #58)
> I haven't been able to keep up with this in detail (sorry, my attention has
> been temporarily diverted), but I'm a little worried about how to represent
> a "shuffled" CR register file map, if I get the right idea of what's being
> proposed.

[appreciated your concerns.  for context here: we need your input to determine, rather quickly, if the CR remap is viable.  i am USD 8,000 in debt on credit cards and NLnet does not donate for time, only for results.  i apologise that this puts us under pressure, i.e. we need to make fast pragmatic decisions, even if they turn out mid-to-long-term to have been wrong.]

providing some perspective: it is and it isn't shuffled.  one way to imagine it is:

* thing indexed 0 has 2 names: CR0 and CR0.0
* thing indexed 1 has 1 name: CR0.1
* ...
* thing indexed 15 has 1 name: CR0.15
* thing indexed 16 has 2 names: CR1 and CR1.0
* thing indexed 17 has 1 name: CR1.1
* ...
* thing indexed 127 has 1 name: CR7.15

thus it is still sequential and linear, it's just that the scalar v3.0B registers have been placed into positions that are now every 16th in the sequence.

if on the other hand you think of it as being grouped linearly, with all the CRn.0 sequentially numbered first, followed by all the CRn.1, etc. (CR0.0 CR1.0 ... CR7.0 as indexes 0-7, then CR0.1 CR1.1 ... as indexes 8-15, and so on), *then* it is discontiguous, and yes, it would cause misunderstandings and a lot of trouble (and, as you point out below, be unworkable for Vectors)

it was 9 days before i was able to grasp this difference fully and that alone is of serious concern.

if the conceptual numbering of scalar CRs can indeed be simply shifted up 4 bits, given a different name format (CRn.MM) that binutils recognises and sorts out, would that be viable?
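For concreteness, the CRn.MM naming as laid out in the indexed list above (my reading: sequential index = 16*n + m, with the scalar v3.0B name CRn as an alias for CRn.0) can be sketched as:

```python
def cr_name_to_index(n, m):
    # CRn.m sits at position 16*n + m in the sequential register file;
    # CRn (the scalar v3.0B name) is an alias for CRn.0
    assert 0 <= n <= 7 and 0 <= m <= 15
    return 16 * n + m

assert cr_name_to_index(0, 0) == 0      # CR0 == CR0.0
assert cr_name_to_index(0, 15) == 15    # CR0.15
assert cr_name_to_index(1, 0) == 16     # CR1 == CR1.0
assert cr_name_to_index(7, 15) == 127   # CR7.15

# the scalar names land on every 16th index
assert [cr_name_to_index(n, 0) for n in range(8)] == list(range(0, 128, 16))
```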


> The key concepts that GCC deals with for purposes of register allocation are
> requirements of instructions (constraints, as in extended asms) and modes
> (closely related with types).
> 
> CRs and flags in general are dealt with without caring about their internal
> representation.  
> [ ... ]

interesting. and that's why they don't end up in the frontend because of the abstraction.  you'd need to bring out *CCmodes* into the frontend in order to apply __attribute__ to them, and that's clearly not going to be happening.

> It's all modelled abstractly, as if the condition code register held the
> result of the compare rather than whatever bits the underlying hardware
> uses,

sounds sensible to me.


> So you won't see anything in GCC that cares that CRs are 4-bits wide and use
> one bit for EQ, one for LT, one for GT, and one for UN, in whatever order
> that is.  This solves some potential problems for us, because endianness of
> those bits is not an issue.  

also elwidth overrides are meaningless so don't enter into the conversation at all.
 
> There's nothing in the IR that enables reinterpretation of CR bits as an
> integral quantity, or vice-versa.

*click*... this may cause problems for what i called "crweird" instructions.

ah wait... there is precedent: isel and setb.  these interact to select/set INT regs based on a CCode (CR).

to give some context: the crweird instructions are a way to transfer CRs-as-predicates into scalar INTs and vice-versa.  we need these so as not to have to add duplicate instructions (same functionality, one on CRs one on INTs)

    mtcrweird: RA, BB, mask.mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering sigh
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    CR{BB}.eq = n0
    CR{BB}.lt = n1
    CR{BB}.gt = n2
    CR{BB}.ov = n3

you can see if that is Vectorised it is intended to put arithmetic bit 1 of RA into CR{BB+1} etc etc etc.

hence the name "weird".

now, if this *cannot be represented* in gcc we are in a bit of... um... schtuck.

one potential conceptual route is to internally "typecast" the CCodes into predication bits (including potentially the transformation process)

another would be for predication to just take a range of CR-vectors, say to the register allocator, "MINE! hands off!" and the CCodes side never talks to the predication side.



>  Indeed, CCmodes generally do not pass the
> TARGET_MODES_TIEABLE_P predicate with other modes, meaning you cannot
> reinterpret a CCmode "quantity" in a register as another mode, as you often
> can reinterpret a wide integral mode as a narrow one, and vice-versa, when
> the machine, the ABI and the compiler keep them extended under uniform
> conventions.

ah.  small diversion needed, come back to numbering in a minute.

we do really need the ability to consider CRs as predicate bits, otherwise if we have to use only integers we lose a huge amount of capability, and the hardware becomes either unmanageably complex or severely performance-compromised.

now, whether that's done as CCModes are also predicates in gcc or not?

question: can "CCodes-with-compares" at least be "typecast" to a new kind of CCode, a "predicate" CCode?  or to an underlying existing predicate type in gcc?

i assumed that this would be possible, at least in some fashion, even if it requires some hoops to jump through.

for the concept behind CRs-as-predicates i copied how the Branch BO field works, because Branches test CCodes to do scalar if/else in exactly the same way that predicates are used to perform vector "variants" of if/else:

    if x > y: # cmp creates CR
       # branch on CR with BO created here
       x -= 5

vector version would be:

    VectorCMP x,y # creates vector of CRs
    svp.mask=CR,BO=gt addi x.v, -5

anyway.  back to numbering.


> Now, the problem with "shuffled" register ordering is that the controls GCC
> uses to tell how modes and registers related are TARGET_HARD_REGNO_MODE_OK,
> that tells whether a quantity in a given machine mode can be held in a given
> register, and TARGET_HARD_REGNO_NREGS, that tells how many *consecutive*
> registers are needed to hold that mode, starting at a given register.

right.  and MVL, which is *very specifically only allowed to be set statically by an immediate*, defines that quantity.

it's changeable on a "per-setmvli" basis, but it *is* a static compile-time quantity.

thus the compiler may decide, at the simplest crudest level, "screw it, MVL is hardcoded to 4 or 8 or 16 and that's the end of it", which effectively turns SV into a type of brain-dead but functional predication-capable SIMD ISA, or it may be a bit more intelligent about it and decide on a per-function basis what the best allocation is, to help avoid register spill.

 
> In order for wider-than-register modes to be held in a set of registers,
> those registers *have* to be contiguous in GCC's internal notion of the
> register file. 

this is why i described the conceptual numbering for CRs as "being viewable as contiguous if you don't mind interspersing the scalar CRs every 16th index"

however, Alexandre, just a heads-up: REMAP *COMPLETELY* obliterates the expectation of linear numbering, by design.

this is something that is supported in NEON as hard-coded in LDST, called "Structure Packing", and it is also now in RVV 0.9.  typical uses include Matrix Multiplies and for getting all the RRRRR and GGGGG and BBBBB into contiguous registers where data was actually in RGBRGBRGB.

just so you know: REMAP can be applied to the *entire* ISA.  any arithmetic vector op, any MV, any LDST.


> It is sometimes the case that the contiguity is not relevant
> for the architecture, e.g., if there isn't any opcode that operates on pairs
> of registers holding a double-word value, but these often appear when a pair
> of consecutive registers holds a double-precision floating-point value, or a
> widening multiply necessarily sets a pair of neighbor registers.

yes, there are a number of instruction examples in many ISAs that support this double-op SIMD and widen/narrow, it is no surprise then that gcc has had to understand this.

it will become particularly interesting, a long LONG way down the line, how SV's polymorphic elwidth overrides end up being implemented, ultimately.

intermediary steps there will clearly have to involve avoiding different src-dest overrides on arithmetic operations initially, and using (inefficient) patterns of MVs that "mirror" the same widening-narrowing explicit opcodes typically added to SIMD architectures.

that will give breathing space to allow a full research investigation into how to add polymorphic elwidth overrides to arithmetic ops.

i mean in a generic fashion, rather than as special-cased for certain specific instructions.

this btw is going to happen rather a lot: the "abstraction" of SV means that the compromises taken by most ISAs (only certain ops have saturation, only certain ops have widen/narrow) *do not have to be taken*


>  When this
> happens, the order of registers in the abstract register file in the
> compiler has to match the order and the grouping required by the machine,
> otherwise the allocation won't get things right.

understood.  this instinctively is why i really do not like the vertical stratification.  scalar registers are no longer accessible in a contiguous block.


also this is one of the things that is making me slightly nervous about the CRn.MM numbering: it doesn't match precisely one-for-one with the INT/FP arrangement.

as in: yes you can rearrange the naming so that it *looks* contiguous, but try accessing them as scalar and it all goes sideways.  literally.

when treated as Vectors-of-results that generate corresponding Vectors-of-CRs the numbering matches.  the names are weird but the numbering matches.

however the moment you try to access those values as *scalar* despite the fact that they were just produced by an instruction just before, all hell breaks loose.

not only can you not *get* access to CR3.15 directly for example (you have to insert a predicated MV operation to copy it to CR3 or CR3.8 for example) you have to run a calculation to work out the FP/INT reg it's associated with.  something like:

     (idx&0b1111)<<3 | (idx&0b1110000)>>4

that's the relationship between CR numbering and INT/FP numbering.
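Taking the quoted formula at face value, a quick sketch confirms it is at least a clean bijection over the 7-bit index space (bits 0-3 move up to positions 3-6, bits 4-6 move down to positions 0-2):

```python
def cr_to_int_fp_index(idx):
    # the relationship quoted above between CR element numbering
    # and INT/FP element numbering
    return ((idx & 0b1111) << 3) | ((idx & 0b1110000) >> 4)

# every CR index maps to a distinct INT/FP index: a permutation of 0..127
assert sorted(cr_to_int_fp_index(i) for i in range(128)) == list(range(128))
assert cr_to_int_fp_index(0) == 0
```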

(no Jacob, just to emphasise again: making all INT/FP/CR numbering the same by applying the same N.MM remapping isn't ok, unfortunately, because 18 months of hardware design would have to be abandoned and rethought)


> When it comes to vectors of gprs and fprs, we didn't have the problem I'm
> concerned about: the vector modes can just require N contiguous registers,
> and since they appear as neighbors in the abstract register file, that works
> just fine. 

it's "obvious" in other words.  and, in addition, once a Vector of INT/FP results is computed, if access to those is required explicitly by scalar then as long as the Vector was kept to the lower half of the regfile they are also accessible directly *and accessible linearly as well*.

* Vector add may start at r0, r4, r8, ...
   r120, r124 and progresses linearly.
   vector at r0 progresses r0 r1 r2 r3 r4

* Scalar access may be at any of r0-r63
  so it is only the upper range of Vectors
  that cannot be accessed.
  (without a Vector mv, that is)

CRs on the other hand:

* Vector CRs may start at CR0.0 CR0.8
  CR1.0 CR1.8 ... CR7.0 CR7.8
  vectors progress CR0.0 CR0.1 CR0.2

* Scalar access MAY NOT even refer AT ALL
  to CR1.1, CR1.2 throughout the FULL
  RANGE of the regfile, ALL the way to
  CR7.2 and CR7.15.

this discontiguity is why the "slightly weird" algorithm of treating scalar numbering differently from Vector was added.

> Unlike other wide types, the WORDS_BIG_ENDIAN predicate doesn't
> affect the expected significance of partial values split across multiple
> registers in vector types, so we're fine in this regard.

ok.

> However, if there are opcodes that require different groupings or orderings
> of CRs, there will be a representation problem.  E.g., if we need CR12 to be
> right next to CR4 because of some opcode that takes a pair of CRs by naming
> CR4 and affecting CR4 and CR12 as a V2CC quantity, they'd have to be
> neighbors for this V2CC allocation to be possible. 

right.  well, the only "fly in the ointment" is mfcr and the fxm version when used with multiple bits (which i think i'm right in saying you're not supposed to do but all hardware supports it).

the example involving CR4 and CR12 is actually realistic when CR12 is "renamed" correctly to CR4.1 (4+8=12)

accessing CR4 and CR4.1 *can* be done under a Vector op. they are contiguous.

they can NOT repeat NOT be accessed sequentially via a SCALAR op.  CR4.1 is not even accessible AT ALL.

CR4 and CR4.8 *would* be accessible contiguously via scalar, but not CR4.1



> But if in other
> circumstances we use say a V8CC quantity starting at CR0 to refer to
> CR0..7's 32 bits, then those 8 CRs would have to be consecutive in the
> register file, without room for CR12 after CR4.

the idea is - was - i have already convinced myself it's a bad idea - that a V8CC would be "CR0.0 CR0.1 ... CR0.7"


 
> So please be careful with creative register ordering, to avoid creating
> configurations that may end up impossible to represent without major surgery
> in the compiler.

by going through it, above, i've basically convinced myself it's not just a bad idea to do vertical sequencing, it's a *really* bad idea.
 
> Also, keep in mind that, even if some configurations might be possible to
> represent with the knobs I mentioned above, the rs6000/powerpc port has a
> huge legacy of variants, so whatever we come up with sort of has to fit in
> with *all* that legacy.  E.g., IIRC 32-bit ppc variants have long used
> consecutive 32-bit FPRs for (float+float) double-precision-ish values, and
> consecutive 32-bit GPRs to hold 64-bit values.  There were ABI requirements
> to that effect, that required the abstract register file in the compiler,

wait.... whuuu???

oh god... this is the VSX/Altivec thing isn't it?  where the INT/FP regfiles are combined then "recast" to a 32-entry 128 bit SIMD regfile, something like that?

please please bear in mind we are doing *nothing* like that!  we did consider it (a long while back), to basically merge the FP and INT regfiles on top of each other.

to reiterate and emphasise: we are going *nowhere near* VSX, which i consider to be a harmful legacy ISA, good as it was in 2001 it's time for it to be retired, and if we do ever "support" it in 2 or more years time it will be under serious protest and with absolute bare minimum attention, resources, performance and impact on the existing HDL.

there is a reason why NXP has abandoned OpenPOWER, and that reason is: VSX.

in technical terms anything that you "learn" from VSX, anything involving regfile typecasting such as the one for rs6000, these truly need to be set aside.

unlike rs6000:

* the SV INT regfile is polymorphic on
  *elwidth* not the type (except int and
  logical, exactly as in v3.0B, long
  before SV existed)

* likewise the FP regfile is polymorphic
  on width, but may NOT be typecast to INT
  (or logical ops)

you CANNOT put a raw integer into an FP register or vice-versa, then have it fed to either the FP or INT pipeline as if it belonged in the other.

in SV, exactly as with v3.0B:

* GPR is GPR, FPR is FPR.
* GPR operations are only possible on the GPR regfile
* FP operations are only possible on the FP regfile

rs6000, as i understand it: if you perform an INT VSX operation on VSX registers numbered in a certain range, the result is stored in the *SCALAR* FP regfile, correct?

i ask because we are NOT repeat NOT doing that in SV.  considered it.  rejected it.
 


> and also that in debug information, to use the register ordering implied by
> the architecture.  If we were to require the introduction of intervening
> registers, for purposes of vectorization, between registers that such old
> arches need as neighbors, insurmountable conflicts will arise.

i am not quite following, inasmuch as, due to SIMD being harmful and the sheer overwhelming quantity of opcodes involved, there is no intention in my mind to support any of VSX, and if we do it will be as seriously (and very deliberately) performance-compromised, so badly that software developers will work very hard to avoid using VSX entirely.

with that in mind i am slightly confused.  are you saying:

* "if there is any intention to support VSX in *addition* to SV it will be difficult to do so"

if so that is not in the slightest bit a problem because if there is even the tiniest chance that SV is compromised by VSX, then, metaphorically and clinically, VSX gets shot in the head: problem goes away.

* "even if you DON'T intend to "support" VSX, the way that gcc support for legacy hardware such as rs6000 is written, SV will *still* be challenging, despite SV not even being close, at all, to how VSX works."

the former is not a problem at all (VSX is a harmful SIMD ISA, it is a real easy choice to say "goodbye VSX")

the latter would, *deep breath*, require some further investigation.

i hope the former.
Comment 61 Jacob Lifshay 2021-01-14 19:47:09 GMT
(In reply to Luke Kenneth Casson Leighton from comment #60)
> > Also, keep in mind that, even if some configurations might be possible to
> > represent with the knobs I mentioned above, the rs6000/powerpc port has a
> > huge legacy of variants, so whatever we come up with sort of has to fit in
> > with *all* that legacy.  E.g., IIRC 32-bit ppc variants have long used
> > consecutive 32-bit FPRs for (float+float) double-precision-ish values, and
> > consecutive 32-bit GPRs to hold 64-bit values.  There were ABI requirements
> > to that effect, that required the abstract register file in the compiler,
> 
> wait.... whuuu???
> 
> oh god... this is the VSX/Altivec thing isn't it? 

Nope.

> where the INT/FP regfiles
> are combined then "recast" to a 32-entry 128 bit SIMD regfile, something
> like that?

VSX doesn't do that.

Back to ABI stuff...
successive floating-point registers are used to store IBM's special (super annoying) double-double form of long double (which I think should be relegated permanently to the history books and 128-bit IEEE float used instead, but we need to support what legacy programs expect...). They are also used for float/double complex numbers IIRC, where the first register stores the real half and the second register stores the imaginary half.

Just think of what mess can be achieved with a complex double-double number... XD.

I think they're also used for passing by-value structs with 2 float/double fields; similarly, but with int regs, for structs with 2 integer (one of char, int, long, etc.) fields.
Comment 62 Jacob Lifshay 2021-01-14 19:48:50 GMT
Oops, forgot to trim context...sorry.
Comment 63 Alexandre Oliva 2021-01-14 20:33:36 GMT
What Jacob said.  I may have misremembered the double+double long double thing as float+float double.  But pairs of (neighboring) 32-bit GPRs for 64-bit values in 32-bit mode are definitely a thing, at least ABI-wise.

Anyway, the main point I wanted to warn against was the introduction of new registers between preexisting registers that are used in ways that require them to remain contiguous.  That would be a very serious difficulty.  It doesn't look like CR#.## numbering does that, because I don't see that CRs are used contiguously anywhere.  Debug info numbering might need remapping, but that's not a big deal.

A big deal would be having (possibly legacy) opcodes/ABIs that require registers, from any register file, to be grouped and used together in a way that required them to be contiguous in a certain way, while having other (new) opcodes that require them to be grouped in ways that impose different contiguity requirements, because it would be impossible to satisfy both at the same time.  That's what I was trying to avoid, and that's why I mentioned the main GCC knobs to configure the parameters for register allocation.

As for vector CC modes, as long as you use the CC modes as outputs to vector compares, and then use them for predication of vectors of the same length, this will be (I believe) no different from existing uses of CRs as outputs to scalar compare insns and inputs to conditional moves (and conditional branches, but those won't take vectors, I hope ;-)

GCC's "movecc_<mode>" template insn in rs6000.md covers even moving the various CCmodes from and to memory, through GPRs, but it's still the case that GCC has no clue about the bit patterns that represent the different values represented in these modes.  That's left entirely up to the architecture to choose, and implement consistently between CR-setting and CR-using insns.

In order to use vector CRs as predicates with predefined values, rather than as ones computed by vector compares, or by vector insns that set vector CRs as side effects, the programmers who wish to use such predicates will either have to figure out how to get the right bit patterns loaded into the CR vectors, or set up other vectors and perform operations between them that set the vectors accordingly.

Since CRs are smaller than words, it is possible that the same anomaly that I reported elsewhere about endianness in using bitfields as predicate registers will apply to CR vectors, and users will have to use integral values that vary depending on endianness to obtain equivalent CR vector predication.
Comment 64 Luke Kenneth Casson Leighton 2021-01-15 00:49:40 GMT
briefly (more tomorrow):

* https://libre-soc.org/openpower/isa/condition/

  cr ops would be fine to Vectorise vertically

* https://libre-soc.org/openpower/isa/sprset/

  mtcrf, which has a multi-bit (8-bit mask) field, FXM, is *NOT* fine to Vectorise

  or if it is, it would have to be done as a FSM by reading "columns" of CRs, one for each "bit" in FXM.  if 5 bits were set, it would take 5 clock cycles.

however the fundamental "meaning" of mtcrf is, exactly as you describe, Alexandre, that the numbering is contiguous and sequential.  whether *internally* gcc treats it that way? i don't know.

basically i am talking us out of doing vertical CR Vector numbering

the downside of that is that even the most basic of augmentation to gcc will require it to understand CR allocations for predicates.

which leaves us pretty much screwed *except* of course there is the c++ macro idea that jacob came up with.

if Lauri really doesn't like c++ i am reasonably confident that something similar involving macros could also be dreamed up.  it'll most likely look absolutely dreadful but hey anyone who is a c programmer is used to that.

Alexandre just so you know i have asked if it is ok to reallocate one of the cancelled RISCV budgets to gcc and binutils.  this is a decision not taken lightly because it requires an unprecedented change to the MoU under which that task was cancelled, and may result in questions from the external Audit team that will need answering adequately.
Comment 65 Luke Kenneth Casson Leighton 2021-01-15 00:53:03 GMT
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/rs6000.md

holy cow that file is so long it won't load properly in a browser.
Comment 66 Luke Kenneth Casson Leighton 2021-01-15 15:12:33 GMT
(In reply to Alexandre Oliva from comment #63)
> What Jacob said.  I may have misremembered the double+double long double
> thing as float+float double.  But pairs of (neighboring) 32-bit GPRs for
> 64-bit values in 32-bit mode are definitely a thing, at least ABI-wise.

we're not doing 32 bit backwards compatibility.  i mean, we could, but not now.  aside from anything else, it interacts with "elwidth=default" which is now 32 bit, not 64.

32 bit mode can therefore be disregarded as far as Vectorisation is concerned.


> A big deal would be having (possibly legacy) opcodes/ABIs that require
> registers, from any register file, to be grouped and used together in a way
> that required them to be contiguous in a certain way, while having other
> (new) opcodes that require them to be grouped in ways that impose different
> contiguity requirements, because it would be impossible to satisfy both at
> the same time.

mtcrf throws a spanner in the works, there.
 
> As for vector CC modes, as long as you use the CC modes as outputs to vector
> compares, and then use them for predication of vectors of the same length,

ohh yeh.  no, it would be possible but too complex to change MVL in the middle.

leaving that aside...

> this will be (I believe) no different from existing uses of CRs as outputs
> to scalar compare insns and inputs to conditional moves

that's what i imagine, yes.  isel, setb, fsel.

but, caveat: these (isel, setb) are not the full/only way; they are tricks used (rarely) by scalar ISAs that do not have "real" predication built in.

given that ppc64 does not have predication *at all* it is x86, SVE2 and others that may need to be examined for clues as to how this can be done.

(adding predication to VSX has been a constant feature request made to IBM for some time)

> (and conditional
> branches, but those won't take vectors, I hope ;-)

:)  on the RV ISA list we did laugh at the idea of Vectorised Branches: effectively this becomes a way to create hyperthreaded coroutines (!)

are we implementing that? eeeehhhno.  love the idea though.

there *will* be a "Reduce Mode" on CRVectors though which will allow:

* V-results to create V-CRs
* Crunch (mapreduce) V-CRs to scalar CR
* Scalar CR can do *Standard* scalar branch test

this is "equivalent" to, but much more powerful than, VSX CR6: VL up to 64 can perform far more parallel work, in fewer instructions, than VSX.

thus it is possible, using Vector-CRops (V-crand, V-crxor), to check the status of a *batch* of V-results.


> GCC's "movecc_<mode>" template insn in rs6000.md covers even moving the
> various CCmodes from and to memory, through GPRs, but it's still the case
> that GCC has no clue about the bit patterns that represent the different
> values represented in these modes.  That's left entirely up to the
> architecture to choose, and implement consistently between CR-setting and
> CR-using insns.

fascinating.
 
> In order to use vector CRs as predicates with predefined values, rather than
> as ones computed by vector compares, or by vector insns that set vector CRs
> as side effects, the programmers who wish to use such predicates will either
> have to figure out how to get the right bit patterns loaded into the CR
> vectors, or set up other vectors and perform operations between them that
> set the vectors accordingly.

it is fairly normal to write optimised libc6 Vector/SIMD routines in assembler.  the transfers using crweird (INT-CRVec) will be required for strncpy, memcpy etc., where INT "set-before-first" opcodes are needed.

> Since CRs are smaller than words, it is possible that the same anomaly that
> I reported elsewhere about endianness in using bitfields as predicate
> registers will apply to CR vectors, and users will have to use integral
> values that vary depending on endianness to obtain equivalent CR vector
> predication.

reminder: both microwatt and libresoc do *not* internally perform dynamic endianness conversion on ALUs, or on any regfiles, of any kind.  endianness is *removed* at memory... period.

this is to preserve the HDL developers sanity.

[aside: byte-reversal on GPR-GPR interaction is now a property of REMAP that gives the "illusion" of having LE/BE GPR regfile capability.  but that is GPR-GPR not GPR-CR.]

transfers between CR regfile and GPR INT regfile are DIRECT and HARD coded to one and ONLY one endianness.  we may call this LE if it helps.

transfers between CR-Vectors and INTs are **NOT** intended to be done beyond 64 bits, because VL is limited to 64.  therefore AT NO TIME will there be transfers between CR-Vectors and INT-Vectors (hence the opcode name "crweird").

thus there will neverrrr be a problem, everrr, where endianness is a complicating factor involving CR-Vector to INT-as-predicate.

however if the developer mistakenly enables REMAP bytereversal when interacting with the INT *after* transfer out of CR-Vector that becomes their problem to sort out their incorrect program.
Comment 67 Luke Kenneth Casson Leighton 2021-01-18 13:23:50 GMT
folks i restored the horizontal sequential numbering for CRs: the cross-interference and many other factors make this necessary. i did however think it worthwhile to increase them to 128.

this means unfortunately that significant explicit internal modifications to gcc are unavoidable to get it to recognise the concept of CR-vectors: the trick of assuming that "scalar code producing a CR" may be considered to be "multiple scalar ops producing multiple CRs" still applies, it's just that there is a bit more work needed.

i am still waiting to hear from NLnet about the possibility of reassigning a budget.
Comment 68 Jacob Lifshay 2021-01-18 16:58:18 GMT
(In reply to Luke Kenneth Casson Leighton from comment #67)
> folks i restored the horizontal sequential numbering for CRs: the
> cross-interference and many other factors makes this necessary. i did
> however think it worthwhile to increase them to 128.

sounds good from a SW consistency perspective!