Bug 213 - SimpleV Standard writeup needed
Summary: SimpleV Standard writeup needed
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Alain D D Williams
URL:
Depends on: 527 529 533 534 535 552 554 555 558 559 560 561 563 564 568 569 572 127 562 570
Blocks: 174
Reported: 2020-03-09 21:59 GMT by Luke Kenneth Casson Leighton
Modified: 2021-01-18 00:47 GMT
7 users

See Also:
NLnet milestone: NLNet.2019.10.Standards
total budget (EUR) for completion of task and all subtasks: 8000
budget (EUR) for this task, excluding subtasks' budget: 8000
parent task for budget allocation: 174
child tasks for budget allocation: 527 533 535 564
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-03-09 21:59:16 GMT
The SimpleV Standard, applied to the POWER ISA, needs to be written up in a form suitable for proposal to the OpenPOWER Foundation.  Follow-through to adoption is included.
Current draft starting page: https://libre-soc.org/simple_v_extension/

evolving at:

https://libre-soc.org/openpower/sv
Comment 1 Luke Kenneth Casson Leighton 2020-08-22 18:01:22 BST
Alain, regarding our conversation: a sub-bug needs to be raised to discuss the alterations to SVPrefix.

PowerISA "forms" need to be broken into subcategories and a "prefix format" associated with each.  the openpower isatables contain the "forms" alongside the bitfields.
Comment 2 Luke Kenneth Casson Leighton 2020-08-22 18:01:56 BST
https://libre-soc.org/simple_v_extension/sv_prefix_proposal/
Comment 3 Luke Kenneth Casson Leighton 2020-08-26 10:49:03 BST
https://git.libre-soc.org/?p=libresoc-isa-manual.git;a=blob;f=powerpc-add/src/SVPrefix.tex;hb=HEAD#l133

this table needs replacing with

https://libre-soc.org/openpower/isatables/fields.text

then each "form" classified and given an appropriate prefix.
Comment 4 Luke Kenneth Casson Leighton 2020-08-28 12:58:00 BST
https://git.libre-soc.org/?p=libresoc-isa-manual.git;a=blob;f=powerpc-add/src/SVPrefix.tex;hb=HEAD

it occurred to me that if we were to do SVPrefix by examining the fields.txt
and the Forms, we would literally be at it forever.

the technique developed by the microwatt team, to create micro-ops, has *already*
done that analysis, and we hold that information in CSV files.

therefore if we instead do the analysis of which prefix is needed based on
the *CSV* files we save a vast amount of time.

it may also be possible to partially (or completely) automate the process
of determining the prefixes.
Comment 5 Jacob Lifshay 2020-08-28 14:47:03 BST
the chosen prefixes should mesh well with the 64-bit instructions that were added in v3.1 of the spec.
Comment 6 Luke Kenneth Casson Leighton 2020-08-28 15:25:08 BST
(In reply to Jacob Lifshay from comment #5)
> the chosen prefixes should mesh well with the 64-bit instructions that were
> added in v3.1 of the spec.

unfortunately that is very unlikely.  we need *eight* major opcodes in order to fit POWER-Compressed, SVP32, SVP48, SVP64 and VBLOCK.

before the v3.1 spec additions there were only 8 spare major opcodes.

the 6 bits of POWER major opcodes only leaves 10 bits for SVP32 and SVP48. this is not enough: we need 11.

or, we allow opcode 1 but sacrifice POWER-Compressed which puts it under even more pressure than it already is.

i have a much more effective scheme than v3.1 prefixing, called a "Data Pointer"
Comment 7 Cole Poirier 2020-08-28 20:12:14 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> https://git.libre-soc.org/?p=libresoc-isa-manual.git;a=blob;f=powerpc-add/
> src/SVPrefix.tex;hb=HEAD
> 
> it occurred to me that if we were to do SVPrefix by examining the fields.txt
> and the Forms, we would literally be at it forever.
> 
> the technique developed by the microwatt team, to create micro-ops, has
> *already*
> done that analysis, and we hold that information in CSV files.
> 
> therefore if we instead do the analysis of which prefix is needed based on
> the *CSV* files we save a vast amount of time.
> 
> it may also be possible to partially (or completely) automate the process
> of determining the prefixes.

(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > the chosen prefixes should mesh well with the 64-bit instructions that were
> > added in v3.1 of the spec.
> 
> unfortunately that is very unlikely.  we need *eight* major opcodes in order
> to fit POWER-Compressed, SVP32, SVP48, SVP64 and VBLOCK.
> 
> before the v3.1 spec additions there were only 8 spare major opcodes.
> 
> the 6 bits of POWER major opcodes only leaves 10 bits for SVP32 and SVP48.
> this is not enough: we need 11.
> 
> or, we allow opcode 1 but sacrifice POWER-Compressed which puts it under
> even more pressure than it already is.
> 
> i have a much more effective scheme than v3.1 prefixing, called a "Data
> Pointer"

Ooh very cool/exciting, I can't wait to see how this develops into final form.
Comment 8 Luke Kenneth Casson Leighton 2020-10-05 21:54:13 BST
a practical analysis of the OpenPOWER ISA, in order to work out the prefixes, should begin by identifying the format of opcodes.  analysing the full ISA in its "Forms" is impractical, however in the microcode format defined by microwatt it is much easier.

i therefore asked cole to collate the CSV files, strip the opcode, function unit, comment and sgl fields, sort and uniquify the results, then cross-reference those against the original CSV files to identify the "types" of instructions.

how many 2-operand instructions, how many 3-operand etc.

a crucial thing we will have to decide is what to do about CRs.  VSX auto-collates into CR7 which i am not keen on.
Comment 9 Cole Poirier 2020-10-06 02:13:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> i therefore asked cole to collate the CSV files, strip the opcode function
> unit comment and sgl fields and to sort and uniquify the results, then
> cross-reference those against the original CSV files to identify the "types"
> of instructions.

Got to the point of a 'table' with all of the insns (excluding sprs because they don't have common column names with all the other opcodes), but it seems that all rows are 'unique', which leads me to believe that I misunderstood what you meant by unique. On what metric, column, or columns should the uniqueness be tested?
 
> how many 2-operand instructions, how many 3-operand etc.
> 
> a crucial thing we will have to decide is what to do about CRs.  VSX
> auto-collates into CR7 which i am not keen on.
Comment 10 Luke Kenneth Casson Leighton 2020-10-06 12:11:38 BST
(In reply to Cole Poirier from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > i therefore asked cole to collate the CSV files, strip the opcode function
> > unit comment and sgl fields and to sort and uniquify the results, then
> > cross-reference those against the original CSV files to identify the "types"
> > of instructions.
> 
> Got to the point of a 'table' with all of the insns (excluding sprs becuase
> they dont have common column names with all the other opcodes), 

they have a selection method based on the column name (see PowerDecoder2
or more specifically DecodeInA spr_out field) however we are not
going to do parallel/vector SPR read/writes/operations.

> but it seems that all rows are 'unique',

that's why i said strip the opcode, strip the comment column, strip the
function unit column (and the sgl column).  as in: destroy / remove entirely.

oh, and now that i look at it: strip (destroy, remove) the opcode column as
well

opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,inv A,inv out,cry in,cry out,ldst len,BR,sgn ext,upd,rsrv,32b,sgn,rc,lk,sgl pipe,comment,form

-->

in1,in2,in3,out,CR in,CR out,inv A,inv out,cry in,cry out,ldst len,BR,sgn ext,upd,rsrv,32b,sgn,rc,lk,form

we probably also don't want the "form" field, or the "inv A" or "inv out"
or "rsrv" because those are cache-related not register-related.

-->

in1,in2,in3,out,CR in,CR out,cry in,cry out,ldst len,BR,sgn ext,upd,32b,sgn,rc,lk

there's just no way that after stripping all those columns there are no duplicates.

can you drop the results into the wiki, so we can take a look.
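the strip-and-uniquify step could be sketched in python along these lines (the column set is taken from the header above; the sample rows are made up purely for illustration):

```python
# columns to destroy entirely, per the instructions above
STRIP = {"opcode", "unit", "internal op", "comment", "sgl pipe",
         "form", "inv A", "inv out", "rsrv"}

def dedup_rows(rows):
    """strip the unwanted columns, then uniquify the remaining rows,
    preserving first-seen order."""
    seen, out = set(), []
    for row in rows:
        stripped = {k: v for k, v in row.items() if k not in STRIP}
        key = tuple(sorted(stripped.items()))
        if key not in seen:
            seen.add(key)
            out.append(stripped)
    return out

# illustrative rows only, not real CSV data
rows = [
    {"opcode": "31", "in1": "RA", "in2": "RB", "out": "RT"},
    {"opcode": "14", "in1": "RA", "in2": "RB", "out": "RT"},  # dup once stripped
    {"opcode": "7",  "in1": "RA", "in2": "CONST", "out": "RT"},
]
print(len(dedup_rows(rows)))  # 2 unique register profiles remain
```

the point being that once the opcode (and the other per-instruction columns) are gone, many instructions collapse into the same register-usage "shape".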
Comment 11 Cole Poirier 2020-10-06 21:38:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Cole Poirier from comment #9)

> they have a selection method based on the column name (see PowerDecoder2
> or more specifically DecodeInA spr_out field) however we are not
> going to do parallel/vector SPR read/writes/operations.

Interesting!
 
> > but it seems that all rows are 'unique',
> 
> that's why i said strip the opcode, strip the comment column, strip the
> function unit column (and the sgl column).  as in: destroy / remove entirely.

I remember, I did that, there were no duplicates...

> oh, and now that i look at it: strip (destroy, remove) the opcode column as
> well
> 
> opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,inv A,inv out,cry
> in,cry out,ldst len,BR,sgn ext,upd,rsrv,32b,sgn,rc,lk,sgl pipe,comment,form
> 
> -->
> 
> in1,in2,in3,out,CR in,CR out,inv A,inv out,cry in,cry out,ldst len,BR,sgn
> ext,upd,rsrv,32b,sgn,rc,lk,form
> 
> we probably also don't want the "form" field, or the "inv A" or "inv out"
> or "rsrv" because those are cache-related not register-related.
> 
> -->
> 
> in1,in2,in3,out,CR in,CR out,cry in,cry out,ldst len,BR,sgn
> ext,upd,32b,sgn,rc,lk
> 
> there's just no way that after stripping all those columns there are no
> duplicates.

This did work, in contrast to the more inclusive set of columns above. Reduced from 259 rows to 131.
 
> can you drop the results into the wiki, so we can take a look.

Done here: https://libre-soc.org/openpower/opcode_regs_deduped

Although I haven't done the join with the original opcode column. Do you want just that final product or both?
Comment 12 Luke Kenneth Casson Leighton 2020-10-06 22:19:02 BST
(In reply to Cole Poirier from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > (In reply to Cole Poirier from comment #9)
> 
> > they have a selection method based on the column name (see PowerDecoder2
> > or more specifically DecodeInA spr_out field) however we are not
> > going to do parallel/vector SPR read/writes/operations.
> 
> Interesting!
>  
> > > but it seems that all rows are 'unique',
> > 
> > that's why i said strip the opcode, strip the comment column, strip the
> > function unit column (and the sgl column).  as in: destroy / remove entirely.
> 
> I remember, I did that, there were no duplicates...
> 
> > oh, and now that i look at it: strip (destroy, remove) the opcode column as
> > well
> > 
> > opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,inv A,inv out,cry
> > in,cry out,ldst len,BR,sgn ext,upd,rsrv,32b,sgn,rc,lk,sgl pipe,comment,form
> > 
> > -->
> > 
> > in1,in2,in3,out,CR in,CR out,inv A,inv out,cry in,cry out,ldst len,BR,sgn
> > ext,upd,rsrv,32b,sgn,rc,lk,form
> > 
> > we probably also don't want the "form" field, or the "inv A" or "inv out"
> > or "rsrv" because those are cache-related not register-related.
> > 
> > -->
> > 
> > in1,in2,in3,out,CR in,CR out,cry in,cry out,ldst len,BR,sgn
> > ext,upd,32b,sgn,rc,lk
> > 
> > there's just no way that after stripping all those columns there are no
> > duplicates.
> 
> This did work, in contrast to the more inclusive set of columns above.
> Reduced from 259 rows to 131.
>  
> > can you drop the results into the wiki, so we can take a look.
> 
> Done here: https://libre-soc.org/openpower/opcode_regs_deduped

whatever software you used is not very good, cat | sort | uniq removed about 5 more!

it also converted numbers to floating point which we don't really want either.

> Although I haven't done the join with the original opcode column. Do you
> want just that final product or both?

given that we will need to add FP and other opcodes, a manual process is probably not a good idea.  however it is a start and we need to know what we are looking at.

oh, now i have seen it: can you also "process" the LDSTLEN field so that it is either "1" or "0"?  this will cut another... 10-20 estimated rows.
Comment 13 Luke Kenneth Casson Leighton 2020-10-06 22:26:57 BST
ah, some more:
* replace all CONST_* with CONST.
* replace all RA_OR_ZERO with RA

i am also looking up sgnext and sgn to find out if those are needed
Comment 14 Luke Kenneth Casson Leighton 2020-10-06 22:32:10 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)

> i am also looking up sgnext and sgn to find out if those are needed

is_signed and sign_extend. nope. trash those too.  and 32b.  these are all "qualifiers" that modify the results, but do not need "special treatment" by SV.
Comment 15 Luke Kenneth Casson Leighton 2020-10-06 22:44:33 BST
upd column can go too.  and CRin can be replaced with "1" or "0" and likewise CRout
Comment 16 Luke Kenneth Casson Leighton 2020-10-06 22:53:29 BST
more.... :)

* any row with SPR in it: the whole ROW can be deleted
* all regs RA RB RT RS RC can be replaced with just R or REG
Comment 17 Luke Kenneth Casson Leighton 2020-10-07 00:22:18 BST
remove carry-in column
remove carry-out column
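taken together, comments 12 to 17 amount to a normalisation pass roughly like this python sketch (field names assumed from the CSV header quoted in comment 10; the exact "NONE"/"0" conventions are guesses, and this is not the real sv_analysis.py):

```python
# columns dropped outright (comments 14, 15 and 17)
DROP_COLS = {"cry in", "cry out", "upd", "sgn ext", "sgn", "32b"}
REGS = {"RA", "RB", "RT", "RS", "RC"}

def normalise(row):
    """apply the comment 12-17 reductions to one CSV row (as a dict);
    returns None for rows that should be discarded entirely."""
    if "SPR" in row.values():          # comment 16: SPR rows are deleted
        return None
    out = {}
    for k, v in row.items():
        if k in DROP_COLS:
            continue
        if v == "RA_OR_ZERO":          # comment 13: RA_OR_ZERO -> RA
            v = "RA"
        if v.startswith("CONST_"):     # comment 13: CONST_* -> CONST
            v = "CONST"
        if v in REGS:                  # comment 16: any GPR -> R
            v = "R"
        if k == "ldst len":            # comment 12: collapse to 1/0
            v = "0" if v in ("0", "NONE") else "1"
        if k in ("CR in", "CR out"):   # comment 15: collapse to 1/0
            v = "0" if v in ("0", "NONE") else "1"
        out[k] = v
    return out
```

each reduction throws away information that SV does not need to "specially treat", so more rows collapse into the same shape.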
Comment 18 Cole Poirier 2020-10-07 02:21:12 BST
Done all of that... didn't realize you were going to be working on this concurrently, fortunately it doesn't seem like wasted effort as I appear to have very different results from yours, added mine below yours on the same wiki page.
Comment 19 Luke Kenneth Casson Leighton 2020-10-07 04:34:26 BST
(In reply to Cole Poirier from comment #18)
> fortunately it doesn't seem like wasted effort as I appear to
> have very different results from yours, added mine below yours on the same
> wiki page.

that's because you didn't remove the opcode column, which can be seen clearly in the output and also in comment #18

i messed up by removing rc=ONE. wark.

we need a program to do this job.

in the meantime can you take the latest microwatt decode1.vhdl and create, ONLY for the FP opcodes (grep FP) ONLY in the same minor group a CSV file dedicated to them.

it is a bit laborious, i tend to use vi "." (repeat last command) a lot.

we need to know the FP ops.  but do NOT add anything to existing CSV files so do NOT add the FP LOAD/STORE to major.csv for example.
Comment 20 Luke Kenneth Casson Leighton 2020-10-07 04:41:49 BST
decode_op_59_array
631
63h
https://github.com/antonblanchard/microwatt/blob/master/decode1.vhdl
Comment 21 Cole Poirier 2020-10-07 05:24:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Cole Poirier from comment #18)
>  seem like wasted effort as I appear to
> > have very different results from yours, added mine below yours on the same
> > wiki page.
> 
> that's because you didnt remove the opcode column which can be seen clearly
> in the output and also in comment #18

Nope, you misinterpreted its presence to mean I didn't drop it. Your instructions were to drop it but, after deduplicating the data, to left join with the original data to get the opcode column back. I didn't even have to left join with the whole of the original data because the index of the data was preserved through the processing and I was able to left join only on the index and the opcode column.  So during the deduplication process there was no opcode column, but since we need the opcode column for our analysis I've added it back.

> i messed up by removing rc=ONE. wark.
> 
> we need a program to do this job.

Once we have more well defined parameters for it, I agree, but right now it seems we’re still figuring it out so you can just rely on me to do it manually until it is appropriate to automate it.

> in the meantime can you take the latest microwatt decode1.vhdl and create,
> ONLY for the FP opcodes (grep FP) ONLY in the same minor group a CSV file
> dedicated to them.

Will do tomorrow, though I will prioritize the icache banging test tomorrow because today I prioritized the csv’s and didn’t get to the icache.

> it is a bit laborious, i tend to use vi "." (repeat last command) a lot.

Aha good trick, thank you.

> we need to know the FP ops.  but do NOT add anything to existing CSV files
> so do NOT add the FP LOAD/STORE to major.csv for example.

Understood. I’ll add the FP csv only to our current working page for the deduped ops so as to not interfere with the decoder.
Comment 22 Cole Poirier 2020-10-07 05:27:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> decode_op_59_array
> 631

I think it’s actually 63l, lower case ‘L’ instead of 1 (one), is that correct?

> 63h
> https://github.com/antonblanchard/microwatt/blob/master/decode1.vhdl

Thanks for the help :)
Comment 23 Luke Kenneth Casson Leighton 2020-10-07 05:37:37 BST
(In reply to Cole Poirier from comment #21)

> index and the opcode column.  So during the deduplication process there was
> no opcode column, but since we need the opcode column for our analysis I’ve
> added it back.

no, databases don't work that way. JOINs work by taking 2 tables and producing results.

remove it. 

> > i messed up by removing rc=ONE. wark.
> > 
> > we need a program to do this job.
> 
> Once we have more well defined parameters for it, I agree, but right now it
> seems we’re still figuring it out so you can just rely on me to do it
> manually until it is appropriate to automate it.

the answer's no.  adding one single row or  changing one single thing due to a mistake or trying a different approach places you and the laborious manual process on the critical path.
Comment 24 Luke Kenneth Casson Leighton 2020-10-07 14:51:47 BST
so to clarify: the single entry opcode that you created is a "RIGHT JOIN LIMIT 1" which is not useful.  we need to know ALL the opcodes that match the "primary key" (the reduced table) for ALL primary keys.

just "one of them" per primary key, not knowing which CSV file they came from, what could we do with that information? not very much.

time for a program.
Comment 25 Luke Kenneth Casson Leighton 2020-10-07 16:04:37 BST
https://libre-soc.org/openpower/sv_analysis.py

writes tables as follows: https://libre-soc.org/openpower/opcode_regs_deduped/

this produces some fascinating and regular groupings, as well as raising
the question "why is rlwimi not in the same category as rldimi".
Comment 26 Cole Poirier 2020-10-07 16:39:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> so to clarify: the single entry opcode tgat you created is a "RIGHT JOIN
> LIMIT 1" which is not useful.  we need to know ALL the opcodes that match
> the "primary key" (the reduced table) for ALL primary keys.
> 
> just "one of them" per primary key, not knowing which CSV file they came
> from, what could we do with that information? not very much.
> 
> time for a program.

Ahhh, I finally understand what you wanted. Yes I could do that by just modifying the parameter of the join, but I didn’t understand until now what you wanted. Glad you were able to quickly put together a script that did what you wanted so that I didn’t become a bottleneck :)
Comment 27 Luke Kenneth Casson Leighton 2020-10-07 18:41:04 BST
alain's currently having a look, to change sv_analysis to "count" the number of input registers rather than have them separate, in1/2/3.

cole if you can do decode_op_59_array i will do op_63l.
Comment 28 Luke Kenneth Casson Leighton 2020-10-07 18:52:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> alain's currently having a look, to change sv_analysis to "count" the number
> of input registers rather than have them separate, in1/2/3.
> 
> cole if you can do decode_op_59_array i will do op_63l.

errr it's taken me 4 minutes to do 63L and H i might as well add 59...
Comment 29 Luke Kenneth Casson Leighton 2020-10-07 19:03:29 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > alain's currently having a look, to change sv_analysis to "count" the number
> > of input registers rather than have them separate, in1/2/3.
> > 
> > cole if you can do decode_op_59_array i will do op_63l.
> 
> errr it's taken me 4 minutes to do 63L and H i might as well add 59...

done. took 2 minutes with vim :%s/s/r/g patterns.  what's missing is
the "Forms".  Cole can you go through the v3.0B PDF spec and add them?
Comment 30 Cole Poirier 2020-10-07 22:09:17 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)
> (In reply to Luke Kenneth Casson Leighton from comment #28)
> > (In reply to Luke Kenneth Casson Leighton from comment #27)
> > > alain's currently having a look, to change sv_analysis to "count" the number
> > > of input registers rather than have them separate, in1/2/3.
> > > 
> > > cole if you can do decode_op_59_array i will do op_63l.
> > 
> > errr it's taken me 4 minutes to do 63L and H i might as well add 59...
> 
> done. took 2 minutes with vim :%s/s/r/g patterns.  what's missing is
> the "Forms".  Cole can you go through the v3.0B PDF spec and add them?

Sure, just manually right?
Comment 31 Luke Kenneth Casson Leighton 2020-10-07 22:32:52 BST
(In reply to Cole Poirier from comment #30)
> > the "Forms".  Cole can you go through the v3.0B PDF spec and add them?
> 
> Sure, just manually right?

yes, no real practical other way.
Comment 32 Jacob Lifshay 2020-10-08 00:21:22 BST
My comments on jitsi call:
what about redirecting cr* to integer registers for vector compares?
kinda like Luke did with branches on RISC-V
we can put the condition code in the prefix on compare instructions
Comment 33 Luke Kenneth Casson Leighton 2020-10-08 00:55:50 BST
(In reply to Jacob Lifshay from comment #32)
> My comments on jitsi call:
> what about redirecting cr* to integer registers for vector compares?

interesting thought.. ah wait so instead of the CR going into an actual CR it goes directly into an int regfile?

so sort-of an implicit vector-mcrf?

my concern with that idea although it has merit in that we would not need to expand CR to 64 entries is, it kinda breaks the way PowerISA works.

it does have advantages so let's add it to the list of options, comparing it against simply "vectorising mcrf".

the reason for that being, CR operations are designed to operate at the bitlevel where INT operations are not.

> kinda like Luke did with branches on RISC-V
> we can put the condition code in the prefix on compare instructions

yes or it is implicit.  or, ah, we perhaps reserve a bit to say whether the vector-of-compares is to be stored in *one* CR or whether each compare is to be stored *individually*.

and if there is space (which there will likely be in the SV-P64) we can say if that single result is to be an "all", "some" or "none" of the compares.

or, it may be possible to introduce that into the vectorisation of crand/cror/etc CR logic operations, if the src is a vector and the dest a scalar.

one thing though: i am really really leaning towards completely ignoring the XER SO field for SV when it comes to CRs, and potentially giving that bit in the CRs a new much more useful purpose.  SO causes massive headaches for OoO and is so rarely used i think we can get away with it.

which, if we do that, we can potentially also repurpose the OE field in the scalar instructions as a 12th bit in SV-P48.
Comment 34 Jacob Lifshay 2020-10-08 01:19:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #32)
> > My comments on jitsi call:
> > what about redirecting cr* to integer registers for vector compares?
> 
> interesting thought.. ah wait so instead of the CR going into an actual CR
> it goes directly into an int regfile?
> 
> so sort-of an implicit vector-mcrf?
> 
> my concern with that idea although it has merit in that we would not need to
> expand CR to 64 entries is, it kinda breaks the way PowerISA works.
> 
> it does have advantages so let's add it to the list of options, comparing it
> against simply "vectorising mcrf".
> 
> the reason for that being, CR operations are designed to operate at the
> bitlevel where INT operations are not.

the idea is that the compare would produce 1 bit per vector lane and essentially directly generate a predicate mask into an integer register. For that to work, the compare would need extra bits (normally in the branch instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it should use, those bits come from the prefix.

As long as it's one bit per lane, scalar integer ops are even better than cr ops for the required bit manipulations.
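a naive python model of jacob's idea — a compare producing one result bit per vector lane, packed directly into an integer register as a predicate mask (the condition names and the lane loop are purely illustrative, not a proposed encoding):

```python
import operator

# hypothetical condition codes that would live in the prefix
CONDS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
         "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}

def vec_cmp_to_mask(a, b, cond):
    """compare two vectors lane by lane, packing one result bit per
    lane into a single integer (the predicate mask)."""
    op = CONDS[cond]
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if op(x, y):
            mask |= 1 << i
    return mask

print(bin(vec_cmp_to_mask([3, 5, 7], [4, 5, 6], "lt")))  # 0b1: only lane 0
```

the mask in the integer register is then directly usable for predicating subsequent vector ops, with no CR traffic involved.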
Comment 35 Luke Kenneth Casson Leighton 2020-10-08 06:05:35 BST
(In reply to Jacob Lifshay from comment #34)

> the idea is that the compare would produce 1 bit per vector lane and
> essentially directly generate a predicate mask into an integer register. For
> that to work, the compare would need extra bits (normally in the branch
> instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it
> should use, those bits come from the prefix.
> 
> As long as it's one bit per lane, scalar integer ops are even better than cr
> ops for the required bit manipulations.

i came up with an architectural plan to implement the hidden bitsetting in 6600 style OoO and to be honest it was a bit of a pig.

an exception in the middle required a very messy design.

CRs on the other hand by being treated as actual "real" registers respected and each given their own Dependency Matrix column are far easier to handle.

exceptions in the middle of that, no problem, just restore VL forloop where it left off.

bottom line is that PowerISA has condition registers which store results that you then decide which bits to test to make different branches, i.e. the compare is separated from the branch *by* the CR.

this is conceptually similar to RV FP compare except it wastes an entire 64 bit int reg to do it (RV FP cmp stores 1 or 0 in an int reg for FP GT/LT/LE/NE ops which you then follow up with an integer BEQzero)

PowerISA *specifically* has these 4bit CRs  and i feel we should go with the flow on that rather than try to invent an alternative condition scheme that does not mesh with what the original PowerISA designers envisaged (for scalar)

think of it this way: a single bit predicate of compares effectively throws away the other 2 bits of the same op if using CR, doesn't it?

so to replicate that exact same behaviour it would be necessary to call at least 3 vector compares (single bit predicate) and even use 3 separate int regs to do so just to get what could have been done with only a single vector CR based compare.

unless i have misunderstood this does not sound like a step forwards! :)
Comment 36 Luke Kenneth Casson Leighton 2020-10-08 07:36:31 BST
right. ok. idea.  predicates do need to be created, but (sticking with a Power-like design principle) the cmps are in CRs. therefore, the logical way (to also preserve vector-cmp-CR behaviour) is to have mfcr copy *multiple* bits from a vector of CRs into the destination reg, to create a predicate mask.

that predicate mask can then be applied to subsequent vector ops (including even CR ops such as crand, mfcr and mtcrf if we decide that's a good idea)

however... i probably don't mean mfcr, above :)  mfcr is designed to copy whole CRs

https://libre-soc.org/openpower/isa/sprset/

setb? no, that one is designed to turn a CR (which just tested whether a result was +ve -ve or 0) into integer +1 -1 or 0.


we want a *bit* of a CR.  vectorised-isel?

https://libre-soc.org/openpower/isa/fixedtrap/

that looks better, esp. with predication applied to it to stop forced-zeroing.

for i in range(VL):
    if predicatemode and INTPRED[i] == 0: skip
    if CR[BC+32+i*4] = 1 then RT[i] <- (RA|0)
    else                      RT[i] <- (RB)
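that pseudocode, modelled executably in python (a sketch only: the CR file is treated as a flat bit array, (RA|0) is simplified to RA, and masked-out lanes are left untouched):

```python
def vec_isel(VL, CR, RA, RB, INTPRED, BC, predicatemode=True):
    """vectorised isel: per element, pick RA[i] or RB[i] depending on
    the selected bit of that element's CR; skip masked-out elements."""
    RT = [None] * VL
    for i in range(VL):
        if predicatemode and not (INTPRED >> i) & 1:
            continue                      # predicate bit clear: skip lane
        if CR[BC + 32 + i * 4]:           # BC selects the bit within each CR
            RT[i] = RA[i]                 # (RA|0) simplified to RA here
        else:
            RT[i] = RB[i]
    return RT
```

the BC offset is what lets the same instruction pick out EQ, LT or GT from each element's 4-bit CR.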

that i think jacob is "effectively" what you are suggesting (is that right?) except it uses CRs as an intermediary to get there.

it has the advantage that BC in the isel instruction can specify the offset which allows the appropriate CR bit to be selected.

this is the "equivalent" of the idea of using some of the SVPrefix bits to choose LE/GT/EQ but not actually needing to use precious SVPrefix bits to do so!

those could instead be used to specify:

* isel ordinary  scalar/vector RT mode
* isel "hey treat RT as a pred" mode
* multipliers on BC that allow it to reach the full 64 CRs.

reason for this last one: BC is only 5 bits (0-31) to select any one of 32 bits of the scalar mode CR reg.

but vector CR to store 64 CRs needs 3 more bits added to BC because scalar mode CR reg has only 8 CRs CR0-CR7.

we want vector CR0-CR63 which is 8x as many CRs so we need 3 extra bits on BC.
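the BC arithmetic above, as a worked check (assuming 4 bits per CR, as in scalar PowerISA):

```python
# scalar CR field: 8 CRs x 4 bits = 32 bits, addressable by a 5-bit BC
assert 8 * 4 == 32 == 2 ** 5

# vector CR file: 64 CRs x 4 bits = 256 bits, needing an 8-bit index
assert 64 * 4 == 256 == 2 ** 8

# so BC needs 8 - 5 = 3 extra bits, as stated above
print(8 - 5)  # 3
```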
Comment 37 Luke Kenneth Casson Leighton 2020-10-08 20:09:29 BST
(In reply to Luke Kenneth Casson Leighton from comment #33)

> one thing though: i am really really leaning towards completely ignoring the
> XER SO field for SV when it comes to CRs, and potentially giving that bit in
> the CRs a new much more useful purpose.

i wonder if we should, in SV mode, use the CR SO bit as a predicate? and use the OE bit of instructions to say that "predication is to be enabled" or not.

hmmm....
Comment 38 Jacob Lifshay 2020-10-08 20:36:42 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> 
> > the idea is that the compare would produce 1 bit per vector lane and
> > essentially directly generate a predicate mask into an integer register. For
> > that to work, the compare would need extra bits (normally in the branch
> > instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it
> > should use, those bits come from the prefix.
> > 
> > As long as it's one bit per lane, scalar integer ops are even better than cr
> > ops for the required bit manipulations.
> 
> i came up with an architectural plan to implement the hidden bitsetting in
> 6600 style OoO and to be honest it was a bit of a pig.
> 
> an exception in the middle required a very messy design.

Why would you ever need to handle exceptions in the middle of a cmp instruction? cmp instructions won't ever generate their own exceptions and interrupts would generate a fake interrupt instruction which would go either before or after the sequence of cmp instructions.

All we need to do is design cmp to target only one of 2 integer registers and tell the scheduler that they only write to their destination register and add individual bit write enables on those 2 integer registers or just treat each bit as a separate register. The rest of the destination register selection bits can be used to encode the compare condition, along with the two other reserved bits in the cmp[l] instructions.

We could also keep the above encoding with cmp targeting one of 2 integer registers and have a internal non-ISA-visible register for mask accumulation that is copied to the destination integer register. The final copy could be included in the final lane compare instruction or split out as a separate micro-op.

> CRs on the other hand by being treated as actual "real" registers respected
> and each given their own Dependency Matrix column are far easier to handle.
> 
> exceptions in the middle of that, no problem, just restore VL forloop where
> it left off.
> 
> bottom line is that PowerISA has condition registers which store results
> that you then decide which bits to test to make different branches, i.e. the
> compare is separated from the branch *by* the CR.
> 
> this is conceptually similar to RV FP compare except it wastes an entire 64
> bit int reg to do it (RV FP cmp stores 1 or 0 in an int reg for FP
> GT/LT/LE/NE ops which you then follow up with an integer BEQzero)

An entire int reg --- we have 128, losing 1 won't hurt much, especially since we'd need it for masking vector ops anyway (what compare results are usually used for).

> PowerISA *specifically* has these 4bit CRs  and i feel we should go with the
> flow on that rather than try to invent an alternative condition scheme that
> does not mesh with what the original PowerISA designers envisaged (for
> scalar)

There are other issues with the CRs: several of them are callee-save so any function using vectorized compare would usually need to save and restore the CRs.

> think of it this way: a single bit predicate of compares effectively throws
> away the other 2 bits of the same op if using CR, doesn't it?
> 
> so to replicate that exact same behaviour it would be necessary to call at
> least 3 vector compares (single bit predicate) and even use 3 separate int
> regs to do so just to get what could have been done with only a single
> vector CR based compare.

Except that you rarely need more than one compare result, so all the extra bits are usually ignored.

Also, the isel instruction doesn't seem to have the right semantics: what if you want floating-point ge, where the defined semantics require the output to be set if the greater or equal bits are set, but not the less or unordered bits? (the unordered bit is where integer compares put SO) There isn't any one bit you could pick out of the CR that gives the required combination.

The other benefit of having the compare instruction directly generate the mask is that current or future implementations could need fewer clock cycles to execute due to taking fewer instructions, and it also takes less i-cache space.
Comment 39 Jacob Lifshay 2020-10-08 20:38:13 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> (In reply to Luke Kenneth Casson Leighton from comment #33)
> 
> > one thing though: i am really really leaning towards completely ignoring the
> > XER SO field for SV when it comes to CRs, and potentially giving that bit in
> > the CRs a new much more useful purpose.

If we use CRs for compare results, we still need the SO bit since it is used for "unordered" for floating-point compares.

> i wonder if we should, in SV mode, use the CR SO bit as a predicate? and use
> the OE bit of instructions to say that "predication is to be enabled" or not.
> 
> hmmm....
Comment 40 Luke Kenneth Casson Leighton 2020-10-08 22:59:51 BST
(In reply to Jacob Lifshay from comment #38)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > (In reply to Jacob Lifshay from comment #34)
> > 
> > > the idea is that the compare would produce 1 bit per vector lane and
> > > essentially directly generate a predicate mask into an integer register. For
> > > that to work, the compare would need extra bits (normally in the branch
> > > instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it
> > > should use, those bits come from the prefix.
> > > 
> > > As long as it's one bit per lane, scalar integer ops are even better than cr
> > > ops for the required bit manipulations.
> > 
> > i came up with an architectural plan to implement the hidden bitsetting in
> > 6600 style OoO and to be honest it was a bit of a pig.
> > 
> > an exception in the middle required a very messy design.
> 
> Why would you ever need to handle exceptions in the middle of a cmp
> instruction? 

not in the middle of one cmp instruction: an exception in the middle of a
*vector* batch of up to *64* cmp instructions.

> cmp instructions won't ever generate their own exceptions and
> interrupts would generate a fake interrupt instruction which would go either
> before or after the sequence of cmp instructions.

if you forcibly mask out interrupts (exceptions) in the middle of a vector, latency goes to s*** :)

if you decide to "throw away results and start again" but have written *any* part of those results - including any bits of the hidden predicate - to any regfile, now you have irrecoverable data corruption when "restarting from element zero".

reasons why you would want to write partial results include that the microarchitectural internal vector length (back-end SIMD in our case) may only be 4-wide or 8-wide and the requested vector length is 16 or above.


> All we need to do is design cmp to target only one of 2 integer registers

2 integer registers reserved as predicates for the whole of the vector
of cmps?

if the answer to that is "yes", i didn't make a fuss about it but the OoO scheduling for that is an absolute pig.

it basically means that batches of elements actually depend on the same (integer) register, making it a *multi*-targeted Write Hazard.

contrast this to each cmp independently targeting a completely independent CR that has a completely independent Dependency Matrix column that in no way overlaps with any other element, and it should be pretty clear that the use of CRs is an absolutely massive design simplification.


> and tell the scheduler that they only write to their destination register
> and add individual bit write enables on those 2 integer registers or just
> treat each bit as a separate register. 

right.  this requires the creation of at least a 32-wide Dependency Matrix just to cover individual bits of a register.

i mean - it _works_... but here's the thing: *that's exactly what's going to have to be done for the Condition Registers*.

so in addition to (say) a minimum of 32-wide DM columns added for CRs, on top of that you're proposing an ADDITIONAL 32 DM columns for covering single-bit predication...

... when we can entirely skip that by using one of the bits *of* the CRs *as* the very predicate source/target bit that you're proposing

and skip a whopping 32 x 20 (or so) extra DM entries.


> The rest of the destination register
> selection bits can be used to encode the compare condition, along with the
> two other reserved bits in the cmp[l] instructions.

ah there's spare bits?  ah!  well, we need to be very careful about using them, in particular we need explicit approval from the OpenPOWER Foundation to do so (even though we are doing this entirely behind a "libresoc modeswitch").


> 
> We could also keep the above encoding with cmp targeting one of 2 integer
> registers and have a internal non-ISA-visible register for mask accumulation
> that is copied to the destination integer register.

this idea i came up with for SV-RISC-V "branch", and it's quite dodgy.  doable, but dodgy.

> The final copy could be
> included in the final lane compare instruction or split out as a separate
> micro-op.

if we didn't have any other better options (using CRs as-is for their intended purpose in scalar world, just "vectorised") i'd say yes, let's do it, because i know what you're referring to, and had to design it for SV-RISC-V "branch".

OpenPOWER ISA, due to the fact that CRs exist, has no need for this kind of hard-hack.

> An entire int reg --- we have 128, losing 1 won't hurt much, especially
> since we'd need it for masking vector ops anyway (what compare results are
> usually used for).

some cross-over occurred here: i'm proposing that we use *CRs* for predicate masking, not an int from the int regfile :)

as i mentioned above this results in a massive simplification of the microarchitectural implementation, in particular it removes a thorny / problematic area that i never really liked, which was that the predicated vector operation is forced to stall until the INT regfile predicate mask has been read.

worse than that: if the bits turn out to be mostly zero, you just wasted pretty much the entire bandwidth of the CPU, maxed out the Reservation Stations *for no good reason*.

using separate and distinct CRs (as in the pseudocode from comment #36) these are *really easily* schedulable, and resolve incredibly easily as well, without slowing down the entire OoO engine waiting for reading of one single integer register.



> > PowerISA *specifically* has these 4bit CRs  and i feel we should go with the
> > flow on that rather than try to invent an alternative condition scheme that
> > does not mesh with what the original PowerISA designers envisaged (for
> > scalar)
> 
> There are other issues with the CRs: several of them are callee-save so any
> function using vectorized compare would usually need to save and restore the
> CRs.

callee-save... this is an ABI design issue?  that's solvable by avoiding conflicting with CR0-7 i.e. vectorised-CR uses CR8 and above.


> > think of it this way: a single bit predicate of compares effectively throws
> > away the other 2 bits of the same op if using CR, doesn't it?
> > 
> > so to replicate that exact same behaviour it would be necessary to call at
> > least 3 vector compares (single bit predicate) and even use 3 separate int
> > regs to do so just to get what could have been done with only a single
> > vector CR based compare.
> 
> Except that you rarely need more than one compare result, so all the extra
> bits are usually ignored.

in standard scalar operations yes, however in predicated vector operations it's a different story.

> Also, the isel instruction doesn't seem to have the right semantics:

yeah i can't seem to find something in the scalar ISA that "just operates on transferring of bits from CR to GPR".  anything involving RA/RT works on batches of 4 bits of the full CR.

> what if
> you want floating-point ge where the defined semantics are you need the
> output to be set if the greater or equal bits are set, but not the less or
> unordered bits? 

errr... i've not looked at the FP instructions (chapter 4, p123 v3.0B) at all.
i know it generates CR1 (when Rc=1)

thinking it through "out loud" so to speak, i'd say... mmmm... you'd do a FP subtract (with Rc=1), this would normally store in CR1.

however if vectorised, it would put them into say.... CR8....CR(8+VL-1)

then you could perform standard crnand/cror operations on them to compute the required predicate bit (in those exact same CRs if you wanted) and go from there
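rough python model of that flow, just to make it concrete (the CR numbering, helper name and dict representation are purely illustrative; the bit layout follows the scalar LT/GT/EQ/SO convention, with SO doubling as "unordered" for FP):

```python
# model each 4-bit CR field as a dict of the scalar condition bits
def fcmp_to_cr(a, b):
    unordered = (a != a) or (b != b)  # NaN never equals itself
    return {"LT": (not unordered) and a < b,
            "GT": (not unordered) and a > b,
            "EQ": (not unordered) and a == b,
            "SO": unordered}

# vectorised Rc=1-style: element i's CR result lands in CR[8 + i]
a = [1.0, 2.0, float("nan"), 4.0]
b = [1.0, 9.0, 3.0,          0.5]
crs = {8 + i: fcmp_to_cr(a[i], b[i]) for i in range(4)}

# crand/cror-style combining then computes a "ge" predicate bit per
# element: (GT or EQ) and not unordered -- the combination that no
# single CR bit provides on its own
pred = [(crs[8 + i]["GT"] or crs[8 + i]["EQ"]) and not crs[8 + i]["SO"]
        for i in range(4)]
```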

> (the unordered bit is where integer compares put SO) There
> isn't any one bit you could pick out of the CR that gives the required
> combination.

ah that's interesting.  no FP ops (p154) have an OE=1.  soOooo...
continuing this in reply to comment #39

> The other benefit of having the compare instruction directly generate the
> mask is that now or in the future implementations could need less clock
> cycles to execute due to taking less instructions, and also it takes less
> i-cache space.

this works by turning the underlying microarchitecture into CISC (explained above).  it's a complex trade-off.
Comment 41 Luke Kenneth Casson Leighton 2020-10-08 23:07:11 BST
(In reply to Jacob Lifshay from comment #39)
> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > (In reply to Luke Kenneth Casson Leighton from comment #33)
> > 
> > > one thing though: i am really really leaning towards completely ignoring the
> > > XER SO field for SV when it comes to CRs, and potentially giving that bit in
> > > the CRs a new much more useful purpose.
> 
> If we use CRs for compare results, we still need the SO bit since it is used
> for "unordered" for floating-point compares.

right.  ok.  so FP ops i see don't have an OE bit.  which tends to suggest
that only the INT ops could have the trick of using SO as a "spare predicate
bit" and using OE to indicate that it's enabled.

for FP, yes, we would need to leave it alone, and use one of the prefix bits
to specify whether the FP op is to be predicated or not (and which CR bit
to select to do so)

however - what we *can* do - after each vectorised FPop has produced a
vectorised CR array (including still having that "unordered" bit) is use
vectorised crand/cror/mfcr (etc) to post-analyse the *array* of CR bits
in a useful manner.

such as masking out any "unordered" operations (saving CPU cycles and Reservation Stations at the ALU in the process)

the idea of transferring to INT-GPRs by vectorising isel is still "a valid possibility" but i think by keeping predicates in CRs we don't *need* to transfer to intregs, and consequently wouldn't need that "special bit-wise RT-to-CR" operation which doesn't seem to exist in PowerISA anyway

hmmm....
Comment 42 Jacob Lifshay 2020-10-08 23:15:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #41)
> such as masking out any "unordered" operations (saving CPU cycles and
> Reservation Stations at the ALU in the process)

NaNs (which are what produce "unordered") are in almost all cases not common enough to be worth running any extra mask manipulation instructions.
Comment 43 Jacob Lifshay 2020-10-08 23:55:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #38)
> > (In reply to Luke Kenneth Casson Leighton from comment #35)
> > > an exception in the middle required a very messy design.
> > 
> > Why would you ever need to handle exceptions in the middle of a cmp
> > instruction? 
> 
> not in the middle of one cmp instruction: an exception in the middle of a
> *vector* batch of up to *64* cmp instructions.

that's what I meant.

> > cmp instructions won't ever generate their own exceptions and
> > interrupts would generate a fake interrupt instruction which would go either
> > before or after the sequence of cmp instructions.
> 
> if you forcibly mask out interrupts (exceptions) in the middle of a vector,
> latency goes to s*** :)

not by that much... if doing f32x32 with a f32x4 SIMD execution unit you get only 8 + pipeline-length cycles of extra latency. Just think, we can process an interrupt in less time than it takes to load the interrupt handler code from DRAM, since it's probably not in the cache anyway.
> > All we need to do is design cmp to target only one of 2 integer registers
> 
> 2 integer registers as reserved as predicates for the whole of the vector
> of cmps?

yes, but we *can* treat each bit or group of 2 or 4 bits like independent registers, avoiding the need for blocking exceptions during cmp and also allowing starting a masked operation before the cmp that produces the mask finishes executing -- the dependencies just operate on lanes rather than the whole mask.
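A toy model of that lane-granular dependency tracking (the group size and every name here are invented, purely to show the idea):

```python
# toy scoreboard: a 64-bit mask register tracked as 16 groups of 4 bits,
# each with its own "written" flag, so a predicated op that only needs
# lanes 0-3 can start as soon as group 0's cmp results are ready.
GROUP_BITS = 4

class MaskScoreboard:
    def __init__(self, nbits=64):
        self.ready = [False] * (nbits // GROUP_BITS)

    def write_group(self, group):
        self.ready[group] = True  # cmps for these lanes have completed

    def can_start(self, first_lane, last_lane):
        """a consumer only waits on the groups covering its own lanes."""
        lo, hi = first_lane // GROUP_BITS, last_lane // GROUP_BITS
        return all(self.ready[lo:hi + 1])

sb = MaskScoreboard()
sb.write_group(0)              # lanes 0-3 of the mask are now ready
early_ok = sb.can_start(0, 3)  # can issue: no wait on lanes 4-63
late_ok = sb.can_start(4, 7)   # must wait: group 1 not written yet
```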

The other benefit of using the integer registers for mask registers is that we can use all the weird and wonderful bitwise ops on them (popcount, find first set, or find last set) that aren't supported on CR registers, as well as trivially having many more registers for storing multiple masks -- which is quite important in vectorized code with many different conditions that get combined together; we would have very high pressure on the CR registers due to only being able to fit a few masks in them, since a 64-lane compare would overwrite all 4 bits in all registers.

> if the answer to that is "yes", i didn't make a fuss about it but the OoO
> scheduling for that is an absolute pig.

if we split those 2 integer registers that you can write cmp results to (and not any others) into many small groups of bits that resolves the OoO scheduling concern.

Instructions should be able to use more than 2 integer registers as mask inputs since read dependencies are not a concern.

> it basically means that batches of elements actually depend on the same
> (integer) register, making it a *multi* targetted Write Hazard.
> 
> contrast this to each cmp independently targetting a completely independent
> CR that has a completely independent Dependency Matrix column that in no way
> overlaps with any other element, and it should be pretty clear that the use
> of CRs is an absolutely massive design simplification.

I'm advocating for those two integer registers being the targets for all vector instructions that produce masks because integer registers have the benefits of being an integer register without the drawbacks of the CR registers as explained above.
> 
> 
> > and tell the scheduler that they only write to their destination register
> > and add individual bit write enables on those 2 integer registers or just
> > treat each bit as a separate register. 
> 
> right.  this requires the creation of at least a 32-wide Dependency Matrix
> just to cover individual bits of a register.
> 
> i mean - it _works_... but here's the thing: *that's exactly what's going to
> have to be done for the Condition Registers*.
> 
> so in addition to (say) a minimum of 32-wide DM columns added for CRs, on
> top of that you're proposing an ADDITIONAL 32 DM columns for covering
> single-bit predication...

ummm... you only need the 8 existing CRs if using the integer registers for masking.

> > > think of it this way: a single bit predicate of compares effectively throws
> > > away the other 2 bits of the same op if using CR, doesn't it?
> > > 
> > > so to replicate that exact same behaviour it would be necessary to call at
> > > least 3 vector compares (single bit predicate) and even use 3 separate int
> > > regs to do so just to get what could have been done with only a single
> > > vector CR based compare.
> > 
> > Except that you rarely need more than one compare result, so all the extra
> > bits are usually ignored.
> 
> in standard scalar operations yes, however in predicated vector operations
> it's a different story.

no, it *is* basically the same. you don't generally need eq, gt, and lt all from the same compare -- all you need is predication based on "did the corresponding lane of the vector compare spit out a true" and the corresponding "did it spit out a false" case, both of which can be achieved with a single mask by having an invert-mask bit in the SimpleV prefix for predicated instructions.
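A python sketch of how one mask plus an invert bit covers both branches (hypothetical names; this is a model of the idea, not an actual SV encoding):

```python
# one mask, two uses: a (hypothetical) invert bit in the prefix selects
# either the "then" lanes or the "else" lanes, with no second compare
# and no explicit NOT instruction.
def apply_mask(dest, src, mask, vl, invert=False):
    for i in range(vl):
        bit = (mask >> i) & 1
        if bit != invert:        # invert flips which lanes are active
            dest[i] = src[i]
    return dest

mask = 0b0101  # lanes 0 and 2 took the branch
then_part = apply_mask([0] * 4, [10, 20, 30, 40], mask, 4)
else_part = apply_mask([0] * 4, [10, 20, 30, 40], mask, 4, invert=True)
```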
Comment 44 Luke Kenneth Casson Leighton 2020-10-09 01:07:10 BST
(In reply to Jacob Lifshay from comment #43)

> yes, but we *can* treat each bit or group of 2 or 4 bits like independent
> registers, avoiding the need for blocking exceptions during cmp and also
> allowing starting a masked operation before the cmp that produces the mask
> finishes executing -- the dependencies just operate on lanes rather than the
> whole mask.

bit tired (00:20 here) will go over implications in followup another time.
 
> The other benefit of using the integer registers for mask registers is that
> we can use all the weird  and wonderful bitwise ops on them (popcount, find
> first set, or find last set) that aren't supported on CR registers 

i remember now.  that was one of the big advantages of SV-RV.

> as well
> as trivially having many more registers for storing multiple masks -- which
> is quite important on vectorized code with many different conditions that
> get combined together; we would have very high pressure on the CR registers
> due to only being able to fit a few masks in them since a 64-lane compare
> would overwrite all 4 bits in all registers.

yyyeah, true.

arg.  i really am not keen on pausing the execution of vector ops to read an int reg.  it's doable: a predicate "shadow" unit reads the int, and pulls "die" or "release shadow" on its respective element once available.

that's pretty straightforward.

writing individual bits out to an int reg from a batch of cmps is however a very different matter.

micro-architecturally it would be better to extend the CRs to 128 (16x 64 bits) 

other options: just to expect people to reduce VL to saner sizes when issuing cmp instructions

then use mtocrf, isel, something anything to transfer the required CR bits to an "int as predicate".


sort-of like the micro-op idea you mentioned a few comments back except using an explicit instruction for it instead.

the micro-op route is how RISC gets turned into CISC if you're not careful

with 128 CRs there is less pressure plus bitselection and transfer to INT GPRs allows further processing (popcount) plus use it as a backup cache.

i need to mull this over.


> > if the answer to that is "yes", i didn't make a fuss about it but the OoO
> > scheduling for that is an absolute pig.
> 
> if we split those 2 integer registers that you can write cmp results to (and
> not any others) into many small groups of bits that resolves the OoO
> scheduling concern.

no: it massively increases the size of the Dependency Matrices.  we are already at the "alarmingly large size to the point where we may have to do a PRF-ARF i.e. have register caches".

and when you have bit-divisions of a reg you need to simultaneously pull the main 
64 bit int DM column *and the bits as well*

but.. which bits? the real implications are that you need to have full bitlevel DMs - as in 64x128 DM columns! times 20 FU rows!

which is where the PRF-ARF allocation comes in.  intregs allocated as predicates would be allocated the "64 bit-wise DM columns" (which is far too many, so we'd need to drastically cut that back to e.g. 16 max)

now we need a register cache mapping from registers in the PRF to *parts* (ranges of bits!) of the reg bitlevel cache!

you see how insanely complex that's getting?

for CRs it is (slightly) less complex because they're already subdivided into small chunks.


> Instructions should be able to use more than 2 integer registers as mask
> inputs since read dependencies are not a concern.

yeah, they are.  all dependencies have to be respected and analysed.  they absolutely cannot be ignored.

> > it basically means that batches of elements actually depend on the same
> > (integer) register, making it a *multi* targetted Write Hazard.
> > 
> > contrast this to each cmp independently targetting a completely independent
> > CR that has a completely independent Dependency Matrix column that in no way
> > overlaps with any other element, and it should be pretty clear that the use
> > of CRs is an absolutely massive design simplification.
> 
> I'm advocating for those two integer registers being the targets for all
> vector instructions that produce masks because integer registers have the
> benefits of being an integer register without the drawbacks of the CR
> registers as explained above.

int regs as dest for mask really is much more complex than you currently imagine.


> ummm... you only need the 8 existing CRs if using the integer registers for
> masking.

then how are Rc=1 operations treated?

do all vector INT ops try to all write to CR0, and all FP ops write to CR1?

last element (VL-1) writes, all other writes are destroyed?

this is where the idea of extending CR to at least 64 elements comes from - nothing to do with predication.


> no, it *is* basically the same. you don't generally need eq, gt, and lt all
> from the same compare 

i can see exactly this being very useful, particularly when involving predicate masks (on the crand/or as well) to e.g. generate a min-max range-selection or other complex suite.

> -- all you need is predication based on "did the
> corresponding lane of the vector compare spit out a true" and the
> corresponding "did it spit out a false" case, both of which can be achieved
> with a single mask by having a invert-mask bit in the SimpleV prefix for
> predicated instructions.

to create more complex compound masking (range exclusion being one example i can think of initially) would, i think you'll find, need quite a few more instructions that then have to move to the yes/no mask.

where CRs have already been designed to cover quite a bit more than other RISC ISAs.

that said popcount and ffirst and the bitpattern propagation etc these are not just valuable they're essential in vector ISAs.

it's going to need a lot more thought, however at the moment i am leaning towards a hybrid that includes best practical parts of both.
Comment 45 Luke Kenneth Casson Leighton 2020-10-09 15:38:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #44)

> then how are Rc=1 operations treated?

remember, jacob, unlike e.g. RISCV which can only do conditional test/branch through BEQ/BNE, PowerISA has these Rc=1 modes which test the arithmetic result.  the logic gates needed to do so are quite small and so are easily justifiable, when you get a "free eq/gt/lt" comparison as part of every arithmetic op.
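to illustrate what Rc=1 gives for free, a small python model (the function name is made up; the semantics shown are the standard scalar CR0 ones: signed compare of the result against zero, plus a copy of XER.SO):

```python
# scalar PowerISA Rc=1 behaviour: CR0 gets a "free" signed comparison of
# the arithmetic result against zero, plus a copy of XER.SO.
def rc1_cr0(result, xer_so=0, bits=64):
    # interpret the low `bits` bits as a signed integer
    if result & (1 << (bits - 1)):
        result -= 1 << bits
    return {"LT": result < 0, "GT": result > 0,
            "EQ": result == 0, "SO": bool(xer_so)}

cr0 = rc1_cr0(0xFFFFFFFFFFFFFFFF)  # -1 as a 64-bit value
# cr0["LT"] is set: no separate cmp instruction was needed
```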

it would be anomalous to only modify cmp to become a single-bit comparison: questions would be asked by experienced (puzzled) PowerISA architects, "why on earth are you proposing cmp be reduced in capability to single bit instead of a full CR in the first place, and why only cmp rather than all Rc=1 operations?"

we have to be able to justify that to people with *25* years experience in PowerISA and i definitely don't have good answers, additionally being instead much more in favour of respecting PowerISA scalar capability (Rc=1 included) just extended to the parallel domain.  (general rule of SV: don't change the scalar behaviour without a damn good reason)

if however predicates must be *applied* via bitwise (from INT GPRs) then popcount (etc) gets a word in edgeways.
Comment 46 Luke Kenneth Casson Leighton 2020-10-09 17:12:17 BST
so to explain: anything exclusively "in-flight" which has inter-dependencies is seriously problematic.

by "in-flight" this refers to all intermediary results that have come from registers (GPR, XER, CRs) and have not yet been stored back in the same.

so let's go over the micro-architectural design implications of the idea of:

* computing vector results
* creating a comparison vector
* creating a "scalar summary" of that comp vector
* using a micro-op to do so

the requirements in an OoO design are: the entire operation MUST be atomic as far as dependencies are concerned

OR

it must be "re-entrant" i.e. possible to interrupt, store state, and continue.

(this latter you rejected, jacob, advocating instead "complete the vector op, disregard latency" which has detrimental implications i won't go into right now).

so we are going with "full atomic high latency transaction with in-flight micro-ops" but still permitting cancellation because whilst interrupts are "reasonable" to ignore, cancellation most certainly is not.

with that in mind, we can begin the architectural design analysis.

* the micro-op must have access to the bitvector of compares in an in-flight register.
* the micro-op must have access to ALL elements of the in-flight vector (all 64)
* the RESULT vector must ALSO be in-flight i.e. not permitted to write to GPRs during this time
* why? because the scalar compare is not ready yet and the entire operation is both atomic and cancellable
* therefore we must have 64 Reservation Stations to hold that in-flight data
* this is PER PIPELINE (!)
* therefore there must be 64 LOGICAL RSes, 64 ALU RSes, 64 FP RSes, 64 SHIFTROT RSes, 64 DIV RSes.

this comes to a whopping 200+ Reservation Stations which would require somewhere of the order of 2 million gates.

i'm going through this in detail to show you that even the "simplest-sounding" idea has far-reaching microarchitectural implications that can end up as completely impractical to implement.

similar logic also applies to the "simple-sounding" idea of creating bitlevel Dependency Matrices.

a "re-entrant" design (one that does not have the micro-op creating an unnecessary dependency) on the other hand can be limited to a "window" (Lanes) at whatever silicon depth the implementor sees fit to choose.

shadowing (mask cancellation) is actually still a bottleneck in the re-entrant case, where the in-flight vector is too large for the RSes, however it is perfectly normal to "stall" further issue until the end of the vector instruction is reached.  therefore high performance (OoO issue) is achieved by keeping VL well within the bounds of the available RSes.

however for smaller VLs re-entrancy is perfectly possible and achievable (and does not cause data corruption when no "shadow" crosses that operation) by allowing partial vector results to be stored fully before the interrupt point, then saving the current hardware value of "i" in "for i in range(VL)" as part of context-switched state.

the conditions here are that you cannot blithely save arbitrary elements anywhere in the vector: *all* elements from 0 to the current value of i must be saved before the interrupt is allowed to proceed.

any inflight results *may* be mask-cancelled but you have to roll back "i" to the last fully saved elements before allowing the interrupt/exception to proceed.
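a minimal python sketch of that re-entrant scheme (function and parameter names are invented; the element op is a stand-in):

```python
# re-entrant vector execution: on an interrupt, every element below the
# saved resume index has been committed; everything at or above it is
# cancelled and re-executed after the interrupt returns.
def vector_op(dest, src, vl, resume, interrupt_at=None):
    """returns (completed, resume_index) -- the index is saved as part
    of context-switched state."""
    for i in range(resume, vl):
        if interrupt_at is not None and i == interrupt_at:
            # elements 0..i-1 are architecturally committed; any
            # in-flight elements >= i are mask-cancelled, i is saved.
            return False, i
        dest[i] = src[i] + 1  # stand-in for the real element operation
    return True, vl

dest = [0] * 8
done, resume = vector_op(dest, list(range(8)), vl=8, resume=0,
                         interrupt_at=5)
# ... context switch happens here; later, resume at element 5 ...
done, resume = vector_op(dest, list(range(8)), vl=8, resume=resume)
```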

these are the kinds of considerations that need to be taken into account even for the "simplest" sounding idea! it's pretty mental.
Comment 47 Luke Kenneth Casson Leighton 2020-10-13 05:37:17 BST
we need a page on the wiki for "sv openpower predication" and list the options discussed here.
Comment 48 Luke Kenneth Casson Leighton 2020-10-13 05:54:09 BST
https://libre-soc.org/openpower/openpower/sv/predication/?updated

ideas to be expanded there (cut paste comments from here).

another idea, expanding on override of the CR SO/OV field: copy the incoming predicate bit for that element and store it in the SO field of the CR associated with that element's result.

this would make it possible to - after the operation - use CR ops (crand/or/etc) combining them with the comparison results (GT, LT, EQ) before then copying them back out to an intreg and using the intreg as another predicate mask, but *without* needing the extra ANDing of the *previous* intreg predicate bitmask with the new one.

(note: at the point at which the bits are copied from CRs to an intreg popcount (etc) can be done.)
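a quick python sketch of the idea (entirely illustrative names and representation):

```python
# each element's CR result carries a copy of the incoming predicate bit
# in its SO slot, so later CR ops can combine predicate and comparison
# without re-reading (and re-ANDing) the intreg predicate mask.
def predicated_cmp(a, b, pred_mask, vl):
    crs = []
    for i in range(vl):
        crs.append({"LT": a[i] < b[i], "GT": a[i] > b[i],
                    "EQ": a[i] == b[i],
                    "SO": bool((pred_mask >> i) & 1)})  # predicate copy
    return crs

crs = predicated_cmp([3, 1, 5, 2], [2, 2, 2, 2], pred_mask=0b1011, vl=4)

# new mask = "element was predicated in AND compared greater": one
# crand-style op per element, then one copy-out to an intreg (at which
# point popcount etc. can be applied)
new_mask = sum((cr["SO"] and cr["GT"]) << i for i, cr in enumerate(crs))
```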
Comment 49 Luke Kenneth Casson Leighton 2020-10-16 17:06:14 BST
(In reply to Jacob Lifshay from comment #5)
> the chosen prefixes should mesh well with the 64-bit instructions that were
> added in v3.1 of the spec.

unfortunately the use of one extra major opcode by the 64 bit prefixes is mutually incompatible with using 8 major opcodes for SV P32, P48, P64, VBLOCK and planned Compressed.

(this is why i designed the data pointer concept which effectively implicitly merges a (micro-op) LDST into any immediate operation and gives arbitrary overloads of the immediate length to anything the programmer desires.)

to get 8 major opcodes free and clear is essential to providing enough bit bandwidth for SV Prefixing in the different types.  those 8 major opcodes are for example the only reason why we have 11 bits available as prefixes in SVP32 and P48... and consequently would not need to throw away several weeks' design work that you did on SVP, Jacob.

(6 bits per major opcode leaves only 10 bits for SV P32 and P48.  that is not enough.  however by using *two* major opcodes, in pairs, you get 11 bits.  we need P48, P64, P32, VBLOCK and Compressed, that's 4x2 major opcodes and v3.1 instructions took one of them).
Comment 50 Luke Kenneth Casson Leighton 2020-10-18 06:02:34 BST
source:
https://www.phoronix.com/forums/forum/hardware/graphics-cards/1207878-libre-soc-still-persevering-to-be-a-hybrid-cpu-gpu-that-s-100-open-source?p=1213364#post1213364

I'm looking at this CR thing for a while now, digging into that bug report, and the Power ISA specification, and not really getting any great ideas.

One really bad idea - ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.
Comment 51 Jacob Lifshay 2020-10-19 00:57:45 BST
After spending some time to think, I think I came up with an idea:

I think we should go back to our ideal requirements for the ISA; therefore we should account for the following:
1. We should design the ISA to be what would work well on future processors.
2. We should not add in extra ISA-level steps just because our current microarchitecture might require them, that would hamstring future microarchitectures that don't need the extra steps.
3. It's fine for our current microarchitecture to be non-optimal for less common cases, such as very large VL values.
4. It's ok to add new instructions where necessary, we're doing new things after all.
5. It's ok to deviate from how Power's scalar ISA does things when there's a better way.

Based on the previous points, therefore I think we should do the following:

Use integer registers for vector masks.
I honestly think the CR registers are somewhat of a wart of Power's scalar ISA: they work more-or-less fine for scalar, but should not be extended to vectors of CR registers. Running out of integer registers just because of masks is not a concern, we have 128. Using CR registers violates point 2 because one of the top 3 or 4 most common operations we want is testing to see if no lanes are active and skipping some section of code based on that (used to implement control flow in SIMT programs) -- we can just compare the generated mask to zero or all ones using scalar compare instructions. Additionally we can have mask-generating instructions store to CR0 (or CR1 for FP) depending on whether all unmasked lanes, no unmasked lanes, or some unmasked lanes generate set mask bits. That would shrink to just 2 instructions the common sequence:

outer_mask = ...
then_mask = vector_compare_le(outer_mask, a, b);
if(then_mask != 0) {
    // then_part with then_mask
}
...

assembly code:
    ...
    vec_cmp_le. then_mask, a, b, mask=outer_mask
    branch_not cr0.any, skip
    // then_part with then_mask
skip:
    ...

which comes from the inner `if` in the following SIMT code:
if(outer_condition) { // could also be a loop instead
    ...
    if(a <= b) {
        // then_part
    }
    ...
}

Other benefits of integer registers as masks:
load/store for spilling takes 1 instruction, not several. Also, it takes way less memory.
All the fancy bit manipulation instructions operate directly on integer registers: find highest/lowest set bit, popcount, shifts, rotates, bit-cyclone, etc.

Implementation strategies:
Optimize for the common case when VL is smaller than 16 (or 8). Using a larger VL means we're likely to run out of registers very quickly for all but the simplest of shaders, and our current microarchitecture is highly unlikely to give much benefit past VL=8 or so.
We can split up the one or two integer registers optimized for masking into subregisters, but, to allow instructions to not have dependency matrix problems, we split them up differently:
every 8th (or 16th) bit is grouped into the same subregister.
register bit         subregister number
bit  0               subregister 0
bit  1               subregister 1
bit  2               subregister 2
bit  3               subregister 3
bit  4               subregister 4
bit  5               subregister 5
bit  6               subregister 6
bit  7               subregister 7
bit  8               subregister 0
bit  9               subregister 1
bit 10               subregister 2
bit 11               subregister 3
bit 12               subregister 4
bit 13               subregister 5
bit 14               subregister 6
bit 15               subregister 7
bit 16               subregister 0
...

This allows us to, for VL smaller than the number of subregisters per register, act as if every mask bit was an independent register, giving us all the dependency-matrix goodness that comes with that.
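The bit-to-subregister mapping in the table above is simple modulo arithmetic. As a hedged sketch (illustrative Python, not project code; the function names are made up):

```python
# Hypothetical sketch of the proposed mask-register split: bit i of
# the 64-bit mask register belongs to subregister i % NSUB, so each
# subregister collects every NSUB'th bit, as in the table above.

NSUB = 8  # number of subregisters per mask register (could be 16)

def subregister_of(bit_index):
    """Return the subregister number holding the given mask bit."""
    return bit_index % NSUB

def split_mask(mask, nbits=64):
    """Split a mask value into NSUB subregister values; bit i of the
    mask becomes bit i // NSUB of subregister i % NSUB."""
    subs = [0] * NSUB
    for i in range(nbits):
        if (mask >> i) & 1:
            subs[i % NSUB] |= 1 << (i // NSUB)
    return subs
```

With VL no larger than NSUB, each mask bit lands in a different subregister, which is what gives the per-bit independence described above.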

This also requires way fewer additional registers than even the extending-CR-to-64-fields idea.
Comment 52 Luke Kenneth Casson Leighton 2020-10-19 02:32:35 BST
(In reply to Jacob Lifshay from comment #51)
> After spending some time to think, I think I came up with an idea:
> 
> I think we should go back to our ideal requirements for the ISA, therefore,
> I think we should account for the following:
> 1. We should design the ISA to be what would work well on future processors.
> 2. We should not add in extra ISA-level steps just because our current
> microarchitecture might require them, that would hamstring future
> microarchitectures that don't need the extra steps.

bear in mind that the basic fundamental principle is to squeeze SV in as a conceptual for-loop between instruction decode and instruction issue.  it turns out that if you stick to this then it is irrelevant what microarchitecture and even to a large extent what ISA the SV concept is applied to.

in addition, keeping to this fundamental principle also makes modifications to existing binutils, compilers and simulators also very simple because, again, it turns out to be literally a for-loop.

> 3. It's fine for our current microarchitecture to be non-optimal for less
> common cases, such as very large VL values.

yes.  two things here: firstly, POWER10 shows a way forward where multi-issue SIMD is possible.  we can do the same.  VL=8 can have multi-issue do 2 or 4 backend SIMD Vector ops per clock.

or

interestingly on the phoronix discussion a big.little concept came up, where the idea was for little cores to have massive SIMD backends.  16 or 32 elements, dog slow at scalar but huge throughput on vector.

> 4. It's ok to add new instructions where necessary, we're doing new things
> after all.

yes.

> 5. It's ok to deviate from how Power's scalar ISA does things when there's a
> better way.

well... there is, as long as the development cost implications for the hardware, toolchain and compilers are not too high.

this is really important because if we deviate too much then we face resistance from the Power community as well.

remember that everything we come up with also has to be justified to the OpenPOWER Foundation ISA WG.  it's not going to get rubberstamped, and if we get resistance and have to maintain our own hard forks of toolchains it completely defeats the objectives of the project.

> Based on the previous points, therefore I think we should do the following:
> 
> Use integer registers for vector masks.

i agree very much that the *application* of predicates should be done from intregs.  the reason is that for the VL=64 max, no matter the microarchitecture it's a single scalar regfile read.

in the dead-simple microarchitectural case (one element issued per clock, like in microwatt's VSX patch that Paul Mackerras is doing) it's ridiculously easy to add in an extra reg read at the start of the VL loop, shift it down on each loop, test bit zero, and skip or not skip the operation.

this illustrates very clearly and cleanly precisely why it's called "Simple" V.
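as a rough sketch (purely illustrative Python, not the actual HDL or simulator code), the dead-simple one-element-per-clock loop described above is:

```python
# Illustrative sketch of the "dead-simple" microarchitecture above:
# one scalar regfile read fetches the predicate at the start of the
# VL loop; each iteration tests bit zero to decide whether to skip
# the element operation, then shifts the predicate down by one.

def execute_predicated_vector(op, srcs, dest, vl, pred_mask):
    """Apply `op` element-wise; skip elements whose mask bit is 0."""
    mask = pred_mask  # single scalar regfile read, done once
    for i in range(vl):
        if mask & 1:  # test bit zero of the shifted-down predicate
            dest[i] = op(*(s[i] for s in srcs))
        mask >>= 1    # shift the predicate down on each loop
    return dest
```

skipped elements simply leave the destination untouched (non-zeroing predication, in this sketch).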

> I honestly think the CR registers are somewhat of a wart of Power's scalar
> ISA, 

true, except it's there, and it has some interesting advantages.  one of them is that tests which take a long time to do (DIV ops) can be stored and manipulated later.

the second is that branches do not delay significantly by having extra gates that would cause them to have to be split across extra cycles.

the combination of these factors reduces in-flight ops in loops and when coming up to a branch point in high performance OoO designs.
 

> it works more-or-less fine for scalar, but should not be extended to
> vectors of CR registers.

no i agree: we should not try to use CRs for predicates (on the input side).  i looked at the possible implementation: it's hell.  64 CRs being read even before issuing the predicated operation would require... 64 CR Read Hazards.

compared to *ONE* scalar 64 bit int reg (regardless of microarchitecture) it is blindingly obvious that int regs for predicate masks is the winning option.

> Running out of integer registers just because of
> masks is not a concern, we have 128. 

yes.

> Using CR registers violates point 2
> because one of the top 3 or 4 most common operations we want is testing to
> see if no lanes are active and skipping some section of code based on that
> (used to implement control flow in SIMT programs) -- We can just compare the
> generated mask to zero or all ones using scalar compare instructions,

funny i just described the same thing to Alain in a phone call yesterday.

> additionally we can have mask-generating instructions store to CR0 (or CR1
> for FP) if all unmasked lanes, no unmasked lanes, or some unmasked lanes
> generate set mask bits.

interesting.  do note this on the wiki page please.  the reason i like it is because it means that branch can be entirely left alone.  we do not need to modify PowerISA branch *at all*.

i wonder if some pre-existing integer operations already effectively do exactly this.  for example a compare of an int reg against zero will produce a CR0 "equal to zero" bit.

this is important to check, because if int reg operations already do the job we have less deviation and therefore a higher chance of acceptance.


> Other benefits of integer registers as masks:
> load/store for spilling takes 1 instruction, not several. Also, takes waay
> less memory.
> All the fancy bit manipulation instructions operate directly on integer
> registers: find highest/lowest set bit, popcount, shifts, rotates,
> bit-cyclone, etc.

there are some additional crucial bitmanipulation instructions needed here, including one that propagates from the first 1 and stops at the next 1.  some of these are listed in RVV's "mask" opcodes, and they are essential for efficiently doing strncpy and other operations.

we can drop these in as "scalar bitmanip" where they will benefit scalar as well.
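as a hedged sketch of the kind of primitive meant here (modelled on RVV's set-before-first / set-including-first mask ops; the function names are illustrative, not proposed SV mnemonics):

```python
# Illustrative mask-manipulation primitives in the style of RVV's
# vmsbf / vmsif: given a mask produced by e.g. a vector byte-compare
# against NUL, select the lanes before (or up to and including) the
# first set bit.  Useful for strncpy-style early-termination loops.

def set_before_first(mask, nbits=64):
    """Bits set below the lowest set bit of `mask` (all set if mask==0)."""
    if mask == 0:
        return (1 << nbits) - 1
    lowest = mask & -mask          # isolate lowest set bit
    return lowest - 1              # ones strictly below it

def set_including_first(mask, nbits=64):
    """Bits set up to and including the lowest set bit of `mask`."""
    if mask == 0:
        return (1 << nbits) - 1
    lowest = mask & -mask
    return (lowest << 1) - 1
```

for strncpy, a vector compare of bytes against zero yields the mask, and set_including_first then selects exactly the bytes up to and including the terminator.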

> Implementation strategies:
> Optimize for the common case when VL is smaller than 16 (or 8). Using a
> larger VL means we're likely to run out of registers very quickly for all
> but the simplest of shaders, and our current microarchitecture is highly
> unlikely to give much benefit past VL=8 or so.

see above about the POWER10 multi-issue strategy, and about the big.little idea.


> We can split up the one or two integer registers optimized for masking into
> subregisters, but, to allow instructions to not have dependency matrix
> problems,

ahh actually, a single scalar intreg as a predicate mask is dead simple.  it's one read.  that's it.

now, all the predicated element ops have to have a shadow column waiting *for* that read to complete, but this is not hard.

> we split it up differently:
> every 8th (or 16th) bit is grouped into the same subregister.

i *think* what you are saying is that the VL-based for-loop should do 8 elements at a time, push these into SIMD ALUs 8 at a time, so if FP32 then that would be 4x SIMD 2xFP32 issue in one cycle.

below... let us say that this is an elwidth of 8.

> register bit         subregister number
> bit  0               subregister 0
> bit  1               subregister 1
> bit  2               subregister 2
> bit  3               subregister 3
> bit  4               subregister 4
> bit  5               subregister 5
> bit  6               subregister 6
> bit  7               subregister 7

so this would go into one 64bit SIMD-aware ALU, with the Dynamic Partitions set to 8 bit, and the first 8 bits of the intreg predicate would also be sent in as "write enable" lines.

if however all 8 bits of the predicate mask 0-7 were ALL zero then the Shadow Matrix would pull "GO_DIE" on that entire FU's operation.

> bit  8               subregister 0
> bit  9               subregister 1
> bit 10               subregister 2
> bit 11               subregister 3
> bit 12               subregister 4
> bit 13               subregister 5
> bit 14               subregister 6
> bit 15               subregister 7

likewise this would be sent to a separate SIMD-aware ALU, but this time using bits 8-15 of the intregs predicate.

again it would still be shadowed, and again, if bits 8-15 were zero GODIE would be pulled.

in each case the operation goes ahead but is not allowed to write until the predicate has actually been read from the regfile, and its bits divided up and analysed.
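a hedged sketch of the predicate-byte-to-write-enable idea above (illustrative Python, not HDL; the GO_DIE cancellation is modelled as a simple boolean):

```python
# Illustrative sketch: each 64-bit SIMD-aware ALU with 8-bit
# partitions takes one byte of the integer predicate as per-lane
# write-enable lines; if the whole byte is zero, the Shadow Matrix
# can cancel ("GO_DIE") the entire FU operation.

def lane_write_enables(pred_mask, alu_index, lanes=8):
    """Extract the predicate byte for one SIMD ALU as per-lane
    write-enable bits, plus the all-lanes-inactive cancel signal."""
    byte = (pred_mask >> (alu_index * lanes)) & ((1 << lanes) - 1)
    wen = [(byte >> lane) & 1 for lane in range(lanes)]
    go_die = (byte == 0)  # no lane active: cancel the whole FU op
    return wen, go_die
```

ALU 0 gets predicate bits 0-7, ALU 1 gets bits 8-15, and so on, matching the two cases walked through above.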


> bit 16               subregister 0
> ...
> 
> This allows us to, for VL smaller than the number of subregisters per
> register, act as if every mask bit was an independent register, giving us
> all the dependency-matrix goodness that comes with that.
> 
> This also requires waay less additional registers than even the extending CR
> to 64 fields idea.

indeed.

this just leaves the completely separate issue of whether to vectorise CR production, *including* all current scalar production of CR0 (and CR7).

i am referring to all "add." "or." operations (Rc=1) as well as cmp, and also to CR operations themselves (crand etc).

the reasons are as follows:

1) any modification of the ISA to replace CR generation with storage in an integer scalar reg bitfield is a "hard sell" to OPF, as well as gcc and llvm scalar PowerISA maintainers.

2) for suboptimal (simple, slow) microarchitectures it is easy, but for parallel architectures the Write Hazard of a single int reg becomes a serious bottleneck.

3) the codepath in HDL actually requires modification to add "if in VL mode do something weird to select only one cmp bit otherwise do normal CR stuff".  whereas if it is left as-is the *existing* CR handling HDL can be parallelised alongside the ALU element operation and it's an easy sell to HDL engineers.

bottom line is that it is not hard to vectorise CR production right alongside the result production, in fact if we *don't* do that i think we're going to face some tough questions from experienced OPF and IBM ISA people (once they grasp SV which Paul definitely does)

it is also not hard to vectorise the transfer operations between CRs and intregs, and if we allow transfer of Vectors of CRs into one scalar intreg (which is what "mfcr" already does!) then we keep to existing PowerISA design concepts, have the benefits of VRs, yet can still transfer vectors of CR tests to an intreg and perform bitmanip operations, clz, popcount and many more on it, efficiently and effectively.
Comment 53 Jacob Lifshay 2020-10-19 03:34:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #52)
> (In reply to Jacob Lifshay from comment #51)
> > Implementation strategies:
> > Optimize for the common case when VL is smaller than 16 (or 8). Using a
> > larger VL means we're likely to run out of registers very quickly for all
> > but the simplest of shaders, and our current microarchitecture is highly
> > unlikely to give much benefit past VL=8 or so.
> 
> see above about the POWER10 multi-issue strategy, and about the big.little
> idea.

I like the idea of having more execution capacity, however the issue I'm pointing out is that the compiler will just run out of ISA-level registers: if each SIMT shader needs space for 1 4x4 f32 matrix (very common for vertex shaders), that means VL can't be >16 because there aren't enough registers to even hold that many matrices. (admittedly, the matrix is often the same for all shaders and can be shared, but you get my point: there's not much space.)

> 
> > We can split up the one or two integer registers optimized for masking into
> > subregisters, but, to allow instructions to not have dependency matrix
> > problems,
> 
> ahh actually, a single scalar intreg as a predicate mask is dead simple. 
> it's one read.  that's it.

That's true ... if you completely ignore the need to generate masks.

> now, all the preficated element ops have to have a shadow column waiting
> *for* that read to complete, but this is not hard.
> 
> > we split it up differently:
> > every 8th (or 16th) bit is grouped into the same subregister.
> 
> i *think* what you are saying is that the VL-based for-loop should do 8
> elements at a time, push these into SIMD ALUs 8 at a time, so if FP32 then
> that would be 4x SIMD 2xFP32 issue in one cycle.

Nope, what I had meant was to go back to the idea of having a microarchitectural register for every bit of an ISA-level integer register, which allows the equivalent of Cray-style vector instruction chaining. Then, since having that many columns (rows? icr) in the scheduling dependency matrix isn't good, we group bits together, reducing the number of microarchitectural registers:
microarchitectural reg 0: bits 0, 8, 16, 24, 32, 40, 48, and 56 of ISA-level reg
microarchitectural reg 1: bits 1, 9, 17, 25, 33, 41, 49, and 57 of ISA-level reg
microarchitectural reg 2: bits 2, 10, 18, 26, 34, 42, 50, and 58 of ISA-level reg
microarchitectural reg 3: bits 3, 11, 19, 27, 35, 43, 51, and 59 of ISA-level reg
microarchitectural reg 4: bits 4, 12, 20, 28, 36, 44, 52, and 60 of ISA-level reg
microarchitectural reg 5: bits 5, 13, 21, 29, 37, 45, 53, and 61 of ISA-level reg
microarchitectural reg 6: bits 6, 14, 22, 30, 38, 46, 54, and 62 of ISA-level reg
microarchitectural reg 7: bits 7, 15, 23, 31, 39, 47, 55, and 63 of ISA-level reg

This allows a vector compare followed by a masked op to have elements in a masked op start when those corresponding compare elements finish, without having to wait for all compares to finish -- just like vector chaining. This is another reason to not have CRs hold the mask result of a vector compare operation (which *can* be different than scalar compare), since that just doubles the number of registers that the scheduler has to handle to get chaining right, and introduces another instruction of delay.

> 
> this just leaves the completely separate issue of whether to vectorise CR
> production, *including* all current scalar production of CR0 (and CR7).
> 
> i am referring to all "add." "or." operations (Rc=1) as well as cmp, and
> also to CR operstions themselves (crand etc).
> 
> the reasons are as follows:
> 
> 1) any modification of the ISA to replace CR generation with storage in an
> integer scalar reg bitfield is a "hard sell" to OPF, as well as gcc and llvm
> scalar PowerISA maintainers.

I'm advocating for vector ops to target integer registers, scalar ops still do the standard CR things. Rc=1 for vector ops can just generate a mask for eq/ne (I think the most common compare op) or we can just reassign Rc to mean something else for vector ops (one option is to just declare it invalid).

binutils is just about the only place where you might want to treat scalar and vector instructions the same, everywhere else (e.g. LLVM) treats vectors differently.
> 
> 2) for suboptimal (easy, slow) microarchitectures it is easy, but for
> parallel architectures the Write Hazard of a single int reg becomes a
> serious bottleneck.

My idea for splitting up the integer register(s) optimized for masks into separate bits handles this.

> 
> 3) the codepath in HDL actually requires modification to add "if in VL mode
> fo something weird to select only one cmp bit otherwise do normal CR stuff".
> whereas if if it is left as-is the *existing* CR handling HDL can be
> parallelised alongside the ALU element operation and it's an easy sell to
> HDL engineers.
> 
> bottom line is that it is not hard to vectorise CR production right
> alongside the result production, in fact if we *don't* do that i think we're
> going to face some tough questions from experienced OPF and IBM ISA people
> (once they grasp SV which Paul definitely does)

Well, I think the benefits of using integer registers as masks and skipping the extra copy through CRs outweighs the loss of orthogonality (though one could argue that having less register files to deal with increases orthogonality).
Comment 54 Cole Poirier 2020-10-19 03:53:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #52)

> it is also not hard to vectorise the tranfer operations between CRs and
> intregs, and if we allow transfer of Vectors of CRs into one scalar intreg
> (which is already what "mfcr" already does!) then we keep to existing
> PowerISA design concepts, have the benefits of VRs, yet can still transfer
> vectors of CR tests to an intreg and perform bitmanip operations, clz,
> popcount and many more on it, efficiently and effectively.

Here’s a link to the RVV bitmanip verilog reference code; it’s already on the wiki’s resources page, but I think it’s useful to have inline here. https://github.com/riscv/riscv-bitmanip/tree/master/verilog

(The rvv equivalents of bperm are in rvb_bextdep in case anyone is looking for it)
Comment 55 Luke Kenneth Casson Leighton 2020-10-19 05:34:17 BST
(In reply to Jacob Lifshay from comment #53)
> (In reply to Luke Kenneth Casson Leighton from comment #52)
> > ahh actually, a single scalar intreg as a predicate mask is dead simple. 
> > it's one read.  that's it.
> 
> That's true ... if you completely ignore the need to generate masks.

briefly (it's late here), so i'll just do this one and the rest tomorrow.
i'm not [ignoring it]: i'm assuming that integer scalar
operations (and, xor etc) on those integer scalar registers would
be sufficient to cover the role of generating the masks because the
masks *are* the (one) scalar int reg, the one scalar int reg *is* the
mask.

i.e. once computed (generated) using integer scalar operations that
int reg (mask) goes straight (ok after DM hazard clearance) into the
bit-level subdivision needed to turn it into a vector mask.

what am i missing?  did you mean something different by "need to generate mask"?
i interpret "generate mask" to be "operations such as bitwise ANDing" and
for that, clearly, straight scalar 64-bit AND is perfectly sufficient.
Comment 56 Jacob Lifshay 2020-10-19 06:00:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #55)
> (In reply to Jacob Lifshay from comment #53)
> > (In reply to Luke Kenneth Casson Leighton from comment #52)
> > > ahh actually, a single scalar intreg as a predicate mask is dead simple. 
> > > it's one read.  that's it.
> > 
> > That's true ... if you completely ignore the need to generate masks.

I meant that masks should be able to have some bits calculated and used by succeeding masked vector instructions before other bits have finished calculating -- this is critical for vector chaining where a vector compare, optional scalar bitwise logic ops, then vector instructions using the computed mask are all chained together. Probably only `and`, `andn`, and `or` need to have special handling allowing chaining through them, everything else can just wait for all bits of the mask to be available before executing the operation.

Chaining is as described in comment #53

Comment 57 Luke Kenneth Casson Leighton 2020-10-19 06:09:55 BST
again brief message (honest), traditional vector has element data staying "in place", it neither moves lane nor regnum (ok potentially renamed PRF ARF but no more).

with traditional vectors typically being absolutely mentally long, to the extent of needing external hyperfast SRAM, it is critical that they not move about, and the only sane way to do predicate bitmasks in this type of arrangement is to treat them as actual vector registers.

typically (just as in RVV, which aims to do MAXVL - num of lanes - anywhere from 1 to 2^64-1, yes really) they take bit 0 of each element as the predicate bit and ignore all others.

by setting a hard limit of 64 (actually len(Intreg)) we no longer have the constraints that force predicates to be actual vectors and therefore *also no longer need to consider them to be static in-place registers*.

the predicates in the SV arrangement i am thinking of are *input data only* and once processed are chucked away.

this because they are stored "in-flight" and come from a Read-Hazard-protected regfile.

the arrangement you are advocating, Jacob, is ideally suited to "in-place" data processing where the expectation is that after the mask register has been "used" by the operation it then goes on elsewhere for further processing.

what actually happens in an OoO engine is that a COPY of the Hazard-managed register goes into a Reservation-Station latch, and after use is totally discarded.

thus the intreg used as a predicate bitmask is COPIED

if a new predicate bitmask needs to be "generated" and it happens to involve the current mask then ANOTHER COPY of that mask, as an intreg, will go into ANOTHER totally separate RS latch, along with another intreg, and the result will be written into an output RS latch and, after regfile DM clearance, go back into the int regfile.

operand forwarding can get that into a predicate RS without having to go into the regfile.

these int ops (which happen by coincidence to be used as predicate masks) can be done IN PARALLEL with the vector ops that will be using them, hilariously using the exact same ALUs.
Comment 58 Luke Kenneth Casson Leighton 2020-10-19 06:20:13 BST
(In reply to Jacob Lifshay from comment #56)

> I meant that masks should be able to have some bits calculated and used by
> succeeding masked vector instructions before other bits have finished
> calculating -- 

bit of overlap: this is not a problem, at all.  the OoO engine doesn't care if the ALUs are going to be used as predicates or as vector elements, it's all scalar as far as it is concerned.

the only trick is to make sure that there are enough ALUs.  if we only allocate say 2x FP64 ALUs or 2x SIMD-capable INT ALUs, expecting these to cope with sustained 2x FP64 predicated vector ops or 2x 8xbyte SIMD ops, that's not gonna happen, because we forgot to allocate the extra ALU needed to do the ANDing and XORing (etc) scalar mask ops.

okok for FP it would be fine because FP is a totally separate pipeline from INT logical.

okok actually INT Logical is even totally separate from INT arithmetic pipeline

:)

but you get the general idea. to get contention we would need the logical ALUs to be totally jammed with vector ops such that they fought for pipeline resources with the (scalar) predicate mask gen ops.

under these circumstances we would simply... allocate more Logical pipelines and increase the number of Logical RSes.
Comment 59 Jacob Lifshay 2020-10-19 07:58:59 BST
I created an example on the wiki, which hopefully finally gets across my point about vector chaining through the mask:
https://libre-soc.org/simple_v_extension/masked_vector_chaining/
Comment 60 Luke Kenneth Casson Leighton 2020-10-19 11:59:48 BST
(In reply to Jacob Lifshay from comment #59)
> I created an example on the wiki, which hopefully finally gets across my
> point about vector chaining through the mask:
> https://libre-soc.org/simple_v_extension/masked_vector_chaining/

diagrams, oooo, i love it.  yep got it.  ok let me think it through.
Comment 61 Luke Kenneth Casson Leighton 2020-10-19 12:54:27 BST
darn it, this might partly be why the traditional cray vector system wastes an entire vector element as a predicate, throwing away all of the top bits.  by "wasting" an ALU doing a single-bit operation it has the side-effect of reducing latency.

if the predicate is treated as a single dependency hazard, then yes it becomes a bottleneck.

no huge surprise there, just took me a while to "get".

all solutions no matter how implemented and no matter the microarchitecture involve breaking the predicate down into element-sized chunks that whatever Hazard-tracking is on that microarchitecture can use to get element interleaving.

we *might* be able to treat the scalar int as not a scalar at all but as a "vector of 8 bit integers".

around 18 months ago i came up with a scheme where extra 8 bit ALUs were added to be able to cope with weird small non-power-of-2 Vector Lengths.

the DMs had extra cascading "tree" logic (as extra rows) where if you used a reg for these obtuse 8 bit vector operations they marked the *main* 64 bit FU-REGS DepMatrix Hazard flag *and* marked the relevant 8-bit chunks.

thus you could get overlapping 8 bit operations on *different parts* of a 64 bit register, and because we have byte-level write-enable (8 of them on the 64 bit regfile) there is no RD-MODIFY-WR problem.

plus the "cascade" means that 64 bit reg is protected from being corrupted if needed to be accessed as a 64 bit reg instead of as an 8x8bit vector.

it's pretty horrendously complicated which is the reason i didn't rave about it because it's an optimisation.

or so i thought.

well, it still is, but it looks like it'll be a pretty important one.

the only thing is, it's no good doing those predicate calculations at the 8 bit level if you then go and *read* the damn predicate as a single 64 bit scalar op.  that defeats the entire exercise!

the instruction issue engine would have to issue 8x 8bit reads, *not* issue 1x 64 bit read.

i think that's doable.  i.e the predicate system would hook into the exact same 8bit DM logic that the scalar 8bit ops just used.

argh :)
Comment 62 Luke Kenneth Casson Leighton 2020-10-19 13:34:41 BST
reminder:
https://libre-soc.org/3d_gpu/reorder_alias_bytemask_scheme.jpg

woow, that was dec 2018 :)

the aliasing (the cascade) is the diagram at the top.  after drawing
this out i realised that it would be horrendous to have a combination
of 64-bit FUs, 32-bit FUs, 16-bit FUs *and* 8-bit FUs.

this was where the idea for splitting into 2x 32-bit FUs that "cooperated"
to do 1x 64 bit op came from.

by splitting into 32-bit FUs and then having a small quantity of only 4x 8-bit FUs we do not end up with the horrendous proliferation in size of the FU-REGS Dep Matrix, dwarfing the size of the ALUs that they manage.

it would be perfectly fine to add *only* the kinds of operations needed to do predicate masks broken down to this level of granularity.  LOGICAL pipeline done @ 8-bit granularity, but not the ADD or MUL pipelines.

of course if someone goes "oh look there's this great 8-bit-level Vector LOGICAL processing capability, let's hex that until the silicon glows.. oh and let's do it as predicated ops" then of course there would be contention in the 8-bit RSes.

*but* - hilariously - if they really really wanted to do that then we could conceivably detect that case and use the *64* bit scalar ops for calculating the predicate masks!

urr.... my brain has melted.
Comment 63 Luke Kenneth Casson Leighton 2020-10-19 13:47:04 BST
wait... wait... arrgh no that doesn't quite work, because in some cases you actually want 4 bits of the predicate mask to go to the SIMD-capable ALU, sometimes you want 2 bits (for 2xFP32), sometimes 1 bit (for 1xFP64) and so even an 8-bit subdivision is going to be sub-optimal.

argh.

haha.  you're going to find this amusing / ironic: this is precisely where using CRs as predicate masks would shine.

the load on the DMs would be horrendous unless we worked out a way to "batch" them.  and funnily enough, i've already implemented 8xCR "whole_reg" reading (and noted a bugreport to implement that "cascade" system when it comes to adding the DMs).
Comment 64 Jacob Lifshay 2020-10-19 17:56:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #63)
> wait... wait... arrgh no that doesn't quite work, because in some cases you
> actually want 4 bits of the predicate mask to go to the SIMD-capable ALU,
> sometimes you want 2 bits (for 2xFP32), sometimes 1 bit (for 1xFP64) and so
> even an 8-bit subdivision is going to be sub-optimal.
> 
> argh.
> 
> haha.  you're going to find this amusing / ironic: this is precisely where
> using CRs as predicate masks would shine.
> 
> the load on the DMs would be horrendous unless we worked out a way to
> "batch" them.  and funnily enough, i've already implemented 8xCR "whole_reg"
> reading (and noted a bugreport to implement that "cascade" system when it
> comes to adding the DMs).

I'm advocating doing a similar thing, except at the bit group level for 1 or 2 specially optimized integer registers instead of the CR field level with a vector of CRs.

See comment 53 for details of how a 64-bit register can be broken into 8 8-bit registers. I think we should support either 8 or 16 subgroups for 2 integer registers designated as optimized for vector masks. We would design the SVprefix such that compare ops target only those 2 registers and the execution mask for an instruction can be set to those 2 registers or their complement. If we have space in SVprefix, we could expand to more than 2 registers -- also, not all of the registers we can use for masks have to be split up, we can fall back to the less efficient non-vector-chaining approach when the compiler intentionally picks a different register than the 1 or 2 we have optimized for masks.

As for which registers to use for masks, I think at least 1 of them should be the 1st argument register/return register (since passing execution masks between functions is common) and one should be a callee-saved register, the rest can be selected as needed.

Importantly, I'd strongly argue for a dense bitvector as masks, rather than using the LSB of 8-bit (or more) elements, since that works much better with the bit manipulation instructions, e.g. find first clear mask lane becomes a not (potentially combinable with the op generating the mask) and a find lowest set bit. This directly gives the lane index.

By contrast, using 8-bit lanes for masks means we'd have to add extra logic to handle VL > 8, we'd have to handle scaling the result (an extra shift instruction), and we'd have to make sure that lanes have all bits set before inverting them. If we instead decide to have an "on" lane generate 0xFF instead of 0x01, then popcount is likewise messed up.

All of the above mess is solved efficiently by just having 1 bit per lane.
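a quick python sketch (not project code, just illustration) of why the dense 1-bit-per-element layout composes so cleanly with ordinary bitmanip: find-first-clear is an invert plus lowest-set-bit, and popcount counts active elements directly:

```python
def first_clear_lane(mask, vl):
    """Index of the first element whose mask bit is clear, or -1 if none:
    invert, clamp to VL, then isolate the lowest set bit."""
    inv = ~mask & ((1 << vl) - 1)
    if inv == 0:
        return -1
    return (inv & -inv).bit_length() - 1

def active_lanes(mask, vl):
    """popcount of the (VL-clamped) mask gives the active element count."""
    return bin(mask & ((1 << vl) - 1)).count("1")
```

with an 8-bit-per-lane mask, each of these would instead need a shift to scale the result and extra care about which bits of each lane are set.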

Vectorized CRs still have a bunch of the above mess, because they aren't 1 bit per lane. Also, they have a ISA-level limiting effect on large VLs because of quickly running out of the 64 CRs when you need multiple masks (common in non-trivial shaders).
Comment 65 Luke Kenneth Casson Leighton 2020-10-19 18:23:54 BST
(In reply to Jacob Lifshay from comment #64)

> [... stuff for me to analyse and think about...]

> Vectorized CRs still have a bunch of the above mess, because they aren't 1
> bit per lane. Also, they have a ISA-level limiting effect on large VLs
> because of quickly running out of the 64 CRs when you need multiple masks
> (common in non-trivial shaders).

yes and no: remember they're 4-bit.  so that's 64x4 bits worth of stuff that can be used for vector masks = 256 bits.  if we run out of those we're doing something wrong :)
Comment 66 Jacob Lifshay 2020-10-19 18:39:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #65)
> (In reply to Jacob Lifshay from comment #64)
> 
> > [... stuff for me to analyse and think about...]
> 
> > Vectorized CRs still have a bunch of the above mess, because they aren't 1
> > bit per lane. Also, they have a ISA-level limiting effect on large VLs
> > because of quickly running out of the 64 CRs when you need multiple masks
> > (common in non-trivial shaders).
> 
> yes and no: remember they're 4-bit.  so that's 64x4 bits worth of stuff that
> can be used for vector masks = 256 bits.  if we run out of those we're doing
> something wrong :)

the problem is that all 4 bits in each CR field are written by a compare (the most common mask-generating operation), effectively making each CR field useful for only 1 mask lane, since all other masks that could be stored in the CRs are overwritten.
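a tiny model of the point (standard PowerISA CR field layout, LT/GT/EQ/SO): a compare unconditionally rewrites all four bits of the target field, so only one usable mask bit survives per field:

```python
def cmp_writes_cr_field(a, b, so=0):
    """Model of a signed compare's CR field result (PowerISA: LT GT EQ SO).
    Every compare overwrites all 4 bits, so one CR field can carry at
    most one mask bit per compare."""
    lt = int(a < b)
    gt = int(a > b)
    eq = int(a == b)
    return (lt << 3) | (gt << 2) | (eq << 1) | so
```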
Comment 67 Luke Kenneth Casson Leighton 2020-10-19 19:38:23 BST
(In reply to Jacob Lifshay from comment #66)

> the problem is that all 4 bits in each CR field is written by a compare (the
> most common mask generating operation), effectively changing it to be only
> useful for 1 mask lane per CR field, since all other masks that could be
> stored in the CRs are overwritten.

tck, tck, *thinks*...

if the limit's 64 CRs (no reason why we should not have 128, and a case could be made that, well, 128 int/fp regs therefore 128 CRs) they can be copied to intregs (mfcr) and back (mtcr), and in many cases (applying esoteric bitmanip ops) that's what would be needed anyway.

the question is, really: realistically what the heck are we doing VL@64 for, that would use up that many CRs?

give me a mo to go over the vector 8-bit mask idea
Comment 68 Jacob Lifshay 2020-10-19 19:45:09 BST
(In reply to Luke Kenneth Casson Leighton from comment #67)
> (In reply to Jacob Lifshay from comment #66)
> 
> > the problem is that all 4 bits in each CR field is written by a compare (the
> > most common mask generating operation), effectively changing it to be only
> > useful for 1 mask lane per CR field, since all other masks that could be
> > stored in the CRs are overwritten.
> 
> tck, tck, *thinks*...
> 
> if the limit's 64 CRs (no reason why we should not have 128, and a case
> could be made that, well, 128 int/fp regs therefore 128 CRs) they can be
> copied to intregs (mfcr) and back (mtcr), and in many cases (applying
> esoteric bitmanip ops) that's what would be needed anyway.
> 
> the question is, really: realistically what the heck are we doing VL@64 for,
> that would use up that many CRs?

strncat?

if we decide to use vectorized CRs we would also need instructions for creating dense bitvectors from CRs for all the bitmanip goodness. Similar instructions would be needed for 8-bit-per-lane masks. Using 1-bit-per-lane masks bypasses all that since it's already the correct type of bitvector.

> 
> give me a mo to go over the vector 8-bit mask idea
Comment 69 Luke Kenneth Casson Leighton 2020-10-19 19:59:00 BST
(In reply to Jacob Lifshay from comment #53)

> which allows the equivalent of Cray-style vector instruction chaining. Then,
> since having that many columns (rows? icr) in the scheduling dependency
> matrix isn't good, 

i have a sneaking suspicion that we are going to need to do a "register cache".  PRF-ARF style, except backwards.  normally PRF-ARF is done to give *more* registers (internally) than is in the ISA (e.g. x86 only 16): we need *less*!

a register cache would allow us to not just reduce the number of DepMatrix columns in FU-REGs, it would also allow us the chance to "tag" them and make no distinction between FP and INT as far as ALUs and RSes are concerned.

bottom line is, don't worry too much about DM sizes.  we're not doing 128x30 (which is 250,000 gates), 128 regs x 30 Function Units, we're more likely doing... mmmm... 48 x 30 or so.  just have to see how many actual in-flight ops are needed.

i can't quite get my head round the fact that POWER10 can handle a THOUSAND in-flight instructions.
Comment 70 Luke Kenneth Casson Leighton 2020-10-19 21:29:07 BST
(In reply to Jacob Lifshay from comment #64)

> By contrast, using 8-bit lanes for masks means we'd have to add extra logic
> to handle VL > 8 and we'd have to handle scaling the result (an extra shift
> instruction), and we'd have to handle making sure that lanes have all bits
> set before inverting them. If we instead decide to have an on lane generate
> 0xFF instead of 0x01, then popcount is likewise messed up.
> 
> All of the above mess is solved efficiently by just having 1 bit per lane.

ok, the problem is that it's not that simple (never is).  there is no concept of "lanes" in SV.  or there is: they're the ALU widths (which will be either 64-SIMD or we did discuss doing 32-SIMD and splitting the regfile into HI-32 and LO-32, so that 64-bit operations need a pair of 32-wide ALUs to collaborate)

these ALU widths are completely divorced from architectural (ISA, SV) element widths, and consequently no amount of choice of bit-width for predicate lanes - whether it be 8-bit, 16-bit, is going to cut it.

the reason is because:

* when you request elwidth=8bit operations, you need *8* predicate bits
  to be allocated (routed) to a given 64-bit SIMD ALU
* when you request elwidth=16bit, that's 4 predicate bits
* elwidth=32 bit that's 2 predicate bits
* elwidth=64 bit is only 1

the routing and DM allocation on that - the subdivision of the 8-bit masks concept - is going to be a pig.
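the arithmetic behind those bullets, as a sketch: the number of predicate bits that must be routed to one SIMD ALU is the ratio of the ALU width to the architectural element width, so it varies per instruction:

```python
def pred_bits_per_alu(elwidth, simd_width=64):
    """Predicate bits that must be routed to one SIMD-capable ALU for a
    given element width: simd_width // elwidth (the bullets above)."""
    assert simd_width % elwidth == 0
    return simd_width // elwidth
```

no single fixed subdivision of the mask (8-bit, 16-bit, ...) matches all four cases at once, which is the routing problem being described.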


> Vectorized CRs still have a bunch of the above mess, because they aren't 1
> bit per lane.

again you're conflating the (false/inapplicable) concept of "lanes" as being an architectural concept in SV elements, where it can't actually be applied.  i know it works in Cray-style Vector ISAs, but it doesn't work here.

the only thing that's really going to work is to have *element* based predicates.  Cray-style architectures (including RVV) do this by allocating an entire element of a vector as a predicate (ignoring all but the *one* LSB).

our equivalent is "registers".  actual scalar registers.

in other words: to solve the problem that you highlighted (overlaps) we *need* each predicate to be in *independent* scalar registers.

and it turns out that PowerISA has something that we happen to already have planned to allocate DM space for them, even though they're only 4 bit wide: CRs.

so Vectorised CRs _are_ a bit of a mess, but they're a mess because unusual bitmanip ops don't exist for them (only AND/OR/NAND/XOR etc.) and that can be solved by just vectorising mfcr, and running int scalar bitmanip ops.  which we can macro-op fuse if we really want to (later).
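a sketch (an assumption about how a vectorised mfcr-style step might behave, not a defined instruction) of bridging from CR fields to scalar bitmanip: pull one chosen bit out of each 4-bit field and pack the results into a dense integer mask:

```python
def crs_to_dense_mask(cr_fields, bit=1):
    """Hypothetical 'vectorised mfcr' step: extract one chosen bit from
    each 4-bit CR field (bit 1 = EQ here, an assumed convention) and pack
    into a dense integer mask, on which the normal scalar int bitmanip
    ops can then run."""
    mask = 0
    for i, cr in enumerate(cr_fields):
        mask |= ((cr >> bit) & 1) << i
    return mask
```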


> Also, they have a ISA-level limiting effect on large VLs
> because of quickly running out of the 64 CRs when you need multiple masks
> (common in non-trivial shaders).

i think we can solve that one by doing 128 CRs.  that gives a total of 128x4=512 bits worth of predicate mask space.  and intregs can be used as "spill" if we absolutely have to.
Comment 71 Luke Kenneth Casson Leighton 2020-10-19 22:23:28 BST
(In reply to Jacob Lifshay from comment #68)

> > the question is, really: realistically what the heck are we doing VL@64 for,
> > that would use up that many CRs?
> 
> strncat?

oh yeah :)  although a case could be made - my point is, i think, that learning from POWER10's *eight*-way multi-issue and applying it to "elwidth=8 VL=16" would achieve the same end result but without overtaxing the regfiles.
 
> if we decide to used vectorized CRs we would also need instructions for
> creating dense bitvectors from CRs for all the bitmanip goodness. Similar
> instructions would be needed for 8-bit per lane masks. using 1-bit per lane
> masks bypasses all that since it's already the correct type of bitvector.

i think i kinda worked out (and showed) that this is only true - only practical - if the 1 bit gets its own DM row.  and if you're going to do that, CRs already fit that bill.

we _could_ conceivably do bit-level DM subdivision onto 64 bit integer regs but... no, please, no :)  it makes a mess of the "Register Cache" idea, unfortunately.

whereas CRs we have the freedom *to* decide how many we want to extend it to.
Comment 72 Jacob Lifshay 2020-10-19 23:36:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #70)
> (In reply to Jacob Lifshay from comment #64)
> 
> > By contrast, using 8-bit lanes for masks means we'd have to add extra logic
> > to handle VL > 8 and we'd have to handle scaling the result (an extra shift
> > instruction), and we'd have to handle making sure that lanes have all bits
> > set before inverting them. If we instead decide to have an on lane generate
> > 0xFF instead of 0x01, then popcount is likewise messed up.
> > 
> > All of the above mess is solved efficiently by just having 1 bit per lane.
> 
> ok, the problem is that it's not that simple (never is).  there is no
> concept of "lanes" in SV.  or there is: they're the ALU widths (which will
> be either 64-SIMD or we did discuss doing 32-SIMD and splitting the regfile
> into HI-32 and LO-32, so that 64-bit operations need a pair of 32-wide ALUs
> to collaborate)
> 
> these ALU widths are completely divorced from architectural (ISA, SV)
> element widths, and consequently no amount of choice of bit-width for
> predicate lanes - whether it be 8-bit, 16-bit, is going to cut it.
> 
> the reason is because:
> 
> * when you request elwidth=8bit operations, you need *8* predicate bits
>   to be allocated (routed) to a given 64-bit SIMD ALU

All you need is a bitwise right shifter to send the next 8 bits from the vector mask to the ALU; the ALU can have the few muxes needed to pick out which of those 8 bits it needs. I'd estimate somewhere around 30 muxes max, assuming the mask is translated to byte-level enables.
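a sketch of that byte-level-enable translation (an illustration of the idea, not RTL): each element-mask bit fans out over elwidth/8 byte enables of a 64-bit ALU:

```python
def byte_enables(mask8, elwidth):
    """Expand up to 8 element-mask bits into 8 byte-level enables for a
    64-bit SIMD ALU: mask bit for element e covers the elwidth//8 bytes
    that element occupies."""
    bytes_per_elem = elwidth // 8
    enables = 0
    for elem in range(64 // elwidth):
        if (mask8 >> elem) & 1:
            enables |= ((1 << bytes_per_elem) - 1) << (elem * bytes_per_elem)
    return enables
```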

> * when you request elwidth=16bit, that's 4 predicate bits
> * elwidth=32 bit that's 2 predicate bits
> * elwidth-64 bit is only 1
> 
> the routing and DM allocation on that - the subdivision of the 8-bit masks
> concept - is going to be a pig.

DM allocation should be pretty simple -- it's only needed at decode time: just 4 16-bit decoders with their outputs ORed together (assuming the integer mask registers are split into 16 subregisters), one decoder each for 8, 16, 32, and 64-bit elements.
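a sketch of what those decoders compute (under the assumption from this thread that mask bit i lives in subregister i mod 16): the set of mask subregisters a vector op with a given VL depends on:

```python
def subregs_touched(vl, nsub=16):
    """Bitmap of which of the nsub interleaved mask subregisters a vector
    op with VL elements depends on, assuming mask bit i is assigned to
    subregister i % nsub (the interleaving proposed in this thread)."""
    touched = 0
    for i in range(vl):
        touched |= 1 << (i % nsub)
    return touched
```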
> 
> 
> > Vectorized CRs still have a bunch of the above mess, because they aren't 1
> > bit per lane.
> 
> again you're conflating the (false/inapplicable) concept of "lanes" as being
> an architectural concept in SV elements, where it can't actually be applied.
> i know it works in Cray-style Vector ISAs, but it doesn't work here.

I'm using the word "lane" to mean the same thing as "element" since it's shorter to type, I'm ignoring subvectors for now. A lane/element is the thing that VL counts. There is a single conceptual bool per lane/element in a vector mask. a lane/element can be 8/16/32/64-bits (or up to 256-bits with subvectors -- 64x4-bits).
> 
> the only thing that's really going to work is to have *element* based
> predicates.  Cray-style architectures (including RVV) do this by allocating
> an entire element of a vector as a predicate (ignoring all but the *one*
> LSB).

IIRC, the Cray-1 uses a mask register with 1 bit per lane/element -- similar to what I'm advocating for.

> 
> our equivalent is "registers".  actual scalar registers.
> 
> in other words: to solve the problem that you highlighted (overlaps) we
> *need* each predicate to be in *independent* scalar registers.

no, we just need each predicate to be in a separate microarchitectural register -- we *don't* need separate ISA-level registers. hence why I'm advocating for having about 2 ISA-level integer registers split into many separate microarchitectural registers; all other integer registers are still 1:1 with microarchitectural registers (ignoring splitting into 8/16/32/64-bit element-sized pieces).
> 
> and it turns out that PowerISA has something that we happen to already have
> planned to allocate DM space for them, even though they're only 4 bit wide:
> CRs.

If we decide to go with the design I'm advocating for, there are 1x 32-bit CR split into 8x 4-bit subregisters (standard scalar), 128x 64-bit FP registers, 126x 64-bit integer registers not optimized for masks, and 2x 64-bit integer registers optimized for masking by each being split into 8 or 16 subregisters which are interleaved to form the whole 64-bit register, as described in comment 53.
> 
> so Vectorised CRs _are_ a bit of a mess, but they're a mess because unusual
> bitmanip ops don't exist for them (only AND/OR/NAND/XOR etc.) and that can
> be solved by just vectorising mfcr, and running int scalar bitmanip ops. 
> which we can macro-op fuse if we really want to (later).

mfcr has the wrong semantics, since we want each element to be 1 bit, but mfcr has each element being 4 bits. We would need to add a new instruction, or, we can bypass that whole mess by using integer registers as I suggested.
> 
> 
> > Also, they have a ISA-level limiting effect on large VLs
> > because of quickly running out of the 64 CRs when you need multiple masks
> > (common in non-trivial shaders).
> 
> i think we can solve that one by doing 128 CRs.  that gives a total of
> 128x4=512 bits worth of predicate mask space.  and intregs can be used as
> "spill" if we absolutely have to.

it's much cleaner at the ISA-level to just use integer registers instead of trying to force CRs to work. Also, for most practical purposes, because of how CRs are written by almost all instructions, you'd only have 128 bits of effective mask space -- exactly as much as I'm proposing for the integer register version.

Additionally, we wouldn't have to deal with the extra ISA-level mess created by more CRs, since those integer registers appear just like normal integer registers. And we wouldn't need to decode extra instructions just to be able to use popcount on a mask, saving icache space as well as time due to not needing as many instructions to do the same thing.
Comment 73 Jacob Lifshay 2020-10-19 23:47:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #71)
> (In reply to Jacob Lifshay from comment #68)
> 
> > > the question is, really: realistically what the heck are we doing VL@64 for,
> > > that would use up that many CRs?
> > 
> > strncat?
> 
> oh yeah :)  although a case could be made - my point is, i think, that
> learning from POWER10's *eight*-way multi-issue and applying it on on
> "elwidth=8 VL=16", would achieve the same end result but without overtaxing
> the regfiles.

true, though performance could be improved by keeping pipelines full with larger VLs.
>  
> > if we decide to used vectorized CRs we would also need instructions for
> > creating dense bitvectors from CRs for all the bitmanip goodness. Similar
> > instructions would be needed for 8-bit per lane masks. using 1-bit per lane
> > masks bypasses all that since it's already the correct type of bitvector.
> 
> i think i kinda worked out (and showed) that this is only true - only
> practical - if the 1 bit gets its own DM row. 

exactly what I'm advocating for, except I found a way to overcome the weakness of needing too many DM rows -- splitting the register as described in comment 53, where, for VL < 8 (or 16), it's exactly equivalent to splitting on every bit since the DM rows are assigned ISA-level bits in a rotating interleaved fashion.
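a sketch of that rotating interleaved split (my reading of the scheme, stated as an assumption: ISA-level bit i goes to subregister i mod nsub): for VL <= nsub every element's mask bit lands in its own DM-tracked subregister:

```python
def mask_bit_home(i, nsub=8):
    """Hypothetical interleaved split of a 64-bit mask register into nsub
    subregisters: ISA-level bit i lives in subregister i % nsub at
    internal position i // nsub, so for VL <= nsub each element's mask
    bit gets its own dependency-tracked subregister."""
    return (i % nsub, i // nsub)
```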

> and if you're going to do
> that, CRs already fit that bill.

so we take this hexagonal peg (CRs) and put it in the round hole (masking) -- it's technically not putting a square peg in a round hole -- it might fit better, but it's not optimal.

> 
> we _could_ conceivably do bit-level DM subdivision onto 64 bit integer regs
> but... no, please, no :)  it makes a mess of the "Register Cache" idea,
> unfortunately.

it would totally work, those 
> 
> whereas CRs we have the freedom *to* decide how many we want to extend it to.
Comment 74 Jacob Lifshay 2020-10-19 23:52:27 BST
(In reply to Jacob Lifshay from comment #73)
> (In reply to Luke Kenneth Casson Leighton from comment #71)
> > 
> > we _could_ conceivably do bit-level DM subdivision onto 64 bit integer regs
> > but... no, please, no :)  it makes a mess of the "Register Cache" idea,
> > unfortunately.
> 
> it would totally work, those

oops forgot to finish:

all we need to do is treat those two mask-optimized int regs as separate from the rest -- kinda like CTR is treated differently than the int regs.

> > 
> > whereas CRs we have the freedom *to* decide how many we want to extend it to.

the set of integer registers optimized for masking can be extended too, without needing all the mess of CRs.
Comment 75 Luke Kenneth Casson Leighton 2020-10-20 00:12:13 BST
(In reply to Jacob Lifshay from comment #72)

> All you need is a bitwise right shifter to send the next 8 bits from the
> vector mask to the ALU,

jacob you're not quite getting it: this is only possible to do ("a simple shift") if there are no Dependency Matrices involved.

a simple architecture such as microwatt can do such a shift trick.  an in-order system likewise.

an OoO system MUST track ALL objects regardless of size.

the significance of this had not really sunk in properly for me because i had not realised the latency problem you highlighted.

we have two choices at each end of the spectrum (and some in between)

* bitlevel predicate Dependency Matrices: one bit per element
* "one hit" (one scalar) predicate masks (with associated latency)

when doing bitlevel DMs one optimisation in the VL instruction issue phase is to notice the following:


* VL=16
* elwidth=16
* SIMD width=64
* therefore 4x ops can be batched to each ALU

*BUT*

to do that, you need 4 bits of predicate i.e. 4 predicate regs to be passed to those ALUs.

now, if you start having to get those 4 bits (which can't do the shifting you suggest *because they haven't been read yet*) it quickly becomes hell.

note that DMs track regs *before the contents are available*.  we don't *have* the contents of the predicate mask available at the time in order to be able to shift it!

consequently you have to do that shadow trick, and only when the reg is read *then* you can finish off the bitlevel analysis (shifting if necessary) and send it on to each ALU.

even with an internal PRF/ARF special designation the protection needed is considerable: i did once try the idea of making VL a pointer to a reg rather than an immediate, and hoo-boy was it convoluted.


you need to think through: what is the logic needed to implement 8-bit vector mask *when you do not have access to the mask yet*, how will the mask get into the Shadow Matrices, and how does it work for all possible elwidths and all possible values of VL.

basically it's *nowhere* near as "simple" as "a shifter".
Comment 76 Luke Kenneth Casson Leighton 2020-10-20 00:24:36 BST
(In reply to Jacob Lifshay from comment #74)
> (In reply to Jacob Lifshay from comment #73)
> > (In reply to Luke Kenneth Casson Leighton from comment #71)
> > > 
> > > we _could_ conceivably do bit-level DM subdivision onto 64 bit integer regs
> > > but... no, please, no :)  it makes a mess of the "Register Cache" idea,
> > > unfortunately.
> > 
> > it would totally work, those
> 
> oops forgot to finish:
> 
> all we need to do is treat those two mask-optimized int regs as separate
> from the rest -- kinda like CTR is treated differently than the int regs.

i get it.  they still need bit-level DM tracking, and if they are designated as int regs as well it becomes even more hell, at the point where they interact with "real" int regs.

if they are treated as completely separate regs (SPRs in effect) they need their own opcodes, and i really don't want to go down that route.

> > > 
> > > whereas CRs we have the freedom *to* decide how many we want to extend it to.
> 
> the set of integer registers optimized for masking can be extended too,

please understand: it really is too complex to track the dependencies for something that specialised.

> without needing all the mess of CRs.

think it through, jacob: vector CMPs and standard Rc=1 vectorised operations *still require a minimum of 64 CRs anyway*

if we have to have vector CRs anyway, they need to be DM tracked (individually).

if they have to be individually tracked then the logic to put them into the Shadow Matrices for predication is far less: no shifts, no masks, just use the DMs.

or we abandon CRs pretty much entirely for vectors, and this is not an option i am happy to consider: it will cause havoc for gcc and llvm compiler developers.
Comment 77 Luke Kenneth Casson Leighton 2020-10-20 00:35:07 BST
jacob, can i please ask you to go over Thornton's book (p126) and Mitch's book chapters, with a view to gaining a full understanding of the augmented precise 6600 system?

the reason i ask is because the ideas that you advocate are extremely challenging for me to analyse their full implications and then also explain them to you, when you may not necessarily understand the explanation *or* i might have misunderstood, and we have a lot of difficulty working out which it is.

also i do not want to be the sole exclusive dependency for implementing something this complex.
Comment 78 Jacob Lifshay 2020-10-20 02:01:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #75)
> (In reply to Jacob Lifshay from comment #72)
> 
> > All you need is a bitwise right shifter to send the next 8 bits from the
> > vector mask to the ALU,
> 
> jacob you're not quite getting it: this is only possible to do ("a simple
> shift") if there are no Dependency Matrices involved.

Yeah, that's true. I had been thinking that the shift operation would just be repeated each time new bits were available from their respective sources. Scrap that idea.

> an OoO system MUST track ALL objects regardless of size.

I was never advocating for not tracking all objects, just some of the objects are different kinds of objects and we should treat them differently.

A demo datapath:
https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.png
https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.svg

The demo datapath leaves the FU registers implicitly part of the corresponding ALU/FU combos due to me not wanting to draw for 4hr.

Muxes in the datapath diagram are actually bidirectional, they have separate muxes for each direction internally.

> the significance of this had not really sunk in properly for me because i
> had not realised the latency problem you highlighted.
> 
> we have two choices at each end of the spectrum (and some in between)
> 
> * bitlevel predicate Dependency Matrices: one bit per element
> * "one hit" (one scalar) predicate masks (with associated latency)
> 
> when doing bitlevel DMs one optimisation in the VL instruction issue phase
> is to notice the following:
> 
> 
> * VL=16
> * elwidth=16
> * SIMD width=64
> * therefore 4x ops can be batched to each ALU
> 
> *BUT*
> 
> to do that, you need 4 bits of predicate i.e. 4 predicate regs to be passed
> to those ALUs.
> 
> now, if you start having to get those 4 bits (which can't do the shifting
> you suggest *because they haven't been read yet*) it quickly becomes hell.
> 
> note that DMs track regs *before the contents are available*.  we don't
> *have* the contents of the predicate mask available at the time in order to
> be able to shift it!

yup.

> consequently you have to do that shadow trick, and only when the reg is read
> *then* you can finish off the bitlevel analysis (shifting if necessary) and
> send it on to each ALU.

Another possible scheme is to have each FU take the mask into its source latch whenever the mask is ready; then, if the mask bit is 0, the FU can signal the required circuitry to cancel itself. That way, the mask just becomes a normal dependency, rather than needing to be so special.

> even having an internal PRF ARF special designation: the protection needed,
> i did try once the idea of making VL a pointer to a reg rather than an
> immediate, and hoo-boy was it convoluted.
> 
> 
> you need to think through: what is the logic needed to implement 8-bit
> vector mask *when you do not have access to the mask yet*, how will the mask
> get into the Shadow Matrices, and how does it work for all possible elwidths
> and all possible values of VL.

obviously it's just another dependency -- just like a data input.

The exact dependencies can be calculated at instruction decode time and stored in latches/flip-flops wherever they are needed.
Comment 79 Jacob Lifshay 2020-10-20 02:29:30 BST
(In reply to Luke Kenneth Casson Leighton from comment #76)
> (In reply to Jacob Lifshay from comment #74)
> > (In reply to Jacob Lifshay from comment #73)
> > > (In reply to Luke Kenneth Casson Leighton from comment #71)
> > > > 
> > > > we _could_ conceivably do bit-level DM subdivision onto 64 bit integer regs
> > > > but... no, please, no :)  it makes a mess of the "Register Cache" idea,
> > > > unfortunately.
> > > 
> > > it would totally work, those
> > 
> > oops forgot to finish:
> > 
> > all we need to do is treat those two mask-optimized int regs as separate
> > from the rest -- kinda like CTR is treated differently than the int regs.
> 
> i get it.  they still need bit-level DM tracking, and if they are designated
> as int regs as well it becomes even more hell, at the point where they
> interact with "real" int regs.
> 
> if they are treated as completely separate regs (SPRs in effect) they need
> their own opcodes, and i really don't want to go down that route.

all that's needed is the logic to decode the register field as if it were part of the opcode field, and only for the few bitwise ops optimized for masks (and/andc/or) when the output reg and an input reg are mask regs -- otherwise it's decoded as a normal int operation and just has to wait for all bits to be ready, like any other op.

Everything else can just access them through the port on the mask regs connected to the integer data path. Dependencies can be handled by setting multiple dependency bits for the relevant subregisters when the register number matches, rather than the 1 dependency bit that would be set for any normal integer reg.

> > > > 
> > > > whereas CRs we have the freedom *to* decide how many we want to extend it to.
> > 
> > the set of integer registers optimized for masking can be extended too,
> 
> please understand: it really is too complex to track the dependencies for
> something that specialised.
> 
> > without needing all the mess of CRs.
> 
> think it through, jacob: vector CMPs and standard Rc=1 vectorised operations
> *still require a minimum 64 CRs anyway*

No they don't: part of the idea I'm proposing is that CRs *aren't* vectorized. Rc=1 is reinterpreted to mean something different when the operation is vectorized, probably to write a single CR with a bit indicating whether all results for the entire vector are zero; other bits can be decided later.

> or we abandon CRs pretty much entirely for vectors and this is not an option
> i am happy to consider, it will cause havoc for gcc and llvm conpiler
> developers.

Trust me, it really won't cause havoc. GCC and LLVM both target architectures that use 1-bit per element/lane for mask vectors: AVX512 and AMDGPU (and probably more). In fact, in LLVM IR, vector compare operations return a 1-bit per element/lane vector. Anything else has to use platform-specific intrinsics or have special conversion code added in LLVM's backend.
In LLVM, vector operations are treated as a totally different kind of operation than scalar operations all the way through the compiler, so having the vector operations use different semantics (e.g. input/output registers) than the scalar operations won't cause any problems. In fact, I'd guess that having vectorized CRs would cause more havoc, since it's more unusual: no other vector ISA I've heard of has that.
Comment 80 Jacob Lifshay 2020-10-20 02:31:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #77)
> jacob, can i please ask you to go over Thornton's book (p126) and Mitch's
> book chapters, with a view to gaining a full understanding of the augmented
> precise 6600 system?

Yeah, I'll go read them through again.

> the reason i ask is because the ideas that you advocate are extremely
> challenging for me to analyse their full implications and then also explain
> them to you, when you may not necessarily understand the explanation *or* i
> might have misunderstood, and we have a lot of difficulty working out which
> it is.

Sorry, not mentioning when I changed my mind and switched to a somewhat different idea probably contributed to the confusion.

> also i do not want to be the sole exclusive dependency for implementing
> something this complex.

I'm planning on helping.
Comment 81 Luke Kenneth Casson Leighton 2020-10-20 02:49:27 BST
(In reply to Jacob Lifshay from comment #78)

> Yeah, that's true. I had been thinking that the shift operation would just
> be repeated each time new bits were available from their respective sources.
> Scrap that idea.
> 
> > an OoO system MUST track ALL objects regardless of size.
> 
> I was never advocating for not tracking all objects, just some of the
> objects are different kinds of objects and we should treat them differently.

yehyeh, i get it: it was that difference, trying to think through "how would this proposed idea work as far as DMs go", it's really difficult.

> A demo datapath:
> https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.png
> https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.svg

ah brilliant.  will take a look tomorrow.
Comment 82 Luke Kenneth Casson Leighton 2020-10-20 15:45:08 BST
(In reply to Jacob Lifshay from comment #51)

> 5. It's ok to deviate from how Power's scalar ISA does things when there's a
> better way.

just to come back to this, i wrote up the fundamental design principles at the top here:
https://libre-soc.org/openpower/sv/

these are extremely important because the opposite of the advantages also applies: if we deviate too far from the idea of vectorising scalar OpenPOWER in a pure "block repetition of scalar" form, it not only becomes far too complex to implement: OPF, which has members with far more experience than we do, will reject it.

XER.SO (and its propagation into CRs) has always been a massive performance problem for PowerISA implementations: Paul told me that the only place it's really used as a result is in test suites!  therefore cutting XER.SO out of the vectorisation is easily justifiable.

likewise we *might* be able to make a good case for putting CA into CRs (in place of the SO bit) because this would allow massive chains of big integers to be dropped into the OoO pipelines, to create 1024-long adds and beyond.
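purely as a behavioural sketch (python, not hardware; the function name is made up) of why per-element carry-chaining gives arbitrarily wide adds:

```python
MASK64 = (1 << 64) - 1

def adde_chain(a_limbs, b_limbs):
    """add two equal-length little-endian lists of 64-bit limbs,
    propagating the carry from element to element exactly as a
    vectorised adde would if CA lived in a per-element CR field."""
    result, ca = [], 0
    for a, b in zip(a_limbs, b_limbs):
        s = a + b + ca
        result.append(s & MASK64)   # low 64 bits written to the dest element
        ca = s >> 64                # carry-out feeds the next element
    return result, ca
```

e.g. adde_chain([MASK64, 0], [1, 0]) ripples the carry from limb 0 into limb 1, giving ([0, 1], 0) — a 128-bit add out of two 64-bit element operations, extendable to 1024 bits and beyond.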
Comment 83 Luke Kenneth Casson Leighton 2020-10-20 16:51:08 BST
jacob although earlier i said that the concept of lanes doesn't exist, there is actually a physical way to do "lanes" in the original Cray sense, by reducing both the FU-REGs and FU-FU Dependency Matrices to sparse matrices.  feel free to add more example layouts to this page btw (so we can visualise proposed designs):

https://libre-soc.org/openpower/sv/example_dep_matrices/

basically how that works is, whilst each ALU is still 64-bit SIMD capable the basic assumption is that vector operations are issued on "aligned" register boundaries starting at multiples of some arbitrarily-determined amount (4 in the example).

thus the following is "valid":

* VL=12
* ADD.64 r0 <- r16, r32

where the following is NOT:

* VL=12
* ADD.64 r0 <- r17, r25

because r17 does not start on a multiple-of-4 boundary, and neither does r25.

this latter vector-instruction would be done by the *scalar* ALUs (S-ALU1, S-ALU2, S-LOGIC1, S-LOGIC2).

note that all ALUs are *still scalar*, it's just that - and this is the important bit - the instruction issue engine *never issues illegal combinations for which there does not exist a Dependency Matrix cell*
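a quick behavioural sketch (python; the function name is made up) of that issue-time legality check:

```python
# illustrative only: a vector op may be issued to the laned
# (sparse-DM) FUs only if every operand's starting register sits
# on a lane boundary (4 in the example layouts).

def lanes_usable(dest, src1, src2, lane_width=4):
    """True: op may go to the laned FUs (L0-L3).
    False: op must fall back to the scalar FUs (S-ALU1 etc.)."""
    return all(r % lane_width == 0 for r in (dest, src1, src2))

# ADD.64 r0 <- r16, r32 : legal for lanes
# ADD.64 r0 <- r17, r25 : must use the scalar FUs
```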

note also the following:

* there is full FU-Regs DM cell coverage for FUs marked "S"
* there is full FU-FU DM cell coverage for FUs marked "S"
* there is full FU-Regs and FU-FU DM cell coverage for FUs marked "L0"
* there is full FU-Regs and FU-FU DM cell coverage for FUs marked "L1"
* there is full FU-Regs and FU-FU DM cell coverage for FUs marked "L2"
* there is full FU-Regs and FU-FU DM cell coverage for FUs marked "L3"

what i am not sure about is whether to add DM cells inter V-S.  this would allow at least some operations such as the following to be done:

* ADD.64 r7 <- r16, r32

because whilst src1 and src2 (r16, r32) can be allocated across Lanes 0-3, the destination would otherwise have to go via the regfile: if another operation needs r7, *ALL* operations would entirely stall until that result had been written to r7, and only then could the new Read Hazard be created.

if however no such follow-up operation needing to read r7 is issued then no stall would be needed.

if however we leave those inter V-S cells blank, then such operations as "ADD r7<-r16,r32" would need to be done in the "S" FUs, jamming them up by mixing amongst scalar operations.

the reduction in the number of gates needed in the DMs is... massive.  also there is a massive reduction in the wiring/routing needed between regfiles.  i.e. the regfiles can also be stratified along similar "lanes" arrangements.

i know.  it's horrendously complex.  we've literally got to invent terminology as we go along.  i have no idea what to call this.
Comment 84 Luke Kenneth Casson Leighton 2020-10-20 19:33:13 BST
(In reply to Luke Kenneth Casson Leighton from comment #82)
> (In reply to Jacob Lifshay from comment #51)
> 
> > 5. It's ok to deviate from how Power's scalar ISA does things when there's a
> > better way.

to make it clear: because of the potential for rejection by OPF and by other implementors (hardware and software) if the deviation is too great, i disagree that deviation should be done "purely because it's better".

by contrast (XER.SO) it is very easy to make a case for improving things (when parallelised) where PowerISA is, due to the pace of innovation in silicon, completely broken.  XER.SO was great in 1995 but the READ-MODIFY-WRITE hazards it created, which were no serious performance bottleneck back then, make even scalar high-performance PowerISA in 2020 utterly borked, let alone if it's vectorised.

> just to come back to this, i wrote up the fundamental design principles at
> the top here:
> https://libre-soc.org/openpower/sv/
Comment 85 Luke Kenneth Casson Leighton 2020-10-20 21:28:53 BST
(In reply to Jacob Lifshay from comment #78)

> A demo datapath:
> https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.png
> https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.svg
> 
> The demo datapath leaves the FU registers implicitly part of the
> corresponding ALU/FU combos due to me not wanting to draw for 4hr.

:)

so the bit that needs drawing out is the sentence "mask registers are tracked by the dependency matrices".

draw it with pen, paper and a ruler (i have 3 books with graph paper now) if it's quicker.

or: overlay this diagram, editing it with gimp. https://libre-soc.org/3d_gpu/reorder_alias_bytemask_scheme.jpg

it doesn't have to be pretty
Comment 86 Luke Kenneth Casson Leighton 2020-10-22 03:52:27 BST
just a reminder jacob of what we have to track, register-wise:

* 8 "fast" SPRs (CTR, LR, TAR, SRR0/1)
* MSR is also in the "fast" list although
  i believe it may need its own DM bits
* 3x XER bits, currently 2 wide, SO CA OV
* 32 INTs (will be 128)
* 32 FP (will be 128)
* 8x CRs (proposed 128) as 4 bit

(all other SPRs i am recommending a "stall and flush")

even as a scalar processor that is a MASSIVE amount: an FU-REGs matrix around 75 registers wide!

no, we cannot leave the CRs (in scalar mode) "unmanaged", because that is how you get catastrophic register corruption.  we cannot ignore them either (scalar non-LibreSOC mode has to be fully OpenPOWER compliant)

therefore we simply have to have CR Dependency Tracking, for all 8 CRs.

in addition to that, the sheer number of regs means that we also need "register caches" to get the DM size down to "sane" levels.

if we have to do _that_ then including CRs and extending those to 128 is neither difficult nor problematic once the code is written for INT and FP.

whilst it may then on the face of it seem a perfectly reasonable next evolution to add bit-level DM tracking of an int that is specially treated as a mask, in a "register cache" context this actually means *bit-level PRF-ARF tracking*!

this does not seem to be sane :)
Comment 87 Luke Kenneth Casson Leighton 2020-10-23 09:18:15 BST
(In reply to Jacob Lifshay from comment #38)

> Why would you ever need to handle exceptions in the middle of a cmp
> instruction?

coming back to this, re-reading: all ops that produce CRs (incl. Rc=1) fall under the same question.  actually, all vector ops do, regardless of whether the results are

* vector of 64 bit regs only (Rc=0)
* vector of CRs only (cmp etc)
* vector of tuples 64bitreg+CR (Rc=1)


long latency pipelines/FSMs (SIN, COS, CORDIC) times long VL could turn out to be several thousand cycles which will definitely begin to interfere with interrupt servicing.

to solve this i spent some considerable effort working out how to do reentrant SV, which involves storing the current loop counter because it is effectively a "sub Program Counter".  if i recall correctly i think i even named it "sub-PC".

it requires the rule that the elements must be "sequentially completed" i.e. if you cancel element 3 then it is prohibited to allow 4 and above to hit the regfile.

the sub PC on return continues from the last non-cancelled, non-completed element.

this in combination with shadow cancellation gives the ability to drop incomplete vector elements on the floor, service the interrupt, and return to where things left off.

realistically you only do this for high priority interrupts (NMIs) because it's quite expensive and disruptive to drop large amounts of in-flight results.
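a behavioural sketch in python of the sequential-completion / sub-PC idea (names entirely made up, not from any spec):

```python
def run_vector(op, VL, state, interrupt_at=None):
    """execute elements state['srcstep']..VL-1 in strict order; return
    True if the whole vector completed, False if an interrupt
    suspended it with the sub-PC ('srcstep') preserved."""
    while state['srcstep'] < VL:
        if interrupt_at == state['srcstep']:
            # shadow-cancel everything from srcstep onward; elements
            # below srcstep have already hit the regfile, in order
            return False
        op(state['srcstep'])        # element result hits the regfile
        state['srcstep'] += 1
    state['srcstep'] = 0            # vector complete: sub-PC resets
    return True
```

calling it with interrupt_at=3 leaves srcstep at 3 with elements 0-2 committed; a second call resumes from element 3, exactly the "return to where things left off" behaviour.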
Comment 88 Luke Kenneth Casson Leighton 2020-10-23 21:28:49 BST
i keep thinking of small things here and need to record them :)

the issues with not breaking predicates down into "one per element" is:

1) the mask must be read (as a scalar) by a specially created Function Unit that, like the Branch Unit, has a Shadow Matrix row that pulls die/pass for each unit depending on each bit being 0 or 1 respectively.

2) for each in-flight instruction that we want predicated *there must also be a corresponding predicate Function Unit*

3) the distribution of those bits to SIMD units gets particularly hairy.

contrast this with the situation where predicate bits come from a register *on a per element basis*. (CRs happen to already exist in PowerISA and consequently are a good match here)

1) predicated vector instructions may be issued where:

* the source register element
* the dest register element **AND**
* the predicate bit register element

may all be sent **DIRECTLY** to a Function Unit, **WITHOUT** requiring an intermediary Predicate Unit in the way whose role would be to split out bits of a larger register.

in other words a "non-predicated" scalar operation is one where by default the predicate source is implicitly hardwired to an immediate "1" indicating "always do this operation".

this is pretty trivial

2) where SIMD is involved is a little trickier but also reasonably practical.

look at where we have had to add CR "full" read ports.  rather than force the CR Pipeline to do 8 separate CR reads or writes (mtcr, mfcr), the *entire* CR 0-7 is read/written via a special 32-bit-wide regfile port.

on detecting the situation where SIMD needs to be deployed the "full" CR port may be read, giving 32 bits containing 8 CRs.

a) for 8x8bit SIMD these entire 8 CRs can be thrown at a single 64 bit SIMD FU.

one 32bit CR read will go along with one 64 bit source reg read, it is just that VL jumps by 8 elements at a time.

b) for 4x16bit SIMD this is slightly hairy in that the 1st 4 CRs (CR0-CR3) need to be thrown at one FU (element n) and the 2nd 4 CRs (CR4-7) at another FU (element n+1)

here VL will be jumping by 4 each time, and although the exact same 32bit CR read is a Read Hazard for odd/even (MSBs ignored for odd, LSBs ignored for even) FUs we *may* be able to reduce the number of regfile reads by "broadcasting" the read to 2 simultaneous CompUnits.

c) for 2x32bit it is actually potentially more optimal to just have 2x possible single-CR predicate registers per FU

OR

d) we considered splitting FUs down into 32 bit anyway (HI32 reg, LO32 reg, HI and LO *collaborate* to do 64 bit calculations) and under these circumstances a 32 bit predicated vector operation would have *2 CR predicates anyway*: one for HI32, one for LO32

however in all circumstances it is critical to note that a "special Predicate Function Unit" neither exists nor is needed.

in other words the Issue Unit can calculate simply based on elwidth, VL, and the current Sub-PC value (0 to VL-1) exactly what Dependency Requests to send on to the DMs.
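as a rough behavioural sketch (python; the function name and exact mapping are illustrative, not part of any spec) of that Issue Unit calculation:

```python
# illustrative only: maps a vector element index to which 64-bit SIMD
# FU slot it occupies and which CR (of the 8 delivered by one 32-bit
# "full CR" read) predicates it.  assumes 64-bit-wide SIMD FUs and
# 8 CRs per full-port read, as in cases (a)-(d) above.

def predicate_slot(element, elwidth_bits):
    per_fu = 64 // elwidth_bits   # elements packed into one 64-bit FU
    fu_slot = element // per_fu   # which FU in the issue batch
    cr_index = element % 8        # which CR within the 32-bit read
    batch = element // 8          # which full-CR read covers this element
    return fu_slot, cr_index, batch

# 8x8-bit:  elements 0-7 all go to FU 0; one full-CR read covers them
# 4x16-bit: elements 0-3 -> FU 0 (CR0-3), elements 4-7 -> FU 1 (CR4-7)
```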

this simplicity and predictable regularisation will become critically important when it comes to doing multi-issue, which requires that transitive DepMatrix relationships be set up between instructions in the same issue batch.

note that the above is also possible when using an int register as a predicate however the caveats (design complexity disadvantages) apply from comment #86 and *do not apply* when using CRs.
Comment 89 Luke Kenneth Casson Leighton 2020-10-25 17:02:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #88)

> however in all circumstances it is critical to note that a "special
> Predicate Function Unit" neither exists nor is needed.

when i designed the original RV version of SV i had envisaged, as explained in earlier comments, using special "Predicate Function Units" which requested the (one, scalar, 64 bit int) predicate, cast the shadow across the operations it controls (all of the FUs calculating elements covered by that predicate) and on eventually reading that (one, int64) reg, issuing the required die/ok signals.

this you kindly pointed out, Jacob, interferes with Cray-style "chaining", which is something i had not taken into account.

there are other limitations as well: the use of a Predicate FU means that we have worst case double the number of FUs to lay down in the DMs due to each predicated instruction requiring the activation of not one FU but *two* FUs.

this is so costly that we really have to eliminate this approach unless we genuinely have no alternative.

therefore we are left with "one-for-one" options: one element, one predicate bit.  or more to the point: one DM row per predicate bit, one DM row per source, one DM row per dest.

[optimisations on top of this such as reading 8 CRs are still one-for-one, it's just that they are onebatch-for-onebatch]

contrast this with "one DM row covering multiple predicate bits that require a Predicate FU to distribute out to multiple ALUs".

any such proposal involving "multiple groups of predicate bits covered by a single DM row, which have to be distributed out by way of a Predicate FU", is just too costly.
Comment 90 Luke Kenneth Casson Leighton 2020-10-26 09:24:30 GMT
https://libre-soc.org/openpower/sv/predication/

jacob i updated this page to include a number of ideas: i will add a couple more for completeness (such as adding a new predicate register type and associated instructions)

i included what i *think* you might mean by the chunked mask idea, please do review what i wrote.

i also did an implementation analysis: unfortunately, the change-over between when the underlying scalar int reg switches from chunked-mask to actual integer is... exceedingly complex, particularly when combined with the (necessary) regfile caching / virtualisation.

as always though we need to be thorough in the comparative analysis and complete and document all ideas.
Comment 91 Luke Kenneth Casson Leighton 2020-11-04 22:37:07 GMT
just some notes (which i will edit into comment 0 later) it took a while to explain the various different predication ideas, however at the end of it Paul came up with a crucial insight:

that whatever we propose to OpenPOWER for inclusion in the ISA, if it is complex, it also has to show a corresponding increase in benefit.

i.e. if it is simpler, the "justification barrier" will be less.

i think this is very important to add to each proposal for predication, here.  i have partly done that, but not explicitly.

my point being, Jacob, that, leaving micro-architectural implementation details entirely aside: any modification of the OpenPOWER ISA (dropping CRs when vectorised, for example) has to have a clear and substantially compelling case when compared to the much simpler option of element-stratified vectorisation of CRs, respecting the principle that SV is a "hardware for-loop around essentially unmodified *scalar* pipelines"

XER.SO on the other hand is so disastrous to parallelism that it is easy to justify not supporting it at all in SV Mode
Comment 92 Jacob Lifshay 2020-11-06 05:09:36 GMT
I think it'll be worthwhile to also ask what people think of the different options on comp.arch (since they probably have more experience with different SIMD/Vector ISAs); I'll start a thread on comp.arch to that effect when I'm not brain-dead (probably tomorrow).
Comment 93 Luke Kenneth Casson Leighton 2020-11-06 10:31:29 GMT
(In reply to Jacob Lifshay from comment #92)
> I think it'll be worthwhile to also ask what people think of the different
> options on comp.arch (since they probably have more experience with
> different SIMD/Vector ISAs),

that's a good idea.  although caveat / bear in mind: as a summary, the
"mileage may vary" so to speak.  they're not being paid money, so the
currency, if you will, is instead "interaction and insights".  every
time i post there i tend to stick around and ask questions and provide
observations and insights on things _not_ related to what i (we) are
doing.

> I'll start a thread comp.arch to that effect
> when I'm not brain-dead (probably tomorrow).

melted :)

what i will do is, i will leave it to you to "lead" that thread ok?
hmm probably best to raise a new bugreport, sub- of this one.
(edit: bug #527)
Comment 94 Luke Kenneth Casson Leighton 2020-11-18 13:52:37 GMT
https://bugs.libre-soc.org/show_bug.cgi?id=238#c35

realised that SVPrefix swizzle should not be under Compressed discussion
Comment 95 Luke Kenneth Casson Leighton 2020-11-18 15:20:36 GMT
(moving to bug #213)

(In reply to cand from comment #36)
> I guess we're talking past each other. I'm saying using a lookup table lets
> you save bits. Instead of 4+8 per vec4, you have 8.

let's work out why.

there are two separate and distinct things needed here (both quite normally provided by swizzle)

1) the ability to select 1, 2, 3 or 4
   parts of a vec4 to perform the
   vector-computation on.  examples:

   # select 2 elements from a XYZW vec4
   fadd v1.XY, v2.WZ, v3.WZ

   # select 3 elements from an ARGB vec4
   fmul v1.RGB, v1.RGB, v3.AAA
 
   this latter would be for example to 
   linearly multiply the RGB by the 
   transparency (A of ARGB)

2) the ability to select any part of a vec4 to place it into any other position in a vec4.


question:

   how, if there are only 8 bits, is it possible to specify that some of the vec4 elements are to be ignored?

you can't... unless there is a predicate mask.

there are 2 possible ways that can be encoded, due to a quirk of swizzle:

1) a 4 bit mask.  the elements in the vec4 with 0 set are ignored, just as with standard predication.  in fact, it is predication.

2) use an 8 bit swizzle to move things into the right order, even if they are not to be used... and then set SUBVL to 2 or 3, ignoring the upper elements.

this latter is wasteful.  the mv takes place, taking up register port bandwidth, but the values are thrown out?  doesn't seem wise to me.

BUT

with SVPrefix containing the SUBVL we have a trick: the SUBVL applies to that operation, right there, right then.

therefore we *do* have all the information needed to ignore the unneeded bits of the 2x2x2x2 swizzle mask.

(that means we can actually fit 2 of them into 16 bits for a SV-64 encoding!)

unfortunately though this trick will still require a follow-up MV to get the altered elements back into their target vec4 positions:

    # select 2 elements from a vec4,
    # making a vec2 temporary
    SUBVL=2 fadd v2.YZ, v3.XW, v4.YW

    # now get the contents of v2 back into
    # v3 XW positions...  errr... how?
    SUBVL=2 fmv v3.XW, v2.YZ

would that actually work? y'know, i think it might.  actually it would be:

    SUBVL=2 fadd v2.YZXX, v3.XWXX, v4.YWXX
    SUBVL=2 fmv v3.XWXX, v2.YZXX

except that because SUBVL=2 the top 2 swizzle indices would be ignored.
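as a rough sketch in python (the bit-packing order of the 2-bit indices is illustrative, not a spec):

```python
# illustrative only: an 8-bit swizzle is four 2-bit indices
# (X=0 Y=1 Z=2 W=3); with SUBVL=2 only the first two indices are
# decoded and the upper bits of the mask are ignored.

def apply_swizzle(vec4, swiz8, subvl):
    """select subvl elements of vec4 using the low 2*subvl bits
    of an 8-bit swizzle mask (2 bits per index, LSB-first)."""
    return [vec4[(swiz8 >> (2 * i)) & 0b11] for i in range(subvl)]
```

so v3.XW with SUBVL=2 decodes only indices 0 and 3 from the mask, exactly the "top 2 swizzle indices ignored" behaviour.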

what's your thoughts there?
Comment 96 Luke Kenneth Casson Leighton 2020-11-18 15:33:17 GMT
(In reply to Luke Kenneth Casson Leighton from comment #95)

> there are two separate and distinct things needed here (both quite normally
> provided by swizzle)
> 
> 1) the ability to select 1, 2, 3 or 4
>    parts of a vec4 to perform the
>    vector-computation on.  examples:

> 2) the ability to select any part of a vec4 to place it into any other
> position in a vec4.

neither of these things is possible to cover together with a lookup table of less than 4+8 bits (or 2+8 bits).

you can however say that if SUBVL=2 then if you have 16 available bits you can:

* use the 1st 4 for src1
* use the 2nd 4 for src2
* etc

and for vec3

* use the 1st 6 for src1
* use the next 6 for src2

finally for vec4 you have to compromise
because 2x 8 bits only allows you to cover src+dest, or src1+src2.
Comment 97 cand 2020-11-18 15:59:06 GMT
You're right that I missed the option to ignore an element, however adding that only makes it 5^4 = 625 options, which fits in 10 bits instead of 12.

Wrt your move example, I'm not getting why the move is necessary.
SUBVL=2 fadd v2.YZXX, v3.XWXX, v4.YWXX
SUBVL=2 fmv v3.XWXX, v2.YZXX

Why not directly
SUBVL=2 fadd v3.XWXX, v3.XWXX, v4.YWXX
?
Comment 98 Luke Kenneth Casson Leighton 2020-11-18 16:06:13 GMT
(In reply to cand from comment #97)
> You're right that I missed the option to ignore an element, however adding
> that only makes it 5^4 = 625 options, which fits in 10 bits instead of 12.

* 4 bits for a predicate mask
* 2+2+2+2 bits for indices X, Y, Z, W.

which is 12 bits

or

* 2 bits for SUBVL=1/2/3/4
* 2+2+2+2 bits for indices X, Y, Z, W

which is 10 bits.  actually the SUBVL=2 applies globally to the entire
instruction so it is 2 + (2+2+2+2) * Num_of_src_and_dest_operands


> Wrt your move example, I'm not getting why the move is necessary.
> SUBVL=2 fadd v2.YZXX, v3.XWXX, v4.YWXX
> SUBVL=2 fmv v3.XWXX, v2.YZXX
> 
> Why not directly
> SUBVL=2 fadd v3.XWXX, v3.XWXX, v4.YWXX
> ?

how many bits required to specify the full set of combinations of swizzle
positions on src1?

how many bits required to specify the full set of combinations of swizzle
positions on src2?

how many bits required to specify the full set of combinations of swizzle
positions on dest?

what is the total of those three?

can any optimisations be made on that? (the answer to that question is no
because they are fully independent indices on all src and all dest)

how much space can we spare in an encoding without it going completely insane?
answer: 16 bits.

work backward from the answer "16 bits is what can be spared".
Comment 99 cand 2020-11-18 16:31:57 GMT
If you mean 24 > 16, then the example was confusing - if you meant to illustrate dst can't have a swizzle in a 3-component instr, then it shouldn't have had one.

However that is a very common pattern in shaders. "foo = bar.xxyy + baz.zzzz" - the dst vec very often has no swizzle. This would need measuring, perhaps using the shaderdb from mesa, but my gut feeling is that the extra move, required only when a dest swizzle is needed, would not be needed often. A more compact encoding may be worth it.
Comment 100 Luke Kenneth Casson Leighton 2020-11-18 16:54:53 GMT
(In reply to cand from comment #99)
> If you mean 24 > 16, 

yes, because src1, src2, dest requires 3x8 which is greater than 16.
however... you just highlighted something important below...

> then the example was confusing - if you meant to
> illustrate dst can't have a swizzle in a 3-component instr, then it
> shouldn't have had one.

yes, agreed.
 
> However that is a very common pattern in shaders. "foo = bar.xxyy +
> baz.zzzz" - the dst vec very often has no swizzle.

interesting.


> This would need
> measuring, perhaps using the shaderdb from mesa, but my gut feeling is that
> the extra move for when dest swizzle is needed would not be needed often. A
> more compact encoding may be worth it.

and if the compiler can track the positions of x y z and w as they swap
"lanes" - across vec4[0/1/2/3] - then you can pre-arrange them all so that they *end up* in the right (final) destination positions, with that being (by convention, for convenience) always 0/1/2/3.

this is probably why there is no swizzle, ever, on dest.

good catch, lauri.
Comment 101 Jacob Lifshay 2020-11-19 02:34:16 GMT
One additional note for swizzling: it's very common to want to put constant 0 or 1 in elements, so, if there's space, I think we should try to encode that in the swizzles.

I would expect there to be circuitry in the instruction decoder to calculate which input elements are actually used by the swizzle and skip reading registers for the unused input elements, the circuit shouldn't be more than a few dozen gates. That way, we don't have to use up additional bits on something we could trivially calculate.

Taking both those into account: 6 options for 4 elements gives 6^4 = 1296 combinations -- 11 bits. I'm sure we could find a relatively simple encoding for that.
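A rough sketch of one such encoding (Python; purely illustrative, not a proposed bit layout):

```python
# illustrative only: each of the 4 element slots picks one of
# X, Y, Z, W, const-0, const-1 (6 options), packed as a base-6
# number: 6**4 = 1296 values, which fits in 11 bits.

OPTIONS = ['X', 'Y', 'Z', 'W', '0', '1']

def encode_swizzle(picks):
    """picks: list of 4 options from OPTIONS -> integer in 0..1295."""
    value = 0
    for p in reversed(picks):
        value = value * 6 + OPTIONS.index(p)
    return value

def decode_swizzle(value):
    """inverse of encode_swizzle."""
    picks = []
    for _ in range(4):
        picks.append(OPTIONS[value % 6])
        value //= 6
    return picks
```

every encoded value fits under 1 << 11, and encode/decode round-trip, so an 11-bit field is indeed sufficient for the 6-option swizzle.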
Comment 102 Luke Kenneth Casson Leighton 2020-11-19 03:56:55 GMT
(In reply to Jacob Lifshay from comment #101)
> One additional note for swizzling: it's very common to want to put constant
> 0 or 1 in elements, so, if there's space, I think we should try to encode
> that in the swizzles.

that looks familiar... *click*
https://bugs.libre-soc.org/show_bug.cgi?id=139#c68
Comment 103 Luke Kenneth Casson Leighton 2020-11-19 19:04:04 GMT
(In reply to Jacob Lifshay from comment #101)
> One additional note for swizzling: it's very common to want to put constant
> 0 or 1 in elements, so, if there's space, I think we should try to encode
> that in the swizzles.

rright.  ok i remember the discussion, it was around VBLOCK where there was more space.  for SVP the prefix pressure was too great so if i recall we were thinking of separate swizzle and swizzlei instructions and using macro-fusion.

> I would expect there to be circuitry in the instruction decoder to calculate
> which input elements are actually used by the swizzle and skip reading
> registers for the unused input elements, the circuit shouldn't be more than
> a few dozen gates.

that's basically exactly what scalar immediates do (see power_decode2.py, DecodeInA when RA==0) and what you are describing is no different.

put a VL or SUBVL external loop around it, and um it's trivial.

> That way, we don't have to use up additional bits on
> something we could trivially calculate.

the space used however in the opcode is not so trivial impact-wise howeverrr...
 
> Taking both those into account: 6 options for 4 elements gives 6^4 = 1296
> combinations -- 11 bits. I'm sure we could find a relatively simple encoding
> for that.

well, i would prefer less complexity in the decoder.  i don't know if you're aware but right now, PowerDecoder2 with just integer scalar PowerISA 3.0B is a staggering 5,000 gates.

some reverse-engineering analysis of POWER9 determined it has a *2 stage* instruction decoder!

we do have a way out though: SV-C64 and possibly even SV-C48 (compressed 16 bit ISA with a 32 bit or 48 bit prefix).

what we could do is use 2 more major opcodes that work something like this:

* 5 bits v3.0B Major opcode(s)
* 11 bits SV Vector Context (incl SUBVL)
* 16 bits Compressed Instruction
* 16 or 32 bits "swizzle" and other data
  *including* immediate-or-swap

unfortunately we would either have to:

* set one of the CBank bits to indicate
  "swizzle mode"
* use two more precious v3.0B Major Opcodes
* start reserving "swizzle" Compressed
  opcodes.

using yet more v3.0B Major Opcodes we may actually have to start dropping instructions to do so.  candidates include mulli, twi, tdi and lq, with moving "sc" elsewhere and dropping madd as close second level priorities

adding special "swizzle instructions" is the Way Of SIMD Madness

using CBank settings seems the most sane approach although it too is costly in terms of state and space.

none of these options is good!
Comment 104 Luke Kenneth Casson Leighton 2020-11-20 00:37:15 GMT
(In reply to Luke Kenneth Casson Leighton from comment #103)

> * use two more precious v3.0B Major Opcodes

i did a review: there are around 14 major opcodes that can be freed up by not having VSX, twi, tdi, mulli, bcd and lq.

remember that those all "come back" by dropping into standard v3.0B compatibility mode.
Comment 105 cand 2020-11-20 07:30:14 GMT
It may be stating the obvious, but while integer 0 is float 0 is fp16 0, each has a different 1. (not sure if you intend to support fp16)
Comment 106 Jacob Lifshay 2020-11-20 08:00:45 GMT
(In reply to cand from comment #105)
> It may be stating the obvious, but while integer 0 is float 0 is fp16 0,
> each has a different 1. (not sure if you intend to support fp16)

yup, integer instructions would have different swizzle values than fp instructions. the swizzle value would be determined by element type. We could possibly get away with just 0 constants for int swizzles, since 1 is rarer, but fp should have 0 and 1.

We *need* fp16 to have an efficient gpu, since many programs use fp16 textures or fp16 framebuffers. We could possibly get away with just doing computations as fp32 and only supporting fp16 for load/store/conversion, but fp16 computations would be nice. While we're at it we could also add bf16 which is much more suited to machine learning.
Comment 107 Luke Kenneth Casson Leighton 2020-11-20 17:41:15 GMT
(In reply to Jacob Lifshay from comment #106)
> (In reply to cand from comment #105)

> fp16 computations would be nice. While we're at it we could also add bf16
> which is much more suited to machine learning.

creating a non-uniform SV extension is... anomalous.  by that i mean that applying SV Prefixes is either uniform and consistent or it is not called SV Prefixing.  i.e. the prefixes and the whole SV concept applies uniformly as an abstract independent for-loop with complete lack of knowledge of the element-based instruction, or not at all.

therefore it applies to FP16 LD/STs *and* to FP16 computations... or not at all.

(there are a very small number of circumstances where this is not true: vec4 normalisation and dotproduct are a couple of them.  i'm not comfortable with this but we have to be pragmatic)

to do otherwise explicitly requires interfering hardware at the decoder level to get it to reject certain opcodes from being vectorised, making a hard and critical dependence between two layers of decoding that simply should not exist (which is why i am not happy about norm or dotproduct)

this is going to be enough of a nuisance as it is (certain opcodes simply cannot be vectorised, such as twi, sc and so on).

*underneath* the actual ALU may go "er i don't have FP16 HW so i will do this as FP32 then chuck away some bits" however that's down to individual implementors to make that decision and yet that decision still has absolutely nothing to do with SV Compliance.

regarding BF16, there is one free slot available in the 2bit encoding "elwidth". yes hypothetically these may be:

* bf16
* fp16
* fp32
* default

which means that applying the elwidth override to 64 bit opcodes gives us "full coverage" of available options, as long as an override can be applied to each of src and dest.

however one thing not in our favour is that OpenPOWER is designed to waste the full 64 bit reg on storing FP numbers as if they were always FP64.

the bits of an FP32 are *NOT* kept together in the LSBs/MSBs of a 64 bit reg: the mantissa is *automatically* re-encoded to be placed in the mantissa FP64 bits and likewise the exponent.

this is rather inconvenient because instructions to convert between FP32 and FP64 do not exist (conversion is a nop), making it difficult to pack multiple FP32/FP16 elements into an FP reg and extract them again when we want to convert them to FP32/FP64 vectors.

we may be forced to add conversion opcodes here.  needs thought.
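a quick python demonstration of that storage rule (behavioural only, using IEEE754 bit-patterns):

```python
# behavioural only: shows that a single-precision value held in an
# OpenPOWER FP register is stored re-encoded as FP64, so the FP32
# bit-pattern is not literally present in the register's low 32 bits.
import struct

def fp32_raw_bits(x):
    """the IEEE754 single-precision bit-pattern of x."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def fp32_as_stored(x):
    """round x to single precision, then show the 64-bit pattern the
    register actually holds (the value re-encoded as a double)."""
    single = struct.unpack('<f', struct.pack('<f', x))[0]
    return struct.unpack('<Q', struct.pack('<d', single))[0]
```

for 1.5 the FP32 pattern is 0x3FC00000 but the register holds 0x3FF8000000000000: both exponent and mantissa have been repositioned, which is why packed FP32 vectors in a 64-bit FP reg are awkward.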
Comment 108 Luke Kenneth Casson Leighton 2020-11-20 18:11:19 GMT
(In reply to Luke Kenneth Casson Leighton from comment #107)

> which means that applying the elwidth override to 64 bit opcodes gives us
> "full coverage" of available options, as long as an override can be applied
> to each of src and dest.

i mean on conversion ops, MVs etc.


> *underneath* the actual ALU may go "er i don't have FP16 HW so i will do
> this as FP32 then chuck away some bits" however that's down to individual
> implementors to make that decision and yet 

i did this in spike.  it was far too inconvenient to add elwidth override interaction on FP16-FP32-FP64 arithmetic, so i did everything at FP64 and converted in and out as needed.

this had disadvantages in that some exceptions or rounding which should have occurred did not.
Comment 109 Luke Kenneth Casson Leighton 2020-11-20 21:09:38 GMT
(In reply to Jacob Lifshay from comment #101)
> One additional note for swizzling: it's very common to want to put constant
> 0 or 1 in elements, so, if there's space, I think we should try to encode
> that in the swizzles.

started writing this up

https://libre-soc.org/openpower/sv/vector_swizzle/
Comment 110 Luke Kenneth Casson Leighton 2020-11-21 10:51:58 GMT

    
Comment 111 Luke Kenneth Casson Leighton 2021-01-17 17:12:06 GMT
https://libre-soc.org/openpower/sv/svp64/appendix/

it just occurred to me that we actually need two different kinds of reduction:

* scalar accumulator O(VL)
* vector tree-based map-reduce O(VL log VL)

the first is dead simple to identify:

* destination is a scalar
* destination is one of the sources

the most obvious one there of the first type is FMA.
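The two kinds of reduction can be sketched as follows (hypothetical helper names; the tree version combines adjacent pairs each round, so the ops within a round are independent and can run in parallel):

```python
import operator

def scalar_accumulator_reduce(op, vec):
    """Serial reduction into a scalar accumulator: one op per element, O(VL)."""
    acc = vec[0]
    for x in vec[1:]:
        acc = op(acc, x)
    return acc

def tree_reduce(op, vec):
    """Tree-based reduction: adjacent pairs combined each round,
    giving O(log VL) rounds of mutually-independent operations."""
    vec = list(vec)
    while len(vec) > 1:
        nxt = [op(vec[i], vec[i + 1]) for i in range(0, len(vec) - 1, 2)]
        if len(vec) % 2:        # odd element carries over to the next round
            nxt.append(vec[-1])
        vec = nxt
    return vec[0]

v = [1, 2, 3, 4, 5]
print(scalar_accumulator_reduce(operator.add, v))  # 15
print(tree_reduce(operator.add, v))                # 15
```

For associative operators the two produce identical results; the difference is purely in the available parallelism.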
Comment 112 Jacob Lifshay 2021-01-17 20:04:30 GMT
(In reply to Luke Kenneth Casson Leighton from comment #111)
> https://libre-soc.org/openpower/sv/svp64/appendix/
> 
> it just occurred to me that we actually need two different kinds of
> reduction:
> 
> * scalar accumulator O(VL)

assuming you mean serial reduction, where none of the per-element operations can be run in parallel (except for a few special cases).

> * vector tree-based map-reduce O(VL log VL)

parallel reduction

> 
> the first is dead simple to identify:
> 
> * destination is a scalar
> * destination is one of the sources
> 
> the most obvious one there of the first type is FMA.
Comment 113 Luke Kenneth Casson Leighton 2021-01-17 21:27:07 GMT
(In reply to Jacob Lifshay from comment #112)
> (In reply to Luke Kenneth Casson Leighton from comment #111)
> > https://libre-soc.org/openpower/sv/svp64/appendix/
> > 
> > it just occurred to me that we actually need two different kinds of
> > reduction:
> > 
> > * scalar accumulator O(VL)
> 
> assuming you mean serial reduction, where none of the per-element operations
> can be run in parallel (except for a few special cases).

ah good point.  MIN/MAX, XOR, OR, AND are definitely paralleliseable (into an accumulator), probably MUL and ADD as well.  things like SUB and DIV are a little weird.

> > * vector tree-based map-reduce O(VL log VL)
> 
> parallel reduction

yep nice point, noted.
Comment 114 Jacob Lifshay 2021-01-17 22:17:09 GMT
(In reply to Luke Kenneth Casson Leighton from comment #113)
> (In reply to Jacob Lifshay from comment #112)
> > (In reply to Luke Kenneth Casson Leighton from comment #111)
> > > https://libre-soc.org/openpower/sv/svp64/appendix/
> > > 
> > > it just occurred to me that we actually need two different kinds of
> > > reduction:
> > > 
> > > * scalar accumulator O(VL)
> > 
> > assuming you mean serial reduction, where none of the per-element operations
> > can be run in parallel (except for a few special cases).
> 
> ah good point.  MIN/MAX, XOR, OR, AND are definitely paralleliseable (into
> an accumulator), probably MUL and ADD as well.

yup, for integer ops only (though fp max and min would also work, depending on the exact IEEE754 function used): anything that is an associative operator can be reduced in parallel.
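This associativity condition can be checked concretely; rounding is also exactly why most FP ops fall outside it (a minimal sketch with IEEE754 doubles):

```python
# integer add is associative, so a reduction tree may regroup freely
assert (1 + 2) + 3 == 1 + (2 + 3)

# IEEE754 double add is not: regrouping changes the rounding
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)   # 0.6000000000000001 0.6 False
```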

>  things like SUB, DIV, those
> are a little weird.

And the majority of floating-point ops.

though integer sub could be parallelized by doing negations and a parallel add-reduce.
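That transformation can be sketched as follows (hypothetical function names):

```python
def sub_reduce_serial(vec):
    """((v0 - v1) - v2) - ... : the serial definition of a sub-reduce."""
    acc = vec[0]
    for x in vec[1:]:
        acc -= x
    return acc

def sub_reduce_parallel(vec):
    """Same result via negation: v0 + (-v1) + (-v2) + ... is an add-reduce,
    and integer add is associative, so it can be done as a tree."""
    return sum([vec[0]] + [-x for x in vec[1:]])

v = [100, 5, 7, 3]
print(sub_reduce_serial(v), sub_reduce_parallel(v))   # 85 85
```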

> > > * vector tree-based map-reduce O(VL log VL)
> > 
> > parallel reduction
> 
> yep nice point, noted.

it just occurred to me that when context-switching in the middle of a parallel-reduction, the vstart register is not actually the starting index, so we should call it resume-step (the step at which the next SV instruction resumes its progress) or something.
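The resume-step idea can be sketched as follows (hypothetical names: one "step" here is a whole round of the reduction tree, so the saved state is a round counter plus the partially-reduced vector, not an element index):

```python
import operator

def tree_reduce_step(op, vec):
    """One round of a tree reduction: combine adjacent pairs."""
    nxt = [op(vec[i], vec[i + 1]) for i in range(0, len(vec) - 1, 2)]
    if len(vec) % 2:
        nxt.append(vec[-1])
    return nxt

# run round by round; a context switch mid-reduction would have to save
# (resume_step, vec) -- the number of completed rounds plus the partial
# results -- which is why "vstart" (an element index) is a misleading name
vec, resume_step = [1, 2, 3, 4, 5, 6, 7, 8], 0
while len(vec) > 1:
    vec = tree_reduce_step(operator.add, vec)
    resume_step += 1
print(vec[0], resume_step)   # 36 3
```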
Comment 115 Luke Kenneth Casson Leighton 2021-01-18 00:47:42 GMT
(In reply to Jacob Lifshay from comment #114)

> yup, for integer ops only (though fp max and min would also work, depending
> on the exact IEEE754 function used). anything that is an associative
> operator.

ah that was the word i couldn't remember

> >  things like SUB, DIV, those
> > are a little weird.
> 
> And the majority of floating-point ops.
> 
> though integer sub could be parallelized by doing negations and a parallel
> add-reduce.

yeah i pointed this out
 
> > > > * vector tree-based map-reduce O(VL log VL)
> > > 
> > > parallel reduction
> > 
> > yep nice point, noted.
> 
> it just occurred to me that when context-switching in the middle of a
> parallel-reduction, the vstart register is not actually the starting index,
> so we should call it resume-step (the step at which the next SV instruction
> resumes its progress) or something.

hmm hmm good point.  more descriptive/meaningful. needs to be documented on the SVSTATE SPR page