Bug 1056 - questions and feedback (v2) on OPF RFC ls010 (Simple-V Zero-Overhead Loop Prefix Subsystem)
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/rf...
Depends on: 995 1047 1080 1176 1045 1084 1161
Blocks: 1096 1179
Reported: 2023-04-13 19:17 BST by Luke Kenneth Casson Leighton
Modified: 2023-11-14 01:35 GMT
4 users

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 3500
budget (EUR) for this task, excluding subtasks' budget: 3500
parent task for budget allocation: 1012
child tasks for budget allocation: 1077 1114
The table of payments (in EUR) for this task; TOML format:
lkcl  = { amount = 2200, submitted = 2023-10-10 }
jacob = { amount = 1300, submitted = 2023-10-15, paid = 2023-11-10 }


Attachments
vle book 3.18.1 spr handling (224.98 KB, image/png)
2023-05-29 13:03 BST, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2023-04-13 19:17:59 BST
feedback now being sought on ls010 v2

---

actions:

* TODO: bug #1175    investigate mcrxrx mcrxr mtcr mfocr etc as "CR-based" ops
                     *not* CR-field-based SVP64. in the spec; not implemented yet.
* DONE: bug #1161    fix EXTRA2/3 algorithm bug in spec and ISACaller
* TODO: comment #5   "features" cross-ref to sections.
* DONE: comment #46  base on v3.1B
* DONE: comment #4   bibliography of LDIR CPIR REP
* DONE:    use "defined word-instruction" / "defined 32-bit insn"
* TODO: comment #12  create hypothetical demo split-fields RA RT etc
        https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/
* TODO: comment #5   illegal trap if regfile exceeded by element access
* TODO: comment #6   decide what to do on SPRs. trap-emulate critical
                     see comment #20
* TODO: comment #8   investigate Strict Program Order (use it)
* DONE: comment #5   vectorize. blech.
* DONE: comment #7   no camelcase. blech.
* TODO: comment #8   interrupts in Book III 7.5 p1250 appx
* TODO:    pubv3.1 III 7.5 p1262, "SVP64 Unavailable"? evaluate.
* TODO: comment #8   "Current Sub-Iteration  Counter" or similar.
* TODO: comment #8   "just as they avoid using Floating-Point, ...."
* TODO: comment #8   interrupt handler to know whether REMAP SPRs in use?
                     (comment #66 idea Interrupt offset if SVSTATE.SVme)
* TODO:    write up "Structure Packing" (parent: bug #953? bug #1019?)
* DONE:    bug #1077 review SUBVL move to SVSTATE. answer No (3D shaders)
* TODO:    review bug #995 (alignment)
* TODO:    standard machine format for "Power ISA v3 page refs"
* TODO: comment #52  create "better" (trial) spec-mod page
                 https://libre-soc.org/openpower/sv/rfc/ls010/trial_addi/
* TODO: comment #71 performance counters (new issue)
* TODO:    bug #1116 Vector-Immediate (in RM.normal)
                 https://libre-soc.org/openpower/sv/normal/

completion log:

"hypothetical sv.addi" page:
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=6759db39e

v3.1B
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=e088994dbb

LDIR CPIR REP bib
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=efd662cb0a2

Vectorize
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=e7b958bd6
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=f560e31ee70

Defined word
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=e7b958bd6
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=337dbea2b80

CamelCase
* https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5c79fe53ff
Comment 1 Luke Kenneth Casson Leighton 2023-05-13 12:31:30 BST
in bug #1083 the former space of predresult is to be allocated
to VLi as a high priority in DD FFirst. minor change to crops
also needed.
Comment 2 Luke Kenneth Casson Leighton 2023-05-14 17:13:50 BST
a rather annoying quirk showed up: when LD/ST-update was changed to EXTRA3
and used for a linked-list-walking test, a problem appeared where
the "normal" check "assert RA!=RT" must now be

    "assert RA||EXTRA3_bits != RT||EXTRA3_bits"

with the usual method being to write out a prefix as ".long XXXXX"
followed by its 32-bit Defined Word (suffix), it will be necessary
as a temporary workaround to have pysvp64dis and SVP64Asm bypass this
check and perform their own.

this needs a corresponding spec update, as well as binutils changes
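
as a rough runnable sketch of that extended check (field widths and
concatenation order here are illustrative only, *not* the normative
EXTRA3 encoding):

```python
def full_regnum(base5, extra3):
    # concatenate a 5-bit base register field with its EXTRA3 bits
    # to form the extended register number (layout is illustrative,
    # not the normative SVP64 EXTRA3 encoding)
    return (extra3 << 5) | base5

def ldst_update_ok(ra, ra_extra, rt, rt_extra):
    # the LD/ST-update invalid-form check must compare the
    # *extended* register numbers, not just the 5-bit fields
    return full_regnum(ra, ra_extra) != full_regnum(rt, rt_extra)

# same 5-bit field, different EXTRA3 bits: a naive 5-bit-only
# "assert RA != RT" would wrongly reject this pair
print(ldst_update_ok(3, 0, 3, 1))  # → True (registers 3 and 35 differ)
```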
Comment 3 Luke Kenneth Casson Leighton 2023-05-15 12:54:37 BST
bug #1047 - linked-list pointer-chasing works with LD/ST-immediate including
LD-immediate-update.  still TODO is Indexed.
Comment 4 Paul Mackerras 2023-05-25 06:19:24 BST
Overall my main comment would be that this doesn't read like an architecture specification; rather it reads more like a journal article. It needs to state precisely and concisely and completely how the system operates. It doesn't need to argue for it or against other systems.

Title: "SVP64 Zero-Overhead Loop Prefix Subsystem" seems fine to me.

The RFC needs to be based on v3.1B rather than v3.0B (particularly since it depends on prefixed instructions).

Introduction section:

1st paragraph: The comparison with Z80 or 8086 instructions isn't appropriate or helpful. If you really want to do that then we need to have a references to bibliography entries that reference specific Z80 or 8086 architecture documents. In any case what you are doing here is far beyond anything that Z80 or 8086 ever did, so pointing to the descriptions of those instructions doesn't get us very far. Also, many (most?) readers will never have used a Z80 or 8086, so the reference is not illuminating.

In the last sentence you introduce a way of viewing this stuff only to then say that that isn't a good way to view it. So why introduce it at all?

The first paragraph just needs to say that this system is based on the idea of repeatedly executing a specific scalar instruction across a series of registers. Be specific and factual about what Simple-V does.

2nd paragraph: It is unacceptable in my opinion to use such a completely different convention here from the rest of the ISA. It can only cause massive confusion.

3rd paragraph: "Defined word" is not a concept in the architecture. "Defined word instruction" is, but that is to be parsed as (defined (word instruction)) not ((defined word) instruction).

In this and subsequent paragraphs, you keep hammering on the idea that the underlying scalar instruction is (a) a word instruction and (b) completely unchanged in its behaviour. Neither is in fact true. As to the unchanged behaviour, it is not clear why that would be so crucially important even if it were true.

Instead it would be far more helpful to put a complete list of the ways that the behaviour of the underlying instruction is (or can be) changed. (Note that such a list would then automatically prohibit any other behaviour changes.) The list would include changes to source and destination register numbers, bit-width of operation, suppression of operation (due to predication), writing a zero result (due to predication), arithmetic saturation, whatever "vectorised branch-conditional" is, and whatever else Simple-V allows that I have missed. Forward cross-references would probably be helpful here.

4th paragraph: the word "apparent" trips me up, because it leads me to expect that you will explain why these are not in fact real exceptions, but you don't, and in fact seem to end up saying that they are real exceptions (but that's good). Also, if you intend to remove post-increment, then just remove it for now and add it back later if you change your mind.

5th paragraph: You really can't prohibit future architects from doing anything. If nothing else, they could just remove this paragraph. At most you can say that doing certain things in future is likely to cause difficulties for implementers of future versions of the architecture. (You can't really even be sure about that, because future implementers might invent amazing new implementation techniques that would easily overcome any difficulty you propose.)

This paragraph is a bit like Parliament passing a law saying that Parliament may not in future pass a law allowing a particular thing. It's futile and ineffective.

6th paragraph: Mention of subset implementations is good. I have some reservations about overloading the term "Compliancy Subsets" since that already has another meaning, and to use it for two different things could be confusing.

I think it would be helpful here to mention briefly at least that the number of GPRs, FPRs and CRs gets increased in some of the levels of Simple-V.

To be continued...
Comment 5 Paul Mackerras 2023-05-25 06:21:52 BST
"SVP64 encoding features" section:

This is bewildering, because you plunge into details which are unexplained at this point, and the reader hasn't yet even been told that all SVP64 instructions are prefixed instructions or that bits in the prefix control various aspects of the iteration of the underlying instruction. So what are the "only 24 bits"?

Also, saying "need to be" implies that we are still at the design stage. But we're not; this document should be written for the situation where SVP64 is defined, known and accepted, and people just need to know exactly what it is.

Using the word "features" in the section heading seems to imply that you are still trying to sell people on the concept. Assume that they are already sold and now need to know the precise details of what it is. I suggest the section heading should be "SVP64 instruction encoding" instead.

At this point, what readers need to understand is the following:

* SVP64 instructions are always prefixed (64 bit) instructions.
* The prefix word controls various aspects of the vectorisation, and is defined below in this section.
* The suffix word is either a defined word instruction, or is the suffix of an instruction that has a PO9 scalar prefix (we may need a better term than "PO9 scalar prefix"). The two cases are distinguished by a bit in the SVP64 prefix (see below).
* 64-bit instructions other than those with a PO9 scalar prefix cannot be vectorised by SVP64.

Then there needs to be an explanation of the role of bits 6 and 7 in the SVP64 prefix, perhaps with a diagram. There needs to be an explanation somewhere else of the PO9 scalar prefix (perhaps in Book I Chapter 1), which you need to write and either include in this RFC or in a new RFC. You would then cross-reference to that description.

Then you can state that in order to fit the vectorisation parameters into bits 8 - 31 of the SVP64 prefix, several different formats are defined. Include a cross-reference to where those formats are defined, and indicate which bits of the prefix indicate which format is being used. Ideally you would give a table listing all of the defined formats and the bit values in the prefix that indicate each format.

The list of features you give is really more a list of vectorisation features. I suggest that such a list would be better in the Introduction, and should be a complete list with cross-references to where those features are defined.

This is probably the place also to specify that the suffix word cannot be an unvectoriseable word instruction (in the case where the prefix indicates that the suffix is a defined word instruction), define which instructions are considered unvectoriseable, and state that the consequence of attempting to execute an unvectoriseable instruction with a SVP64 prefix is a Hypervisor Emulation Assistance Interrupt (not Illegal Instruction Trap, which doesn't exist in the architecture).

By the way, I expect that we will need to convert to American "vectorize" rather than "vectorise". (On the general subject of -ise vs. -ize endings, the original edition of Fowler's Dictionary of Modern English Usage says "Most English printers follow the French practice of changing -ize to -ise; but the OED of the Oxford University Press, the Encyclopaedia Britannica of the Cambridge University Press, The Times, & American usage, in all of which -ize is the accepted form, carry authority enough to outweigh superior numbers.")
Comment 6 Paul Mackerras 2023-05-25 06:23:08 BST
"Definition of Reserved" section:

I have reservations about this. Generally in the architecture if you want this behaviour of taking a HEAI (Hypervisor Emulation Assistance Interrupt) on things that aren't recognized, you define an instruction field to cover the bits and specify a set of defined values for it. Then any other value in the field becomes a reserved value for the field (Book I section 1.3.3), and an instruction containing a reserved value in a defined instruction field is an invalid form. Book III section 7.5.18 then says that a HEAI may be generated when execution of it is attempted.

However, experience has shown that it is often more useful to define reserved fields, which must be ignored by the processor. We got into trouble with mfocrf, which was introduced in POWER4, because POWER3 didn't ignore reserved fields. The mfocrf instruction looks similar to mfcr except that it has non-zero values in some bits that were reserved in the pre-POWER4 definition of mfcr, and mfocrf was defined such that doing what mfcr does is an acceptable implementation of it. Thus if POWER3 had correctly ignored reserved instruction fields, we could have started using mfocrf in binaries without worrying about whether they might get run on a POWER3 or not. Instead, we had either to avoid mfocrf (missing optimization opportunities on POWER4 and later) or include alternate code paths for POWER3 vs. >= POWER4 (which adds overhead).

Also, the requirement for a trap when writing ones to reserved bits in SPRs is problematic. At the very least you need to exempt privileged code from this requirement (to avoid taking traps in context-switch code, where it is often quite difficult to handle traps sanely). I would strongly suggest specifying that the HEAI is to be taken when executing the next SVP64 instruction, rather than on the mtspr. (If you do insist on it being taken on the mtspr then you will need to add a case to Book III section 7.5.18.)
Comment 7 Paul Mackerras 2023-05-25 06:23:31 BST
"Definition of UnVectoriseable" section.

First off, let's not do CamelCase.

This material belongs in the SVP64 Instruction Encoding section, as previously mentioned.

The "Architectural Note" is really an "Engineering Note".

Eek, are you really saying that I can't work out how to decode the prefix without looking at the suffix? Ewwwwww....
Comment 8 Paul Mackerras 2023-05-25 06:24:27 BST
"Definition of Strict Program Order" section:

This material is a bit hard for the reader to understand when you haven't yet introduced any details about the iteration of the underlying instruction. I think this should wait until after you have talked more about how the iteration occurs.

1st paragraph: you define "Strict Program Order". How does this relate to other similar concepts defined in the architecture, specifically the sequential execution model defined in Book I section 2.2, "program order" defined in Book II section 1.1, and the permitted deviations from the sequential execution model described in Book III section 1.2.3? Is your term the same as any of those, or if not, how does it differ? I strongly recommend using the existing terms if at all possible.

2nd and 3rd paragraphs: the comparison and contrast with other ISAs is not appropriate or helpful. The architecture needs to stand on its own feet and not require references to other architectures. Also, the Power ISA usually uses the term "Current Instruction Address" rather than "Program Counter", so the term "Sub Program Counter" should be replaced by "Current Iteration Counter" or similar.

I think what you are trying to say here is that (a) Simple-V defines a specific order to the iterations it performs, (b) the iterations must appear to the program to be executed in that order, and (c) the Current Iteration Counter is exposed in the SVSTATE SPR, allowing interrupts to be taken within the sequence of iterations. (Point (c) may be Book III material in fact -- what happens if a program explicitly sets the Current Iteration Counter in SVSTATE to a non-zero value before executing an SVP64 instruction?)

Saying "Simple-V has been carefully designed" sounds self-congratulatory. I think you're just saying that all of the state needed to resume a sequence of iterations after an interruption at any point is available in the SVSTATE SPR. Is there more to it than that?

4th paragraph: Interrupts are Book III material, strictly speaking. You probably need to add a subsection somewhere in Book III Chapter 7 talking about interruption of SVP64 iterations.

You say "and the four REMAP SPRs if in use at the time". How is an interrupt handler to know whether the REMAP SPRs are in use?

Saying "Whilst this initially sounds unsafe ..." assumes a certain ignorance on the part of the reader which may not be justified. Better would be a Programming Note that says something along the lines of "Interrupt handlers and function prologs should generally avoid using SVP64 instructions until after Simple-V architected state has been saved to memory", with the possible addition of "just as they avoid using Floating-Point, Vector or VSX instructions until after the associated state has been saved to memory".

Saying "which is very rare for Vector ISAs" is at risk of becoming dated, and doesn't help our understanding of Simple-V itself.

5th paragraph: Defer this to the description of Parallel Reduction. It's not an exception to the general rule, and we don't know what Parallel Reduction is at this point, so it's confusing.

6th paragraph: Again, this is not an exception to the general rule, and we don't know how the remapping works at this point, so defer this to the description of remapping.

7th paragraph: stated more concisely, it seems that what you're saying is that using CR predication on an instruction that modifies CR gives boundedly undefined behaviour. Is that correct? If so it needs to be stated (or re-stated) at the point where CR predication is defined. (You could alternatively define that as an invalid instruction form (see Book I section 1.8.2), allowing an HEAI to be generated.)
Comment 9 Paul Mackerras 2023-05-25 06:26:17 BST
"Register files, elements, and Element-width Overrides" section:

I strongly disagree that the register file should be accessed in little-endian byte order when the processor is in big-endian mode. Requiring that will make Simple-V practically unusable in big-endian mode (just as saying that the register file has to be big-endian always would make Simple-V unusable in little-endian mode).

Instead the register file should be accessed in the endian mode of the processor, because that means that the relationship between array indices in memory and element numbers in the register file is the identity mapping (or at worst just the addition of a constant), regardless of endian mode, assuming that ordinary ld/std (or lfd/stfd) instructions are used to transfer data between registers and memory.

Elements are not unbounded arrays - there are only a finite number of them that exist. You don't specify what happens if you run off the end of the register file. The architecture needs to specify that.

The third dot point is not clearly expressed. I think it means that element-width overrides cause the register file to be considered as a linear array of chunks of that width (but the register number specified in the instruction is still interpreted in 64-bit units, right?).

2nd & 3rd paragraphs: in ANSI C, are you sure that indexing beyond the bounds of a union is defined behaviour? I'm not, and if it isn't, then ANSI C isn't the best language to use here.

The sentence about VSX along with the "Future specification note" don't add anything useful, so drop them. Also, "bounded fixed" sounds awkward.

5th paragraph: As I said, I think this is entirely the wrong choice.

6th paragraph: I think that by "Arithmetic and Logical" you really mean "Fixed-Point" (as opposed to Floating-Point or Branch). There are floating-point Arithmetic instructions.

6th & 7th paragraphs: Once again you are hammering the point that the instruction is not altered, but it is not clear why you need to belabour this.

Is it possible to request saturating arithmetic with the SVP64 prefix on an addi instruction? If so then that would certainly count as a fundamental change to the operation being performed.

In fact I think that changing the element width is a fundamental change. But that's fine - the important thing is that the changes to the instruction's operation are precisely and fully defined. Saying "nothing is changed" is obviously not precise or complete, some things are changed (e.g. element width). Saying "nothing fundamental is changed" doesn't help either, because there is no definition of "fundamental".

8th paragraph (starting "Element offset numbering"): I have no idea what "LSB0-sequentially-incrementing" means as opposed to "MSB0-incrementing". You seem to be projecting your confusion and lack of comprehension of big-endian bit numbering onto the reader here. (Note that I agree that little-endian bit numbering is generally easier to use than big-endian bit numbering. But big-endian bit numbering is a perfectly well-defined and self-consistent convention.)

9th paragraph: This is all written from the perspective of someone who only understands little-endian memory addressing. There are people, including much of the audience for this ISA, who are very used to big-endian bit and byte ordering and find it more natural and understandable than little-endian.

Trying to use a different convention in this part of the ISA from all of the rest of it risks unending confusion unless it is very carefully handled. It should be possible to define the element addressing in one place and then just use element addresses from then on. The element addressing function should hopefully be the only place where bit numbers need to appear.

10th paragraph: why is the similarity to the VSX definition important?

This section seems to be using a lot of words to try to express the idea that the set of elements operated on by the iterations of a SVP64 instruction might span several GPRs and FPRs. I think this is all made more difficult by the fact that we haven't yet really been told how the iterations work, or even given much idea of how many iterations there might be.

Page 5: now we've switched to python syntax, and we have this new term "VL" that we have no idea about, since you haven't yet talked about it. Is it absolutely necessary to use a new syntax here instead of expressing the loop the way loops are expressed in RTL elsewhere in the ISA?

page 5, 3rd paragraph: you don't need to say "deliberate and intentional" in bold as if that was somehow surprising. Hopefully everything in the ISA is deliberate and intentional. This whole paragraph seems to be a reiteration of what has been previously defined (and if it isn't then the previous definitions should be improved).

The point about not modifying parts of a register that don't hold an element that is part of the iteration is important, and I think the only new point here.

The rest of the material including the table showing the effect of the example is explanatory rather than normative, and as such should go in a Programming Note rather than the main text.

to be continued...
Comment 10 Paul Mackerras 2023-05-25 06:27:18 BST
Comments from IBM architects about LS010:

* Attempts to redefine “Reserved” more strictly for scope of SimpleV, which won’t be popular.
* Also seems to redefine bit/byte/element order compared to existing architecture.
  This will be unacceptable except possibly within the confines of the SimpleV chapter.
* Cross references that would make this comprehensible seem challenging.
* Same comments as the other SimpleV component RFCs.

The successful (with respect to being accepted as an architecture extension) combined RFC must have an organized flow to the description and must be defined from primitive concepts understood by the existing architecture or must define the concepts (e.g., not depend on references to other architectures or documentation). It must completely describe changes required in the non-SimpleV chapters to mate up with what’s going on in the new chapter. A significant part of this is to describe all of the changes to the sequential execution model and the interrupt chapter as grafts into the existing chapters.
Comment 11 Paul Mackerras 2023-05-25 06:38:23 BST
Questions about vertical-first mode:

In a VF loop, how does each instruction indicate which operands should be considered as vector operands? I think you would often want to have vector operands in different places in different instructions in the loop.  For example, one instruction might want to treat RA and RB as vector (but not RT) and the next might want to treat RT and RA as vector (but not RB).

Is it the case that a 32-bit instruction with no SVP64 prefix within a VF loop would have some Simple-V vectorization modifications applied to it? For example, would an add instruction (just the 32-bit add instruction, no SVP64 prefix) be subject to the element size indicated in the SVP64 word (prefix) at the start of the VF loop?

Or is it the case that if you want any vectorization modifications (e.g., element size specification) applied to an instruction, then you have to give it a SVP64 prefix, and if it doesn't have such a prefix, then it behaves exactly in every respect as it would if it were not in a VF loop?
Comment 12 Luke Kenneth Casson Leighton 2023-05-25 13:26:42 BST
(In reply to Paul Mackerras from comment #5)
> "SVP64 encoding features" section:
>
> At this point, what readers need to understand is the following:
> 
> * SVP64 instructions are always prefixed (64 bit) instructions.

there aren't any SVP64 instructions. there *really is* only
"32-bit Prefix followed by any Scalar 32-bit instruction".

it is absolutely prohibited to Prefix an UNDEFINED 32-bit word,
and it is absolutely prohibited to assume that just because
any given UNDEFINED 32-bit word has a prefix in front of it,
it can be ALLOCATED a new meaning.

i go over this in more detail here:
    https://libre-soc.org/openpower/sv/rfc/ls001/
    "Example Legal Encodings and RESERVED spaces"

> * The prefix word controls various aspects of the vectorisation, and is
> defined below in this section.
> * The suffix word is either a defined word instruction,

it is *always* a Defined-word-instruction, where that DWI **MUST**
have the exact same definition as if it had no prefix at all
(caveat: bar the niggles on elwidth).

i.e.: all operands MUST NOT change meaning, just because of prefixing



let us take addi as an example (because of paddi)
See Public v3.1 I 3.3.9 p76

here are the operands:

addi:

    DWI   : PO=14  RT RA SI

paddi: 

    Prefix: PO=1   R     si0
    Suffix: PO=14  RT RA si1

here is the RTL:

    if “addi” then
        RT ← (RA|0) + EXTS64(SI)
    if “paddi” & R=0 then
        RT ← (RA|0) + EXTS64(si0||si1)
    if “paddi” & R=1 then
        RT ← CIA + EXTS64(si0||si1)


what ABSOLUTELY MUST NOT HAPPEN is:

    else if “sv.addi” then
        RT ← 55 * (RA|0) + EXTS64(SI) << 5 + 99

and even more insane and damaging:

    else if “sv.addi” then # actually redefined as multiply
        RT ← (RA|0) * EXTS64(SOME_TOTALLY_DIFFERENT_OPERAND)

it should not really even be necessary to put this into the addi spec:

    SVPrefix: PO=9  (24-bit RM)
    Suffix  : PO=14 (RT RA SI)

why not? *because there is no change*: it is prohibited!

think about it.  if it was allowed, then even the spec goes
haywire.  pages and pages of:

    addi:      {PO14}     DWI addi
       DWI   : PO=14 (RT RA SI)

    paddi:            uses {PO1-PO14} but still "addi" (in effect)
       Prefix: PO=1  ( R si0)
       Suffix: PO=14 (RT RA si1)

    sv.somethingelseentirely  {PO9-PO14} conflicts with addi?? WTF????
       SVPrefix: PO=9  (then 24-bit RM)
       Suffix  : PO=14 (RT RA differentoperand)     XYZ99-Form


total nightmare just from a spec-writing perspective,
as well as damaging Decode.

the only thing you are "allowed" to do is extend the reg numbers.
i do appreciate that given that PO1 does in fact "change" immediate
operands (si0 || si1), flipping from using SI to instead using
split-field si0-si1, it seems like i am freaking out, but i really
*really* don't want to hit the Power ISA Spec with 200+ changes
to the RTL and instruction definitions.

if you really really are asking for split fields to be introduced
for RT, RA, RS, RB, BA, BFA, FRS, FRC, BT (basically everything)
then i feel the entire suite - over 200 {PO9-DWI}s - should be
autogenerated.
Comment 13 Luke Kenneth Casson Leighton 2023-05-25 14:48:38 BST
i started on an illustrative page, which gives some idea as to
what addi would be turned into (if done in a naive but "at-the-trenches"
way).  bear in mind that *200* of these modifications would be needed (!!!)

   https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/

the absolute most important thing - not really emphasised enough before - is
that the operands MUST NOT change (bits, meaning)

it is ABSOLUTELY PROHIBITED to have this, just because {PO14} is {PO9} prefixed:

    sv.addi RT,RA,SI,R

aside from anything, where on earth would you get the contents of the "R"
field from? there's certainly no room in the *PO9-Prefix* bits.

i did actually try this once (redefinition of the Suffix operands based
on Prefixing).  it was awful.  six months later i was still unwinding
the damage done.
Comment 14 Luke Kenneth Casson Leighton 2023-05-25 16:10:49 BST
(In reply to Paul Mackerras from comment #4)

> In this and subsequent paragraphs, you keep hammering on the idea that the
> underlying scalar instruction is (a) a word instruction 

this is mandatory and inviolate, yes.

> and (b) completely unchanged in its behaviour.

except for elwidth-overrides, yes.

it goes like this:

  for i in range(VL):
      {DO_SCALAR_DEFINED_WORD_INSTRUCTION}

where predication would be:

  for i in range(VL):
      if predicate_mask[i]:
          {DO_SCALAR_DEFINED_WORD_INSTRUCTION}

the instruction *hasn't* changed - not in the slightest bit - just because it
is predicated.

> Neither is in fact true. As to the unchanged
> behaviour, it is not clear why that would be so crucially important even if
> it were true.

dramatic simplification of Decode Phase.

> Instead it would be far more helpful to put a complete list of the ways that
> the behaviour of the underlying instruction is (or can be) changed.

strictly-speaking: element-width overrides, and that's it.
that's the *only* change, and that's covered by ls005.xlen.

  for i in range(SVSTATE.VL):
      if PREFIX.elwidth == 8:
          {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=8-bit}
      else if PREFIX.elwidth == 16:
          {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=16-bit}
      else if PREFIX.elwidth == 32:
          {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=32-bit}
      else if PREFIX.elwidth == 64:
          {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=64-bit}

that *still* does not imply that the actual Defined Word Instruction
itself "changed" [became a multiply instead of an add, had access
to other operands, etc.]

the looping *REALLY IS* a completely separate concept from the
{execution-of-the-Defined-Word-Instruction} - it *REALLY IS*
a Sub-Current-Execution-Counter (CIA.SVSTATE)


(strictly-speaking not so for Register numbers, which technically
 become "split fields" - but i demonstrate here why attempting to
 put those into the actual RTL - of every single damn instruction -
 would be absolute hell
 https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/ )
Comment 15 Luke Kenneth Casson Leighton 2023-05-25 18:08:26 BST
(In reply to Paul Mackerras from comment #9)
> "Register files, elements, and Element-width Overrides" section:
> 
> I strongly disagree that the register file should be accessed in
> little-endian byte order when the processor is in big-endian mode. Requiring
> that will make Simple-V practically unusable in big-endian mode (just as
> saying that the register file has to be big-endian always would make
> Simple-V unusable in little-endian mode).

(i am assuming you are referring to register-to-register operations
 which in and of itself is a massive "ask")

this in effect is equivalent to asking for the PC to run backwards,
because of the sub-loop over 0..VL-1 running backwards (VL-1 downto 0)

but here's the thing: if you have a src and a destination reg and
you go in reverse, on both, it makes no difference in most circumstances!

    for i in 0..VL-1
         GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)

is no different from:

    for i in VL-1 DOWNTO 0
         GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)

as long as there is no overlap on RT..RT+VL-1 with RA..RA+VL-1
then back-end Hardware will actually not care *at all*.
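
the no-overlap equivalence is trivially checkable (a minimal Python sketch, names invented):

```python
# sketch: with no overlap between RT..RT+VL-1 and RA..RA+VL-1,
# ascending and descending element order produce identical results.
def sv_addi_ascending(gpr, rt, ra, si, vl):
    for i in range(vl):                 # 0 .. VL-1
        gpr[rt + i] = gpr[ra + i] + si

def sv_addi_descending(gpr, rt, ra, si, vl):
    for i in reversed(range(vl)):       # VL-1 downto 0
        gpr[rt + i] = gpr[ra + i] + si
```

with disjoint source and destination ranges the two orders are indistinguishable, which is exactly why back-end hardware is free to execute elements in any order (or in parallel).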

what you are MUCH more likely to be expecting is:

    for i in 0..VL-1
         reversed_i = VL-1-i
         GPR(RT+i) <- GPR(RA+reversed_i) + EXTS64(SI)

and the first most crucial thing is, how the hell is that even possible
to express when there are no spare bits in the Prefix?

we went through this exercise over 2 years ago, it was so complex i
had to put my foot down and say NO.  instead going with REMAP to
perform these types of "optional inversions".


if you really need to load in backwards order, you can always
use LD/Immediate with a negative immediate.

    sv.ld/els *RT, -8(*RA)


or use REMAP to load in element-reversed order but access memory
in positively-incrementing order, by applying REMAP with reverse
to *RT, but not to *RA.

but this is down to the Programmer.

a reversed order messes with Data-Dependent Fail-First.  clearly
these are not the same:


    for i in VL-1 DOWNTO 0      
         GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
         CR.field[i] = CCode_test(GPR(RT+i))
         if DDFFirst:
             if FAILED(CR.field[i]):
                 VL=i
                 break

vs:

    for i in 0..VL-1
         GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
         CR.field[i] = CCode_test(GPR(RT+i))
         if DDFFirst:
             if FAILED(CR.field[i]):
                 VL=i
                 break

i don't even want to go there in working through that, rewriting
everything, implementing it, making sure i am happy with it.
just... no.  it's far too much, far too late.
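
to make the difference concrete, here is a hedged Python sketch of Data-Dependent Fail-First under the two loop orders (`ddff_addi` and its arguments are invented for illustration): the truncated VL can coincide, yet *which* elements were written differs.

```python
# sketch of DD-FFirst under ascending vs descending element order.
def ddff_addi(gpr, rt, ra, si, vl, order, fails):
    """fails(value) is the CR condition test; returns the truncated VL."""
    indices = range(vl) if order == "asc" else reversed(range(vl))
    for i in indices:
        gpr[rt + i] = gpr[ra + i] + si
        if fails(gpr[rt + i]):
            return i        # VL = i
    return vl
```

same failing element, same truncated VL, but the ascending loop has written elements 0..VL-1 while the descending loop has written the *top* elements: the two semantics are irreconcilable.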


> 
> because that means that the relationship between array indices in
> memory and element numbers in the register file is the identity mapping 

regardless of MSR.LE or MSR.BE,  it already is, Paul.  by definition.
see the Canonical definition in c.
which (reminder) is numbered in LSB0 order (because it's c)

   array index 0 === element index 0 === SVSTATE.srcstep=0
   (this only works in LSB0 numbering)


now, what i *don't* have a problem with is *someone else* doing
an independent Research Project into reverse-element ordering.
one tip i would advise them to consider is, experiment with
(new) MSR bits (or other state), don't overload MSR.LE as it
is specifically associated with Memory-to-Register.

(not even VSX has Register-to-Register element-inversion dependent
 on MSR.LE, Brad made that clear to me a few months back)



> Elements are not unbounded arrays - there are only a finite number of them
> that exist. You don't specify what happens if you run off the end of the
> register file. The architecture needs to specify that.

raises an illegal-instruction trap for emulation (greater than 128 GPRs,
or for Embedded which will only have 32).  will make a note in comment #0

i am sure it is written somewhere. probably the appendix.

 
> The third dot point is not clearly expressed. I think it means that
> element-width overrides cause the register file to be considered as an
> linear array of chunks of that width

yes. think in terms of a byte-addressable SRAM, 64-bit-wide, where
elwidths cause elements to wrap sequentially and contiguously.

(i am getting real fed up of rewording this btw.)


> (but the register number specified in
> the instruction is still interpreted in 64-bit units, right?).

yes.

    uint64_t  GPR[128];                      // "the 64-bit units"
    uint8_t  *ew8  = (uint8_t  *)&GPR[RT];   // linear array of 8-bit chunks
    uint16_t *ew16 = (uint16_t *)&GPR[RT];   // linear array of 16-bit chunks
    uint32_t *ew32 = (uint32_t *)&GPR[RT];   // linear array of 32-bit chunks
    uint64_t *ew64 = (uint64_t *)&GPR[RT];   // ... default 64-bit
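
the same byte-addressable-SRAM view can be demonstrated in Python (a sketch; `elwidth_view` is an invented helper, using little-endian/LSB0 byte order per the Canonical definition):

```python
import struct

# sketch of the "byte-addressable SRAM" view: the GPR file re-interpreted
# as a linear, contiguous array of elwidth-sized chunks (LSB0 order),
# mirroring the C casts above.
def elwidth_view(gprs, elwidth):
    """return the register file as a flat list of elwidth-bit elements."""
    raw = b"".join(struct.pack("<Q", g) for g in gprs)   # LSB0 byte order
    step = elwidth // 8
    fmt = {8: "B", 16: "<H", 32: "<I", 64: "<Q"}[elwidth]
    return [struct.unpack(fmt, raw[o:o + step])[0]
            for o in range(0, len(raw), step)]
```

element 0 is always the least-significant chunk of the first 64-bit register, and elements wrap sequentially and contiguously into the next register.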




> 2nd & 3rd paragraphs: in ANSI C, are you sure that indexing beyond the
> bounds of a union is defined behaviour? 

last time i tried it, it worked perfectly. now, linux kernel
had all unbounded arrays *removed* because llvm was bitching,
so it *may* only be supported by gcc.

suggestions on appropriate syntax appreciated.
Comment 16 Luke Kenneth Casson Leighton 2023-05-25 18:30:10 BST
(In reply to Paul Mackerras from comment #8)
> "Definition of Strict Program Order" section:

> not require references to other architectures. Also, the Power ISA usually
> uses the term "Current Instruction Address" rather than "Program Counter",
> so the term "Sub Program Counter" should be replaced by "Current Iteration
> Counter" or similar.

i buy that. hmm are there any places where CI-SVSTATE needed to
know about NI-SVSTATE?

 
> I think what you are trying to say here is that (a) Simple-V defines a
> specific order to the iterations it performs,

yes.

> (b) the iterations must appear
> to the program to be executed in that order,

yes.

> and (c) the Current Iteration
> Counter is exposed in the SVSTATE SPR, allowing interrupts to be taken
> within the sequence of iterations. 

yes.  this is more obvious in Vertical-First Mode because SVSTATE.srcstep
and dststep etc. do not change (until executing svstep).


> (Point (c) may be Book III material in fact

i don't mind, if that's where it's best put.


> -- what happens if a program explicitly sets the Current Iteration
> Counter in SVSTATE to a non-zero value before executing an SVP64
> instruction?)

this is exactly what a Context-Switch Handler or a function call does!
therefore i had to both define it *and* then implement unit tests,
checking it.

answer: Sub-execution continues from whatever the CPU reads and interprets
from SVSTATE!

if SVSTATE.srcstep=1 then element 0 is *NOT EXECUTED*, because you
just very clearly requested execution to begin from element **1**.

it really is a loop, but implemented as a preemptive interruptible
Finite State Machine.  the ISACaller Simulator is hell to understand.
only 1000s of unit tests are holding it together.
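
a minimal sketch of that interruptible-FSM behaviour (Python; `sv_step_loop` and the dict-based SVSTATE are invented for illustration, nothing like the real ISACaller internals):

```python
# sketch: the loop as a preemptive, interruptible Finite State Machine.
# execution resumes from whatever srcstep the CPU reads out of SVSTATE,
# so srcstep=1 really does skip element 0.
def sv_step_loop(gpr, rt, ra, si, vl, svstate):
    executed = []
    while svstate["srcstep"] < vl:
        i = svstate["srcstep"]
        gpr[rt + i] = gpr[ra + i] + si
        executed.append(i)
        svstate["srcstep"] += 1        # state survives an interrupt here
    svstate["srcstep"] = 0             # loop complete: reset
    return executed
```

a context-switch handler that saves and restores SVSTATE mid-loop gets exactly this resume-from-srcstep behaviour for free.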
Comment 17 Luke Kenneth Casson Leighton 2023-05-28 12:29:37 BST
(In reply to Paul Mackerras from comment #4)

> As to the unchanged
> behaviour, it is not clear why that would be so crucially important even if
> it were true.

"catastrophic", not "crucial"

https://youtu.be/NKhr3NVevtc

Variable-Length Identification becomes so complex that it catastrophically
damages the product viability of Multi-Issue implementations of the ISA.

two other ISAs have Variable-Length Encoding that is similarly
crippled: Intel x86 and Motorola's 68000.

only by having vast financial resources was Intel able to develop
**SPECULATIVE** Parallel CISC Decoding that, after many many cycles,
finally allowed the length to be identified, such that the speculative
undesired decodes could at last be dropped, and only then passed
in to Register & Memory Hazard Management.

as this was so awful, Intel was then forced to further implement
realtime JIT translation to internal RISC Micro-code plus caching
lookups just to avoid this situation.


> 5th paragraph: You really can't prohibit future architects from doing
> anything. If nothing else, they could just remove this paragraph. At most
> you can say that doing certain things in future is likely to cause
> difficulties for implementators of future versions of the architecture.

nowhere near strong enough.  "result in the entire ISA being eliminated
from product consideration" would be more accurate.  look at the
complexity above: that's what would need to be done.  no "beginner"
team would bother learning the full details of how to achieve
competitive performance.

> (You
> can't really even be sure about that, because future implementers might
> invent amazing new implementation techniques that would easily overcome any
> difficulty you propose.)

no, they won't.  complexifying length identification is a fundamental
barrier.  Decode is one of the defining characteristics of an ISA as to
what products it ends up in.


> This paragraph is a bit like Parliament passing a law saying that Parliament
> may not in future pass a law allowing a particular thing. It's futile and
> ineffective.

then the *understanding* is required alongside suitably transparent
(i.e. not in IBM-Confidential files) supporting documentation as
to why it would be catastrophic.
Comment 18 Luke Kenneth Casson Leighton 2023-05-28 13:44:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)

> a reversed order messes with Data-Dependent Fail-First.  clearly
> these are not the same:
> 
> 
>     for i in VL-1 DOWNTO 0      
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
>          CR.field[i] = CCode_test(GPR(RT+i))
>          if DDFFirst:
>              if FAILED(CR.field[i]):
>                  VL=i
>                  break
> 
> vs:
> 
>     for i in 0..VL-1
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
>          CR.field[i] = CCode_test(GPR(RT+i))
>          if DDFFirst:
>              if FAILED(CR.field[i]):
>                  VL=i
>                  break

correction: this is much worse than it seems, starting here:
 
   for i in 0..VL-1
          # WRONG? reversed_i = VL-1-i
          reversed_i = MAXVL-1-i # right?? maybe??
          GPR(RT+i) <- GPR(RA+reversed_i) + EXTS64(SI)

but now DD-FFirst is catastrophically damaged.  elements are no longer
numbered independent of the Vector Length.

LSB0 numbering, if MAXVL=8,VL=8:

    Element numbering : 7 6 5 4 3 2 1 0
    Execution order   : 7 6 5 4 3 2 1 0

LSB0 numbering, if MAXVL=8,VL=5:

    Element numbering : 7 6 5 4 3 2 1 0
    Execution order   : . . . 4 3 2 1 0

BE "meaning", LSB0 numbering, if MAXVL=8,VL=8:

    Element numbering : 7 6 5 4 3 2 1 0
    Execution order   : 0 1 2 3 4 5 6 7

BE "meaning", LSB0 numbering, if MAXVL=8,VL=5 if reversed_i=VL-i-1
    Element numbering : 7 6 5 4 3 2 1 0
    Execution order   : . . . 0 1 2 3 4

or if reversed_i=MAXVL-i-1

    Element numbering : 7 6 5 4 3 2 1 0
    Execution order   : 0 1 2 3 4 . . .

(either way the lack of identity-relationship between
Element-numbering and Execution order is a serious comprehension
barrier)

now bring DD-FFirst into play.  elements were tested in DESCENDING
order starting from the HIGHEST (LSB0) numbered element. VL was
truncated, now you have no idea where to start from because all
element positions have changed!

"fixing" that requires:

    for i in MAXVL-1 DOWNTO (MAXVL-VL):

at which point i am unable to cope.

all the developer has to do is load data using appropriate
Structure Packing, by deploying REMAP.  and to be honest if
data is that Dimensionally-challenging i don't think LE/BE
element ordering will help.
Comment 19 Luke Kenneth Casson Leighton 2023-05-29 13:03:54 BST
Created attachment 190 [details]
vle book 3.18.1 spr handling
Comment 20 Luke Kenneth Casson Leighton 2023-05-29 13:19:27 BST
(In reply to Paul Mackerras from comment #6)

> 
> Also, the requirement for a trap when writing ones to reserved bits in SPRs
> is problematic. At the very least you need to exempt privileged code from
> this requirement (to avoid taking traps in context-switch code, where it is
> often quite difficult to handle traps sanely). 

on the basis that supervisor code (fragile as this is) could end up
in recursive trap hell, which is really important to avoid

https://www.st.com/resource/en/reference_manual/rm0004-programmers-reference-manual-for-book-e-processors-stmicroelectronics.pdf

by chance i encountered some relevant wording on this in VLE Book 3.18.1

Invalid SPR references

System behavior when an invalid SPR is referenced depends on the privilege level.

* If the invalid SPR is accessible in user mode (SPR[5] = 0), an illegal instruction exception is taken.
* If the invalid SPR is accessible only in supervisor mode (SPR[5] = 1) and the core complex is in supervisor mode (MSR[PR] = 0), the results of the attempted access are boundedly undefined. 
* If the invalid SPR address is accessible only in supervisor mode (bit 5 of an SPR number = 1) and the core complex is not in supervisor mode (MSR[PR] = 1), a privilege exception is taken. These results are summarized in Table 58. 

       Table 58. System response to an invalid SPR reference

       SPR address bit 5    MSR[PR]           Response
       0 (User)             x                 Illegal exception
       1 (Supervisor)       0 (Supervisor)    Boundedly undefined
       1 (Supervisor)       1 (User)          Privilege exception

> I would strongly suggest
> specifying that the HEAI is to be taken when executing the next SVP64
> instruction, rather than on the mtspr. (If you do insist on it being taken
> on the mtspr then you will need to add a case to Book III section 7.5.18.)

no i agree the supervisor is off-limits (unless hypervisor could
handle it and even then the problem is simply "moved to HV" sigh)

for context: the Looping system *really is* an independent concept
critically and orthogonally relying on *Scalar* instructions only.
if there is anything in the spec that appears to imply otherwise then
it should immediately be assumed to be wrong, and i need to know where
so i can correct it.

therefore the guiding principle should be taken that the only changes
should be those that minimally get SimpleV Looping into the ISA,
but not at the expense of crippling it and its future interoperability
and expandability.
Comment 21 Luke Kenneth Casson Leighton 2023-05-29 13:52:06 BST
(In reply to Paul Mackerras from comment #8)

> Saying "Simple-V has been carefully designed" sounds self-congratulatory.

mmm... ah, i know why it seems that way.  the entire writing style
of the spec is "third person impersonal".  even hinting of the
*existence* of a person or persons who put thought into the wording
is anomalous.

a style guide document is very important here.

> I
> think you're just saying that all of the state needed to resume a sequence
> of iterations after an interruption at any point is available in the SVSTATE
> SPR. Is there more to it than that?

aside from REMAP, no there isn't! (it's taken literally years to think
that through).
 
> You say "and the four REMAP SPRs if in use at the time". How is an interrupt
> handler to know whether the REMAP SPRs are in use?
> 
> Saying "Whilst this initially sounds unsafe ..." 

interesting. turning uncertainty into certainty. again another one for
the style guide.

> assumes a certain ignorance
> on the part of the reader which may not be justified.

again it breaks the "third person impersonal" paradigm/style. fascinating.

>  A Programming Note
> that says something along the lines of "Interrupt handlers and function
> prologs should generally avoid using SVP64 instructions until after Simple-V
> architected state has been saved to memory", with the possible addition of
> "just as they avoid using Floating-Point, Vector or VSX instructions until
> after the associated state has been saved to memory".

which is direct, clear, and most importantly *entirely third person*.
 
> Saying "which is very rare for Vector ISAs" is at risk of becoming dated,
> and doesn't help our understanding of Simple-V itself.

again it's an "observation" which in turn is an "opinion" which implies
"a person" which is not third-person-impersonal.  i am spotting the
general theme here at last (hooray).
Comment 22 Paul Mackerras 2023-05-30 00:14:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)

> it is *always* a Defined-word-instruction, where that DWI **MUST**
> have the exact same definition as if it had no prefix at all
> (caveat: bar the niggles on elwidth).

What about saturating arithmetic?

Could the instruction in fact do nothing, because of predication?
Comment 23 Paul Mackerras 2023-05-30 01:17:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Paul Mackerras from comment #9)
> > "Register files, elements, and Element-width Overrides" section:
> > 
> > I strongly disagree that the register file should be accessed in
> > little-endian byte order when the processor is in big-endian mode. Requiring
> > that will make Simple-V practically unusable in big-endian mode (just as
> > saying that the register file has to be big-endian always would make
> > Simple-V unusable in little-endian mode).
> 
> (i am assuming you are referring to register-to-register operations
>  which in and of itself is a massive "ask")
> 
> this in effect is equivalent to asking for the PC to run backwards,
> because of the sub-loop over 0..VL-1 running backwards (VL-1 downto 0)

This is not at all what I meant. The sub-loop would still run 0..VL-1.

> but here's the thing: if you have a src and a destination reg and
> you go in reverse, on both, it makes no difference in most circumstances!
> 
>     for i in 0..VL-1
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
> 
> is no different from:
> 
>     for i in VL-1 DOWNTO 0
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
> 
> as long as there is no overlap on RT..RT+VL-1 with RA..RA+VL-1
> then back-end Hardware will actually not care *at all*.
> 
> what you are MUCH more likely to be expecting is:
> 
>     for i in 0..VL-1
>          reversed_i = VL-1-i
>          GPR(RT+i) <- GPR(RA+reversed_i) + EXTS64(SI)

No, that's not what I meant. If the element width is 64 bits then there would be no difference at all between BE and LE.

> and the first most crucial thing is, how the hell is that even possible
> to express when there are no spare bits in the Prefix?

It doesn't need any bits in the prefix.

> we went through this exercise over 2 years ago, it was so complex i
> had to put my foot down and say NO.  instead going with REMAP to
> perform these types of "optional inversions".

I would argue it's not an inversion and it's not optional.

> if you really need to load in backwards order, you can always
> use LD/Immediate with a negative immediate.
> 
>     sv.ld/els *RT, -8(*RA)
>
> or use REMAP to load in element-reversed order but access memory
> in positively-incrementing order, by applying REMAP with reverse
> to *RT, but not to *RA.
> 
> but this is down to the Programmer.
> 
> a reversed order messes with Data-Dependent Fail-First.  clearly
> these are not the same:
> 
> 
>     for i in VL-1 DOWNTO 0      
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
>          CR.field[i] = CCode_test(GPR(RT+i))
>          if DDFFirst:
>              if FAILED(CR.field[i]):
>                  VL=i
>                  break
> 
> vs:
> 
>     for i in 0..VL-1
>          GPR(RT+i) <- GPR(RA+i) + EXTS64(SI)
>          CR.field[i] = CCode_test(GPR(RT+i))
>          if DDFFirst:
>              if FAILED(CR.field[i]):
>                  VL=i
>                  break
> 
> i don't even want to go there in working through that, rewriting
> everything, implementing it, making sure i am happy with it.
> just... no.  it's far too much, far too late.

None of that is what I was suggesting. What you have written above seems to imply that you regard BE as "backwards". My suggestion was not about doing anything backwards or in reverse order. All your loops still go in the same order.

The only difference is the numbering of elements within a register; element 0 in a register would be the left-most element rather than the right-most.

> > because that means that the relationship between array indices in
> > memory and element numbers in the register file is the identity mapping 
> 
> regardless of MSR.LE or MSR.BE,  it already is, Paul.  by definition.
> see the Canonical definition in c.

If I have an array of (say) four 16-bit quantities in memory, and I load that into a register using an ld instruction, then in BE mode, array index 0 ends up in the most significant 16 bits of the register, and array index 3 ends up in the least significant 16 bits. You are insisting that the least significant 16 bits are element 0 from the point of view of the SV iterations; so now if I use VL=2 I end up working on array indices 3 and 2 rather than 0 and 1.

> which (reminder) is numbered in LSB0 order (because it's c)

No, C does not inherently assume LSB0 order.

>    array index 0 === element index 0 === SVSTATE.srcstep=0
>    (this only works in LSB0 numbering)
> 
> 
> now, what i *don't* have a problem with is *someone else* doing
> an independent Research Project into reverse-element ordering.
> one tip i would advise them to consider is, experiment with
> (new) MSR bits (or other state), don't overload MSR.LE as it
> is specifically associated with Memory-to-Register.
> 
> (not even VSX has Register-to-Register element-inversion dependent
>  on MSR.LE, Brad made that clear to me a few months back)

No it doesn't; and I am not suggesting element *inversion* either.

At this point, I will concede my failure to make my suggestion sufficiently clear, and drop it as being not worth the effort to persist with. I still think it is the correct approach, though.

[snip]

> > 2nd & 3rd paragraphs: in ANSI C, are you sure that indexing beyond the
> > bounds of a union is defined behaviour? 
> 
> last time i tried it, it worked perfectly. now, linux kernel
> had all unbounded arrays *removed* because llvm was bitching,
> so it *may* only be supported by gcc.

Doing something that is not defined behaviour can and often does work "perfectly" on some (or even most) implementations. I was asking a question about what the C standard says, not what gcc or llvm or any other implementation actually does.
Comment 24 Paul Mackerras 2023-05-30 08:16:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Paul Mackerras from comment #5)
> > "SVP64 encoding features" section:
> >
> > At this point, what readers need to understand is the following:
> > 
> > * SVP64 instructions are always prefixed (64 bit) instructions.
> 
> there aren't any SVP64 instructions. there *really is* only
> "32-bit Prefix followed by any Scalar 32-bit instruction".

The ISA as it stands has a property which is extremely useful, which is that with a couple of rare exceptions (see below), it is possible to analyse an instruction word (or doubleword) and know which CPU registers it is going to read and write, without knowing anything about the architected state of the CPU. This simplifies the job of anything that wants to translate or emulate instructions, or generally understand what the effect of a block of code could be or the dependencies between instructions. Examples include valgrind, qemu, gdb, etc.

[The exceptions are the lswx and stswx instructions, which use a byte count in the XER. The byte count controls the number of GPRs read or written. But modern compilers don't use lswx or stswx, and they always cause an alignment interrupt in LE mode.]

If a side effect of adopting Simple-V is that this property no longer holds, then that is a serious problem in my view. If it is the case that a 32-bit instruction without any prefix could in some cases access different registers from those identified in the instruction word, depending on the state of the CPU (for example, depending on what is in the SVSTATE SPR), then you have broken this property.

I had thought there would be a clear and simple way to tell which instructions would be affected by vectorization (i.e., the presence of a SVP64 prefix). But it sounds like that is not true, unfortunately.
Comment 25 Paul Mackerras 2023-05-30 08:34:30 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Paul Mackerras from comment #5)
> > "SVP64 encoding features" section:
> >
> > At this point, what readers need to understand is the following:
> > 
> > * SVP64 instructions are always prefixed (64 bit) instructions.
> 
> there aren't any SVP64 instructions. there *really is* only
> "32-bit Prefix followed by any Scalar 32-bit instruction".
> 
> it is absolutely prohibited to Prefix an UNDEFINED 32-bit word,
> and it is absolutely prohibited to assume that just because
> any given UNDEFINED 32-bit word has a prefix in front of it,
> it can be ALLOCATED a new meaning.

In all of that, I take it that you are talking about the cases where bit 6 of the SVP64 prefix is 1, or there is no SVP64 prefix. A SVP64 prefix with bit 6 equal to 0 does exactly what you are saying is absolutely prohibited, doesn't it? In that case the suffix *is* interpreted quite differently from any meaning it might have without the prefix.

> i go over this in more detail here:
>     https://libre-soc.org/openpower/sv/rfc/ls001/
>     "Example Legal Encodings and RESERVED spaces"
> 
> > * The prefix word controls various aspects of the vectorisation, and is
> > defined below in this section.
> > * The suffix word is either a defined word instruction,
> 
> it is *always* a Defined-word-instruction, where that DWI **MUST**
> have the exact same definition as if it had no prefix at all
> (caveat: bar the niggles on elwidth).

I think it's not just element width, it's also the possibility of doing multiple operations, and potentially not doing some or all of them, leaving the corresponding parts of the destination register unchanged.

> i.e.: all operands MUST NOT change meaning, just because of prefixing
> 
> 
> 
> let us take addi as an example (because of paddi)
> See Public v3.1 I 3.3.9 p76
> 
> here is the operands:
> 
> addi:
> 
>     DWI   : PO=14  RT RA SI
> 
> paddi: 
> 
>     Prefix: PO=1   R     si0
>     Suffix: PO=14  RT RA si1
> 
> here is the RTL:
> 
>     if “addi” then
>         RT ← (RA|0) + EXTS64(SI)
>     if “paddi” & R=0 then
>         RT ← (RA|0) + EXTS64(si0||si1)
>     if “paddi” & R=1 then
>         RT ← CIA + EXTS64(si0||si1)
> 
> 
> what ABSOLUTELY MUST NOT HAPPEN is:
> 
>     else if “sv.addi” then
>         RT ← 55 * (RA|0) + EXTS64(SI) << 5 + 99
> 
> and even more insane and damaging:
> 
>     else if “sv.addi” then # actually redefined as multiply
>         RT ← (RA|0) * EXTS64(SOME_TOTALLY_DIFFERENT_OPERAND)
> 
> it should not really even be necessary to put this into the addi spec:
> 
>     SVPrefix: PO=9  (24-bit RM)
>     Suffix  : PO=14 (RT RA SI)
> 
> why not? *because there is no change*: it is prohibited!
> 
> think about it.  if it was allowed, then even the spec goes
> haywire.  pages and pages of:
> 
>     addi:      {PO14}     DWI addi
>        DWI   : PO=14 (RT RA SI)
> 
>     paddi:            uses {PO1-PO14} but still "addi" (in effect)
>        Prefix: PO=1  ( R si0)
>        Suffix: PO=14 (RT RA si1)
> 
>     sv.somethingelseentirely  {PO9-PO14} conflicts with addi?? WTF????
>        SVPrefix: PO=9  (then 24-bit RM)
>        Suffix  : PO=14 (RT RA differentoperand)     XYZ99-Form
> 
> 
> total nightmare just from a spec-writing perspective,
> as well as damaging Decode.
> 
> the only things you are "allowed" to do is to extend the reg numbers.
> i do appreciate that given that PO1 does in fact "change" immediate
> operands (si0 || si1), flipping from using SI to instead using
> split-field si0-si1, it seems like i am freaking out, but i really
> *really* don't want to hit the Power ISA Spec with 200+ changes
> to the RTL and instruction definitions.

Yeah, neither do I. But the effects of vectorization do have to be completely and accurately described somewhere.

> if you really really are asking for split fields to be introduced
> for RT, RA, RS, RB, BA, BFA, FRS, FRC, BT (basically everything)
> then i feel the entire suite - over 200 {PO9-DWI}s - should be
> autogenerated.

Sorry, I don't get why you're talking about split fields here. I don't recall mentioning split fields in this discussion.
Comment 26 Paul Mackerras 2023-05-30 08:58:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Paul Mackerras from comment #4)
> 
> > In this and subsequent paragraphs, you keep hammering on the idea that the
> > underlying scalar instruction is (a) a word instruction 
> 
> this is mandatory and inviolate, yes.
> 
> > and (b) completely unchanged in its behaviour.
> 
> except for elwidth-overrides, yes.
> 
> it goes like this:
> 
>   for i in range(VL):
>       {DO_SCALAR_DEFINED_WORD_INSTRUCTION}
> 
> where predication would be:
> 
>   for i in range(VL):
>       if predicate_mask[i]:
>           {DO_SCALAR_DEFINED_WORD_INSTRUCTION}
> 
> the instruction *hasn't* changed - not in the slightest bit - just because it
> is predicated.
> 
> > Neither is in fact true. As to the unchanged
> > behaviour, it is not clear why that would be so crucially important even if
> > it were true.
> 
> dramatic simplification of Decode Phase.

At this point what we need explained is *what it is*. Things such as why it is good for it to be the way it is, or what other possible things you could think of that would be much much worse, are secondary considerations here. Once we understand what it is, then we might be able to understand those other things (or we might not care, if we are just trying to program the thing and don't know or care what a decode phase is, and still less have any ambition to extend the ISA in any direction).

> > Instead it would be far more helpful to put a complete list of the ways that
> > the behaviour of the underlying instruction is (or can be) changed.
> 
> strictly-speaking: element-width overrides, and that's it.
> that's the *only* change, and that's covered by ls005.xlen.
> 
>   for i in range(SVSTATE.VL):
>       if PREFIX.elwidth == 8:
>           {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=8-bit}
>       else if PREFIX.elwidth == 16:
>           {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=16-bit}
>       else if PREFIX.elwidth == 32:
>           {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=32-bit}
>       else if PREFIX.elwidth == 64:
>           {DO_SCALAR_DEFINED_WORD_INSTRUCTION... BUT @ XLEN=64-bit}
> 
> that *still* does not imply that the actual Defined Word Instruction
> itself "changed" [became a multiply instead of an add, had access
> to other operands, etc.]

Irrespective of what you would consider a "change" to the instruction, the reader does need to know at this point what could be different in the execution of the instruction compared to the plain instruction, and since this is a reference work, it needs to be a complete and accurate list.

The list would include the registers accessed, the element size, which bits of the destination are updated, the nature of the arithmetic operation (if saturating arithmetic can be specified), and whatever else could be different from the plain instruction.

And since this is a reference work and you gave a complete and accurate list, any other changes to execution are implicitly forbidden.

> the looping *REALLY IS* a completely separate concept from the
> {execution-of-the-Defined-Word-Instruction} - it *REALLY IS*
> a Sub-Current-Execution-Counter (CIA.SVSTATE)
> 
> 
> (strictly-speaking not so for Register numbers, which technically
>  become "split fields" - but i demonstrate here why attempting to
>  put those into the actual RTL - of every single damn instruction -
>  would be absolute hell
>  https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/ )

ah, is that where the split fields thing came from...
Comment 27 Luke Kenneth Casson Leighton 2023-05-30 11:14:37 BST
(In reply to Paul Mackerras from comment #4)
 
> In the last sentence you introduce a way of viewing this stuff only to then
> say that that isn't a good way to view it. So why introduce it at all?

there are no computer science terms for SV because they haven't been
invented.  i am therefore struggling badly to even introduce the
concepts behind it, let alone specify it in a succinct way.
Comment 28 Luke Kenneth Casson Leighton 2023-05-31 03:27:53 BST
(In reply to Paul Mackerras from comment #9)
> "Register files, elements, and Element-width Overrides" section:
> 
> I strongly disagree that the register file should be accessed in
> little-endian byte order when the processor is in big-endian mode.

just to check (1) register-to-register,
(note deliberate use of "addi" in 3rd step):

* let us set MAXVL=VL=1
* let us also use elwidth=16
* let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
  LSB0 63..........0
  MSB0 0..........63
       0000... 12 34
* perform "sv.addi/elwidth=16 5,0,0x1122"
* then inspect (verilator) GPR(5) and read its contents

is the answer you expect, regardless of LE/BE: 0x2356?
or would it be 
* 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
* 0x0000_0000_0000_3456 due to addi being implicitly
  reversed-byte-order from sv.addi under BE?

now the same thing with *scalar* instructions:

* let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
* perform "addi 5,0,0x1122"
* then inspect (verilator) GPR(5) and read its contents

is it *still* 0x23567 regardless of LE/BE?


checking (2) memory-to-register:

what about the same conditions (MAXVL=VL=1, a half-word load)
with lhbrx vs lhx?

* sv.lhbrx vs lhbrx, BE: same value loaded?
* sv.lhbrx vs lhbrx, LE: same value loaded?



if the answer in all cases (m2r&r2r) is "yes", then this is what i mean
by "instructions must be Orthogonal regardless of Prefix/Non-prefix"

if the answer in all cases is "no", then resisting the pressure
to break Orthogonality, these are some potential options:

* solution (1) is to add *scalar* instructions that perform the BRev
  (and then SV-Prefix those)
* solution (2) is to add *scalar* register-tagging (an SPR that
  marks a given register as "please reverse me on GPR read *and* write")
* solution (3) is to completely redesign the 24-bit SVP64 Prefix
  from scratch, reserving four bits for being able to reverse
  up to 4(!) operands (coping with FMAC)
* solution (4) just use in-place sv.brh *RT, *RA (where RT=RA)
  and go from there
Comment 29 Jacob Lifshay 2023-05-31 03:33:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> now the same thing with *scalar* instructions:
> 
> * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> * perform "addi 5,0,0x1122"
> * then inspect (verilator) GPR(5) and read its contents
> 
> is it *still* 0x23567 regardless of LE/BE?

i think you meant 0x2356, no 7
Comment 30 Paul Mackerras 2023-05-31 08:42:09 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> (In reply to Paul Mackerras from comment #9)
> > "Register files, elements, and Element-width Overrides" section:
> > 
> > I strongly disagree that the register file should be accessed in
> > little-endian byte order when the processor is in big-endian mode.
> 
> just to check (1) register-to-register,
> (note deliberate use of "addi" in 3rd step):
> 
> * let us set MAXVL=VL=1
> * let us also use elwidth=16
> * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
>   LSB0 63..........0
>   MSB0 0..........63
>        0000... 12 34
> * perform "sv.addi/elwidth=16 5,0,0x1122"

I think you mean sv.addi/elwidth=16 5,5,0x1122 (not 5,_0_,0x1122). I'll assume the 0 for RA is a typo caused by 3.27AM.

> * then inspect (verilator) GPR(5) and read its contents
> 
> is the answer you expect, regardless of LE/BE: 0x2356?
> or would it be 
> * 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
> * 0x0000_0000_0000_3456 due to addi being implicitly
>   reversed-byte-order from sv.addi under BE?

I would expect 0x1122_0000_0000_1234 in BE mode, since you have operated on element 0 and elements are 16 bits wide.

> now the same thing with *scalar* instructions:
> 
> * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> * perform "addi 5,0,0x1122"
> * then inspect (verilator) GPR(5) and read its contents
> 
> is it *still* 0x23567 regardless of LE/BE?

It's 0x2356 regardless of LE/BE.

If you did sv.addi/elwidth=64 5,5,0x1122 then the answer would be 0x2356 regardless of BE/LE.

> checking (2) memory-to-register:
> 
> what about the same conditions (MAXVL=VL=1, a half-word load)
> with lhbrx vs lhx?
> 
> * sv.lhbrx vs lhbrx, BE: same value loaded?
> * sv.lhbrx vs lhbrx, LE: same value loaded?

What are you assuming the element size is?

I am not clear at this point on how the element size affects loads and stores. Does an element size of 16 bits mean that a load does 1/4 of the usual number of bits, for instance?

If the element size in your example above is 64 bits, then I would expect sv.lhbrx and lhbrx to give the same value in the destination GPR. If the element size is some other value, I don't know what to expect.

> if the answer in all cases (m2r&r2r) is "yes", then this is what i mean
> by "instructions must be Orthogonal regardless of Prefix/Non-prefix"

I'm not sure what "yes" would mean in the addi case above. In any case, I would note that addi will in general give a different result from sv.addi/elwidth=16 in LE mode as well as in BE mode. For example, suppose r5 contains 0xffff initially.

addi 5,5,1 will give 0x10000 in r5
sv.addi/elwidth=16 5,5,1 will give 0 in r5 (assuming VL=1 and LE mode).
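a tiny python sketch of that difference (my own illustration, not project code: function names are made up): the 16-bit element add wraps modulo 2**16 and the upper 48 bits of r5 are simply not written.

```python
def addi_scalar(r5, imm):
    # plain addi: full 64-bit add, wraps modulo 2**64
    return (r5 + imm) & (2**64 - 1)

def sv_addi_ew16(r5, imm):
    # sv.addi/elwidth=16 with VL=1 (LE mode): operate on element 0 only,
    # i.e. the low 16 bits; the upper 48 bits are left untouched
    elem = (r5 + imm) & 0xFFFF
    return (r5 & ~0xFFFF) | elem

r5 = 0xFFFF
print(hex(addi_scalar(r5, 1)))   # 0x10000
print(hex(sv_addi_ew16(r5, 1)))  # 0x0
```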

> if the answer in all cases is "no", then resisting the pressure
> to break Orthogonality, these are some potential options:
> 
> * solution (1) is to add *scalar* instructions that perform the BRev
>   (and then SV-Prefix those)
> * solution (2) is to add *scalar* register-tagging (an SPR that
>   marks a given register as "please reverse me on GPR read *and* write")
> * solution (3) is to completely redesign the 24-bit SVP64 Prefix
>   from scratch, reserving four bits for being able to reverse
>   up to 4(!) operands (coping with FMAC)
> * solution (4) just use in-place sv.brh *RT, *RA (where RT=RA)
>   and go from there

I don't understand what problem these solutions are trying to solve. None of them seem to me to be necessary or even desirable. You keep introducing byte reversal, which is not ever required by my proposal.

In fact, depending on how elwidth affects loads and stores, there may be another answer to my original concern about loading an array of values into registers. It's possible that doing sv.ld/elwidth=16 r3,0(r4) with VL=4 will load four 16-bit elements into r3 in the right order for future operations, but I don't know for sure.
Comment 31 Jacob Lifshay 2023-05-31 09:22:19 BST
(In reply to Paul Mackerras from comment #30)
> (In reply to Luke Kenneth Casson Leighton from comment #28)
> > checking (2) memory-to-register:
> > 
> > what about the same conditions (MAXVL=VL=1, a half-word load)
> > with lhbrx vs lhx?
> > 
> > * sv.lhbrx vs lhbrx, BE: same value loaded?
> > * sv.lhbrx vs lhbrx, LE: same value loaded?
> 
> What are you assuming the element size is?

i'd assume elwid=16

> I am not clear at this point on how the element size affects loads and
> stores. Does an element size of 16 bits mean that a load does 1/4 of the
> usual number of bits, for instance?

no, memory access sizes are not modified by elwid, so sv.lhz/elwid=16 still loads 16 bits.

> You keep introducing byte
> reversal, which is not ever required by my proposal.

it is required in hardware that supports both endians since the byte reversal hardware is what changes whether little-endian or big-endian element indexing is used, by byte reversing inputs/outputs of operations (or any logically equivalent method that is likely much more efficient).

e.g., assuming your endian proposal with VL=4 r3=0x0123_4567_89ab_cdef
sv.ori/sw=8/dw=16 *r3, *r3, 0
in LE mode produces:
r3=0x0089_00ab_00cd_00ef
in BE mode produces:
r3=0x0001_0023_0045_0067

hardware would be something like (assuming registers and ALUs are in LE):

r3
|
| 0x0123_4567_89ab_cdef
v
rearrange all elements in BE order to LE order (a byteswap)
|
| 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef aka. 0xefcd_ab89_6745_2301
v
expand elements to 16-bit
|
| 0x0001, 0x0023, 0x0045, 0x0067, 0x0089, 0x00ab, 0x00cd, 0x00ef
| aka. 0x00ef_00cd_00ab_0089_0067_0045_0023_0001
v
4x ori with elwid=16
|
| 0x0001, 0x0023, 0x0045, 0x0067 aka. 0x0067_0045_0023_0001
v
rearrange all elements in LE order to BE order (like a byteswap)
|
| 0x0001_0023_0045_0067
v
r3
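the pipeline above can be modelled in a few lines of python (my own sketch, not project code, using one logically-equivalent formulation: BE reverses *element* order on register read and write, which for 8-bit source elements is the same thing as the byteswap shown):

```python
def to_bytes_le(v, n=8):
    # split a value into n bytes, LSB first
    return [(v >> (8 * i)) & 0xFF for i in range(n)]

def from_bytes_le(b):
    # reassemble bytes (LSB first) into a value
    return sum(byte << (8 * i) for i, byte in enumerate(b))

def sv_ori_sw8_dw16(r3, vl=4, big_endian=False):
    src = to_bytes_le(r3)              # sw=8: one byte per source element
    if big_endian:
        src = src[::-1]                # BE element indexing on read
    dst = []
    for i in range(vl):
        v = src[i] | 0                 # the (unchanged) ori operation
        dst.append([v & 0xFF, v >> 8]) # dw=16: 2-byte LE dest element
    if big_endian:
        dst = dst[::-1]                # BE element indexing on write
    return from_bytes_le([byte for e in dst for byte in e])

print(hex(sv_ori_sw8_dw16(0x0123_4567_89ab_cdef)))                   # 0x89_00ab_00cd_00ef
print(hex(sv_ori_sw8_dw16(0x0123_4567_89ab_cdef, big_endian=True))) # 0x1_0023_0045_0067
```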
Comment 32 Paul Mackerras 2023-05-31 10:14:13 BST
(In reply to Jacob Lifshay from comment #31)
> (In reply to Paul Mackerras from comment #30)
> > (In reply to Luke Kenneth Casson Leighton from comment #28)
> > > checking (2) memory-to-register:
> > > 
> > > what about the same conditions (MAXVL=VL=1, a half-word load)
> > > with lhbrx vs lhx?
> > > 
> > > * sv.lhbrx vs lhbrx, BE: same value loaded?
> > > * sv.lhbrx vs lhbrx, LE: same value loaded?
> > 
> > What are you assuming the element size is?
> 
> i'd assume elwid=16
> 
> > I am not clear at this point on how the element size affects loads and
> > stores. Does an element size of 16 bits mean that a load does 1/4 of the
> > usual number of bits, for instance?
> 
> no, memory access sizes are not modified by elwid, so sv.lhz/elwid=16 still
> loads 16-bits.

OK, thanks for clarifying that.

I think my example (of loading an array of four halfwords) then becomes sv.lhz/elwidth=16 with VL=4. Hopefully that gets optimized to a single 64-bit cache read.
Comment 33 Paul Mackerras 2023-05-31 10:21:40 BST
(In reply to Jacob Lifshay from comment #31)
> 
> it is required in hardware that supports both endians since the byte
> > reversal hardware is what changes whether little-endian or big-endian element
> indexing is used, by byte reversing inputs/outputs of operations (or any
> logically equivalent method that is likely much more efficient).
> 
> e.g., assuming your endian proposal with VL=4 r3=0x0123_4567_89ab_cdef
> sv.ori/sw=8/dw=16 *r3, *r3, 0
> in LE mode produces:
> r3=0x0089_00ab_00cd_00ef
> in BE mode produces:
> r3=0x0001_0023_0045_0067

Interesting example. I'll have to think about how I would implement that.

Ignoring BE for the moment, what kind of structure do you have in your design for handling this kind of source/destination width mismatch? Is it something like a bunch of multiplexers ahead of the ALU, or is there a more clever way to do it?

> hardware would be something like (assuming registers and ALUs are in LE):

(Actually, registers and ALUs don't have endianness.)

In any case, I accept that Simple-V is already so incredibly complex that you can't really afford the extra complexity of my idea. And if something like sv.lhz/elwidth=16 gets things into registers in the right order in BE mode (i.e. element-swapped compared with what a plain ld would do) then that answers at least one of my main problems with saying the register file is always LE.
Comment 34 Jacob Lifshay 2023-05-31 10:48:04 BST
(In reply to Paul Mackerras from comment #33)
> Interesting example. I'll have to think about how I would implement that.
> 
> Ignoring BE for the moment, what kind of structure do you have in your
> design for handling this kind of source/destination width mismatch? Is it
> something like a bunch of multiplexers ahead of the ALU, or is there a more
> clever way to do it?

i'll let luke answer that, but basically we can always fall back to element-at-a-time operation if the operation pattern is too complex to have dedicated fast-paths for (this is probably one we'll want fast-paths for). i'm not sure we have this specific part designed fully in hardware yet.

> 
> > hardware would be something like (assuming registers and ALUs are in LE):
> 
> (Actually, registers and ALUs don't have endianness.)

they do here, because you can access them with different element sizes, which give different results depending on how you translate element indexes to parts of a 64-bit word. this is exactly how memory also gains program-visible endian-ness (ignoring how it looks from external hardware), since you can just read the first byte of a 64-bit word and see if it was the msb or lsb byte.

> In any case, I accept that Simple-V is already so incredibly complex that
> you can't really afford the extra complexity of my idea.

yes, we already spent a lot of time going over basically the same idea before and de-facto decided the extra hardware complexity was not worth it both because of the extra cost and because luke couldn't hold it all in his head:
https://bugs.libre-soc.org/show_bug.cgi?id=560

> And if something
> like sv.lhz/elwidth=16 gets things into registers in the right order in BE
> mode (i.e. element-swapped compared with what a plain ld would do) then that
> answers at least one of my main problems with saying the register file is
> always LE.

yes, that's what happens afaict.

also, grev (a subset of what grevlut can do) can do any combination of any number of bit/byte/element swaps in one instruction, so we don't need to go through memory every time.
Comment 35 Luke Kenneth Casson Leighton 2023-05-31 14:13:44 BST
(In reply to Paul Mackerras from comment #30)

> I think you mean sv.addi/elwidth=16 5,5,0x1122 (not 5,_0_,0x1122).

ah! yes

> I'll assume the 0 for RA is a typo caused by 3.27AM.
> 
> > * then inspect (verilator) GPR(5) and read its contents
> > 
> > is the answer you expect, regardless of LE/BE: 0x2356?
> > or would it be 
> > * 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
> > * 0x0000_0000_0000_3456 due to addi being implicitly
> >   reversed-byte-order from sv.addi under BE?
> 
> I would expect 0x1122_0000_0000_1234 in BE mode, since you have operated on
> element 0 and elements are 16 bits wide.

ahhh now *that* makes it clear.  and is so far left-field of what i
was modelling/expecting from the combinatorial explosion of possibilities
that i couldn't possibly guess it :)

now, here's the thing (walk through the implications).  where the LE
element-access would be this:

     # assume everything LE-ordered and LSB-numbered
     gpr_width = 8   # bytes
     num_gprs = 128  # in "upper" SV Compliancy Levels
     GPR_sram = [0x00] * gpr_width * num_gprs
     src_elbytes = src_elwidth // 8
     for i in range(VL):
         bytenum = i * src_elbytes           # element offset in SRAM bytes
         ra_element_start = RA * gpr_width   # vector start position
         ra_element_start += bytenum         # element offset
         ra_element_end = ra_element_start + (src_elbytes - 1)
         ra_src_operand = GPR_sram[ra_element_start : ra_element_end + 1]

a BE-reversal of the underlying SRAM-access would be:

     # *still* assume everything LE-ordered and LSB-numbered
     gpr_width = 8   # bytes
     num_gprs = 128  # in "upper" SV Compliancy Levels
     GPR_sram = [0x00] * gpr_width * num_gprs
     src_elbytes = src_elwidth // 8
     for i in range(VL):
         offset = i * src_elbytes            # element offset in SRAM bytes
         gpr_num = offset // gpr_width       # relative GPR number
         bytenum = offset %  gpr_width       # byte-start in GPR
---->    bytenum = gpr_width - src_elbytes - bytenum  # BE-inversion
         # now finally we know the element-offset start pos
         ra_element_start = (gpr_num * gpr_width) + bytenum
         ra_element_start += RA * gpr_width  # add vector start position
         ra_element_end = ra_element_start + (src_elbytes - 1)
         ra_src_operand = GPR_sram[ra_element_start : ra_element_end + 1]


at which point i think you'd agree that trying to explain that to
programmers, that this is the underlying model, would be a bit much :)


> > now the same thing with *scalar* instructions:
> > 
> > * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> > * perform "addi 5,0,0x1122"
> > * then inspect (verilator) GPR(5) and read its contents
> > 
> > is it *still* 0x23567 regardless of LE/BE?
> 
> It's 0x2356 regardless of LE/BE.

and that discrepancy is a violation of (one of the) Orthogonality rule(s):
when MAXVL=VL=1 the behaviour *has* to be the same (elwidth
notwithstanding).

let us imagine that a programmer is converting Scalar Power Assembler
to SVP64.  they are doing so on a BE system.  assume that
GPR(5) starts out with the value 0x0000_1144_5566_7789. they do this:

     # old code
     addi 5,5,0x1122
     addis 5,5,0x3344
     # new code
     setvli MAXVL=VL=1
     sv.addi/elwidth=16 5,5,0x1122
     sv.addis/elwidth=32 5,5,0x3344

and then they inspect the contents of GPR(5) and find that it's not
0x0000_1144_88aa_88ab which you'd get from running the two scalar
instructions, it's:

    after the sv.addi/ew=16    0x1122_1144_5566_7789
    after the sv.addis/ew=32   0x4466_1144_5566_7789

!!!!! :)

they then run that in LE and get this:

     0x0000_1144_5566_7789 +
       0000 0000 0000 1122 +
       0000 0000 3344 0000

=      0000 1144 88aa 88ab

at which point their brains explode.

unpacking what the hell happened there (LE):

* sv.addi/ew=16 sets *two* byte-write-enable lines on GPR(5)
  leaving the entire upper 6 bytes *untouched*
* sv.addis/ew=32 sets the bottom *4* byte-write-enable lines
  leaving the entire upper 4 bytes untouched.


there is mad interaction between BE-offsets because the starting-points
for *elements within a given GPR* are critically dependent on the
operation width, and inversion of those starting-points becomes a
really crucial thing for the programmer to understand.
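that LE walkthrough can be sketched in terms of byte-write-enables (my own model; RA taken as r5 per the earlier correction):

```python
def masked_write(reg, value, nbytes):
    # model byte-write-enables: only the low `nbytes` bytes (element 0
    # in LE) are driven; the remaining bytes of the GPR are untouched
    mask = (1 << (8 * nbytes)) - 1
    return (reg & ~mask) | (value & mask)

r5 = 0x0000_1144_5566_7789
# sv.addi/ew=16: a 16-bit element add, two byte-write-enables set
r5 = masked_write(r5, (r5 & 0xFFFF) + 0x1122, 2)
assert r5 == 0x0000_1144_5566_88AB
# sv.addis/ew=32: a 32-bit element add of 0x3344_0000, four enables set
r5 = masked_write(r5, (r5 & 0xFFFF_FFFF) + 0x3344_0000, 4)
assert r5 == 0x0000_1144_88AA_88AB
```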


 
> If you did sv.addi/elwidth=64 5,5,0x1122 then the answer would be 0x2356
> regardless of BE/LE.

which means unfortunately that if you had a vector of elements to
add where you know the result fits in 16 bits (Audio/Video) then 3/4
of the regfile is unused.  which is the whole reason why SIMD was
invented.


now, REMAP *can* actually handle this type of element-reordering.
in effect what you are proposing is:

* BE ew=8, element ordering:
       MSB0          MSB63
       LSB63         LSB0
  GPR0 7 6 5 4 3 2 1 0
  GPR1 ........... 9 8

* BE ew=16, element ordering:
       MSB0    MSB63
       LSB63   LSB0
  GPR0   3  2  1  0
  GPR1   7  6  5  4
  GPR2 .....   9  8

* BE ew=32, element ordering:
       MSB0    MSB63
       LSB63   LSB0
  GPR0      1     0
  GPR1      3     2
  GPR2      5     4
  ...

and looking at those sequences, svshape2 can handle them each:

* BE ew=8     svshape2 xdim=8, xinv=yes
* BE ew=16    svshape2 xdim=4, xinv=yes
* BE ew=32    svshape2 xdim=2, xinv=yes

and you can also specify *which operands should be so re-ordered* (!!!)

as in: if you wanted to you could set RA and RB to be BE-reordered,
but leave RT in *LE*-reordered numbering (!!!)

is it necessary to have different svshape2 instructions to get reordering
depending on different elwidths? yes.  do i see this as a problem?
mmmm... honestly, no.
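the index transform those svshape2 settings perform can be sketched like this (my own function, not the svshape2 pseudocode itself):

```python
def xinv_remap(i, xdim):
    # x-inversion: reverse the element index within each row of xdim
    row, col = divmod(i, xdim)
    return row * xdim + (xdim - 1 - col)

# BE ew=16: 4 elements per 64-bit GPR, ordering 3 2 1 0 | 7 6 5 4
print([xinv_remap(i, 4) for i in range(8)])  # [3, 2, 1, 0, 7, 6, 5, 4]
```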


> > checking (2) memory-to-register:
> > 
> > what about the same conditions (MAXVL=VL=1, a half-word load)
> > with lhbrx vs lhx?
> > 
> > * sv.lhbrx vs lhbrx, BE: same value loaded?
> > * sv.lhbrx vs lhbrx, LE: same value loaded?
> 
> What are you assuming the element size is?

sigh, it used to be over-rideable up until about 2 weeks ago.
that's the way it was for 18 months.

but finally sanity asserted itself and the *data* elwidth is now
always the same as the *operation* width.

* lh ew=16
* lb ew=8
* lw ew=32
* ld ew=64

(note that there is still elwidth over-rides on the *Effective Address*
 calculation for LD/ST-Indexed "RB". i am currently wading through a
 really intrusive slightly scary spec update on that).


> I am not clear at this point on how the element size affects loads and
> stores. Does an element size of 16 bits mean that a load does 1/4 of the
> usual number of bits, for instance?

as in, sv.ld/ew=16 loading only 64/4 = 16 bits?

no, i decided for sanity to preserve the relationship "elwidth=opwidth".
loading only 1 bit (sv.lb/ew=8) would be a step too far i feel.


> > if the answer in all cases (m2r&r2r) is "yes", then this is what i mean
> > by "instructions must be Orthogonal regardless of Prefix/Non-prefix"
> 
> I'm not sure what "yes" would mean in the addi case above.

hence i went through the example.


>  In any case, I
> would note that addi will in general give a different result from
> sv.addi/elwidth=16 in LE mode as well as in BE mode. For example, suppose r5
> contains 0xffff initially.
> 
> addi 5,5,1 will give 0x10000 in r5
> sv.addi/elwidth=16 5,5,1 will give 0 in r5 (assuming VL=1 and LE mode).

yes it will! moreover, if r5 contained 0xffff_ffff_ffff_ffff then it
would be 0x0000_0000_0000_0000 in r5 after addi 5,5,1 but after
sv.addi/elwidth=16 5,5,1 it would be 0xffff_ffff_ffff_0000

"sv.addi." (Rc=1) gets interesting, too. another time.


and.. drat there is no "addio" darnit.  "sv.addio/ew=16" would have
dropped the 17th bit into XER.CA

that's slightly annoying but not the end of the world.


> I don't understand what problem these solutions are trying to solve. None of
> them seem to me to be necessary or even desirable. You keep introducing byte
> reversal, which is not ever required by my proposal.

i didn't understand it fully up to now.  the "0x1122_0000_0000_1234"
finally clinched it.


 
> In fact, depending on how elwidth affects loads and stores, there may be
> another answer to my original concern about loading an array of values into
> registers. It's possible that doing sv.ld/elwidth=16 r3,0(r4) with VL=4 will
> load four 16-bit elements into r3 in the right order for future operations,
> but I don't know for sure.

yes. Packed Elements. very similar to MMX.   wait... an immediate of
zero has a special meaning (Vector LD/ST is always complex, no matter
the ISA, but retro-fitting on top of Scalar LD/ST made things especially
hairy).

https://libre-soc.org/openpower/sv/ldst/

    The els bit is only relevant when RA.isvec is clear: this
    indicates whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

and the relevant pseudocode:

        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # we want this one
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width)  # we don't want this one

so, to match the english-language words you use with the assembler,
you wanted:

    sv.lh/ew=16/els r3,16(r4) 

which will load QTY4 16-bit contiguous elements starting at r4,
and drop them (also contiguously) into r3.

the original assembler you used:

    sv.ld/ew=16 r3,0(r4)  

will load *64-bit* quantities, TRUNCATE them to 16-bit, and drop the
TRUNCATED elements contiguously into r3.

(removing saturation, which had been in the LD/ST spec for 18+
 months, was last week's major-scary-edit, and that is down to
there being no Scalar "ld.": Rc=1 is the only way you could
activate notification (CRfield.SO) as to whether saturation
occurred)
Comment 36 Jacob Lifshay 2023-05-31 15:23:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Paul Mackerras from comment #30)
> > In fact, depending on how elwidth affects loads and stores, there may be
> > another answer to my original concern about loading an array of values into
> > registers. It's possible that doing sv.ld/elwidth=16 r3,0(r4) with VL=4 will
> > load four 16-bit elements into r3 in the right order for future operations,
> > but I don't know for sure.

i missed that that is ld and not lhz. if it's sv.lhz/elwid=16 *r3, 0(r4) with VL=4 then that loads a contiguous array of 4 16-bit elements (in LE or BE memory order as appropriate) into r3, in LE order as needed for svp64. e.g. in BE mode:
mem at *r4:
0x01 0x23 0x45 0x67 0x89 0xab 0xcd 0xef
result in r3 (in LE, since that's always what svp64 uses for registers):
0xcdef_89ab_4567_0123

because it treats it as loading uint16_t[4].

the load uses unitstride mode which is afaict what we want here.
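a quick python model of that load (my own sketch; the function name is made up):

```python
import struct

def sv_lhz_vl4(mem, big_endian):
    # load uint16_t[4] honouring the memory endianness, then place the
    # elements in r3 in LE element order (element 0 in the low halfword)
    fmt = '>4H' if big_endian else '<4H'
    elems = struct.unpack(fmt, mem)
    return sum(e << (16 * i) for i, e in enumerate(elems))

mem = bytes([0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef])
print(hex(sv_lhz_vl4(mem, big_endian=True)))   # 0xcdef89ab45670123
```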

>         elif svctx.ldstmode == elementstride:
>           # element stride mode
>           srcbase = ireg[RA]
>           offs = i * immed              # we want this one

no, afaict.
>         elif svctx.ldstmode == unitstride:
>           # unit stride mode
>           srcbase = ireg[RA]
>           offs = immed + (i * op_width)  # we don't want this one

we *do* want this one afaict.
> 
> so, to match the english-language words you use with the assembler,
> you wanted:
> 
>     sv.lh/ew=16/els r3,16(r4)
> 
> which will load QTY4 16-bit contiguous elements starting at r4,
> and drop them (also contiguously) into r3.

no it won't, it will load a 16-bit value every 16 *bytes* starting at r4.

lkcl, you were probably thinking of sv.lhz/elwid=16/els *r3, 2(r4) since u16 is 2 bytes
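the difference between the two stride modes, following the ldstmode pseudocode quoted earlier in the thread (my own helper, for illustration):

```python
def effective_addrs(base, immed, vl, op_width, els):
    # els=1 selects element-stride: offs = i * immed
    if els:
        return [base + i * immed for i in range(vl)]
    # unit-stride: offs = immed + i * op_width
    return [base + immed + i * op_width for i in range(vl)]

# sv.lhz/elwid=16/els *r3, 2(r4): one halfword every 2 bytes -> contiguous
print(effective_addrs(0x1000, 2, 4, 2, True))   # [4096, 4098, 4100, 4102]
# the earlier sv.lh/ew=16/els r3,16(r4): one halfword every 16 bytes
print(effective_addrs(0x1000, 16, 4, 2, True))  # [4096, 4112, 4128, 4144]
```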
Comment 37 Luke Kenneth Casson Leighton 2023-05-31 16:15:10 BST
> (in LE, since that's always what svp64 uses for registers):

it's way more than that: it's what the *Power ISA* uses for
registers (all of them).  it took several hours review of the
spec on VSX (pretty much every instruction) to *deduce* that
arithmetic is LE-ordered.  the key instruction which helped
determine it was a VSX shift instruction, which had a worked
example.

(this is not actually spelled out clearly and explicitly in
the Power v3.1 spec).
Comment 38 Luke Kenneth Casson Leighton 2023-05-31 17:26:10 BST
(In reply to Paul Mackerras from comment #33)
> (In reply to Jacob Lifshay from comment #31)
> > 
> > it is required in hardware that supports both endians since the byte
> > reversal hardware is what changes whether little-endian or big-endian element
> > indexing is used, by byte reversing inputs/outputs of operations (or any
> > logically equivalent method that is likely much more efficient).
> > 
> > e.g., assuming your endian proposal with VL=4 r3=0x0123_4567_89ab_cdef
> > sv.ori/sw=8/dw=16 *r3, *r3, 0
> > in LE mode produces:
> > r3=0x0089_00ab_00cd_00ef
> > in BE mode produces:
> > r3=0x0001_0023_0045_0067
> 
> Interesting example. I'll have to think about how I would implement that.

the key bit about the example jacob gives is, the source and destination
widths are different but obviously having full crossbars to SIMD ALUs
in front of regfiles may be far too many gates for some implementations
to handle.

therefore it is noted in the spec that "some implementations may be
slower if the source and dest elwidths are not the same" 
 
> Ignoring BE for the moment, what kind of structure do you have in your
> design for handling this kind of source/destination width mismatch? Is it
> something like a bunch of multiplexers ahead of the ALU, or is there a more
> clever way to do it?

i was planning 4 (or 8) lanes of *completely independent* ALUs, on Modulo-4
or Modulo-8.  whether this sits before or after Register-Renaming is
still undecided.

preceded (for read) / followed (for write) by a 64-bit-wide cyclic
shift queue, in lieu of a full crossbar.  "routing" becomes a simple
count-down with the difference between "(RA modulo 4)" (or mod8) and
"target ALU lane-position".  when that count-down reaches zero, the
In-Flight data is delivered.

this would work perfectly fine for both an Out-of-Order system and In-Order
but an In-Order one would be rather unhappy about the variable-latency
it introduces.  have to be quite careful about that, but even the latency
can be Deterministically calculated (assuming count-down Hazard Protection,
one per register-to-be-written, just like in Microwatt)

REMAP is where that gets *really* expensive (and hairy) but it is still doable.

the principal difference between Simple-V and other Vector ISAs:

normally the logic that would go into e.g. "xxperm" deep down in
one of the pipelines has been "promoted" up to first-order routing not
only on registers but *actual bytes* and now *interacts* with Register
Hazard Management.

it means Hardware Engineers get a bit of a jaw-dropping shock
("you want to do whuuuu?") but if you spend 4+ years thinking it
through it does actually work.

Brad very kindly prompted me here to expand "hphint" to a
first-priority means of making Hardware Engineers' lives tolerable:
you can set hphint *GREATER* than MAXVL and it tells Multi-Issue
Hardware that (hphint/MAXVL) batches may be spammed to backend
hardware in each clock cycle *WITHOUT* having to check Register
or Memory Hazards on *any* of the elements within each multi-issue
batch.
Comment 39 Luke Kenneth Casson Leighton 2023-05-31 17:52:13 BST
(In reply to Paul Mackerras from comment #24)

> The ISA as it stands has a property which is extremely useful, which is that
> with a couple of rare exceptions (see below), it is possible to analyse an
> instruction word (or doubleword) and know which CPU registers it is going to
> read and write, without knowing anything about the architected state of the
> CPU.

sigh, yes - the exception to that being the contents of MSR.
(MSR.LE, MSR.SF, and others).  SVSTATE has to be similarly
considered "a peer of MSR and PC" (and SVSHAPE0-3 if REMAP is
implemented, typically in 3D GPUs, HPC, and high-end A/V DSPs)

in Libre-SOC's HDL i have a special "regfile" containing
PC,MSR,SVSTATE,DEC,TB and when REMAP is implemented SVSHAPE0-3
will have to join them

> This simplifies the job of anything that wants to translate or emulate
> instructions, or generally understand what the effect of a block of code
> could be or the dependencies between instructions. Examples include
> valgrind, qemu, gdb, etc.

luckily the register EXTRA information is in the SVP64 24-bit Prefix.
otherwise we _would_ be in trouble, there.

(btw heads-up, the concept of "streaming" utterly borks that. ARM SVE
has "streaming" coming.  https://arxiv.org/pdf/2002.10143.pdf)

> [The exceptions are the lswx and stswx instructions, which use a byte count
> in the XER. The byte count controls the number of GPRs read or written. But
> modern compilers don't use lswx or stswx, and they always cause an alignment
> interrupt in LE mode.]

they were great when CPUs were 130 mhz and single-issue.  with multi-issue
it all goes to hell-in-a-handbasket and (following comp.arch regularly)
the general consensus is that LD/ST-Multi is history.

of course total irony then that Simple-V would *accidentally* re-introduce it:

    sv.ld/sm=r10/els *RT, 0(RA)


> If a side effect of adopting Simple-V is that this property no longer holds,
> then that is a serious problem in my view.

> If it is the case that a 32-bit
> instruction without any prefix could in some cases access different
> registers from those identified in the instruction word, depending on the
> state of the CPU (for example, depending on what is in the SVSTATE SPR),
> then you have broken this property.

well...
 
> I had thought there would be a clear and simple way to tell which
> instructions would be affected by vectorization (i.e., the presence of a
> SVP64 prefix).

all of them!  okok - everything-that-makes-sense.  mtmsr makes no sense.
sync makes no sense.

> But it sounds like that is not true, unfortunately.

it is... but i had to encapsulate it in a program (i sure as hell wasn't
going to do it by hand).

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD

this program *is* reasonably-obvious, the key function being "create_key()"
which analyses all instructions (remember i mentioned turning decode1.vhdl
into CSV files?) and then creates a "Register Profile footprint" (aka key)
that can be used to decide what bits in the 24-bit Prefix are to be used
to extend registers RT RS RA RB RC BA BFA BB BT FRT ...

sv_analysis.py does actually generate markdown tables so you can see what
it does.

https://libre-soc.org/openpower/opcode_regs_deduped/
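in a nutshell the idea is something like this (hypothetical sketch only -
field names invented, *not* the real create_key() code):

```python
from collections import defaultdict

def create_key(regs):
    # hypothetical: the register-footprint "key" is just the sorted
    # set of operand fields an instruction reads/writes
    return ",".join(sorted(regs))

def group_by_profile(insns):
    # insns: {mnemonic: operand-field tuple}, as if read from the
    # decode1.vhdl-derived CSV files
    groups = defaultdict(list)
    for name, regs in insns.items():
        groups[create_key(regs)].append(name)
    return groups
```

instructions sharing a footprint then share the same EXTRA-bit allocation.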

anyway - back to the registers: a reasonable way to think of the RTL is
that when Prefixed, all the registers have been "shifted" into a new
namespace.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/fixedarith.mdwn;hb=HEAD
 616 * maddhd RT,RA,RB,RC
 617 
 618 Pseudo-code:
 619 
 620     prod[0:(XLEN*2)-1] <- MULS((RA), (RB))
 621     sum[0:(XLEN*2)-1] <- prod + EXTS(RC)[0:XLEN*2]
 622     RT <- sum[0:XLEN-1]

===>


 620     prod[0:(XLEN*2)-1] <- MULS((SVP64.RA), (SVP64.RB))
 621     sum[0:(XLEN*2)-1] <- prod + EXTS(SVP64.RC)[0:XLEN*2]
 622     SVP64.RT <- sum[0:XLEN-1]

which is how i can convince myself that "the instruction meaning
did not change" - despite being Prefixed.

(will answer about Saturation and Predication in a followup)
Comment 40 Luke Kenneth Casson Leighton 2023-05-31 18:03:42 BST
(In reply to Paul Mackerras from comment #9)

> Is it possible to request saturating arithmetic with the SVP64 prefix on an
> addi instruction? 

yes.  and on logical (ori). we likely need a different meaning for that.

> If so then that would certainly count as a fundamental
> change to the operation being performed.

post-analysis.

(In reply to Paul Mackerras from comment #22)

> What about saturating arithmetic?

caveat: i simply haven't had time yet to even implement saturation
in the Simulator (ISACaller) - answer: i do not consider this to
be *modification* (strictly: "modification of the pseudocode")
 
> Could the instruction in fact do nothing, because of predication?

indeed it could!

   for i in range(VL):
       if not predicate[i]: continue
       GPR(RT+i) = GPR(RA+i) + GPR(RB+i)

note that the *pseudocode* - the actual add - is *not* modified.
that's what i mean by "the scalar instruction is not modified".
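a runnable sketch of that loop (a plain dict standing in for the regfile;
the 64-bit wrap is the only thing added):

```python
def sv_add(gpr, RT, RA, RB, VL, predicate):
    # the scalar add itself is untouched; predication only decides
    # whether each element's add happens at all
    mask = (1 << 64) - 1
    for i in range(VL):
        if not predicate[i]:
            continue
        gpr[RT + i] = (gpr[RA + i] + gpr[RB + i]) & mask
```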

now, it *may* be the case that we have to over-ride the meaning
of the "+" operator to make it possible to detect that overflow
occurred, but in the case of add in the Power ISA spec, well...
that's done anyway: it's called the "carry" flag XER.CA

now, in *hardware* that means (quoting Mitch Alsup from
comp.arch about 3 weeks ago) that if you want to keep to
a single pipeline you end up with about a 3-5% reduction in top clock
speed, because (and the HW engineers having done XER.CA will
know about this) the overflow flag needs to go into a MUX-bank
selecting

    "all 1s if XER.CA is set" or
    "the result if XER.CA is clear"

and that MUX-bank - at the end of the add-cascade - gives about
a 3-5% increase in gate-chain-count

(a workaround is a 2-stage pipeline)

but - and this is the key - i still *do not* consider "Saturation"
to be an ACTUAL modification of the INSTRUCTION.  the Pseudocode
does NOT change.
Comment 41 Paul Mackerras 2023-06-01 00:49:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #39)
> (In reply to Paul Mackerras from comment #24)
> 
> > The ISA as it stands has a property which is extremely useful, which is that
> > with a couple of rare exceptions (see below), it is possible to analyse an
> > instruction word (or doubleword) and know which CPU registers it is going to
> > read and write, without knowing anything about the architected state of the
> > CPU.
> 
> sigh, yes - the exception to that being the contents of MSR.
> (MSR.LE, MSR.SF, and others).

No, MSR.LE and MSR.SF don't affect which registers are read or written. They affect the value(s) written but not the identity of the registers concerned.

>  SVSTATE has to be similarly
> considered "a peer of MSR and PC" (and SVSHAPE0-3 if REMAP is
> implemented, typically in 3D GPUs, HPC, and high-end A/V DSPs)
> 
> in Libre-SOC's HDL i have a special "regfile" containing
> PC,MSR,SVSTATE,DEC,TB and when REMAP is implemented SVSHAPE0-3
> will have to join them
> 
> > This simplifies the job of anything that wants to translate or emulate
> > instructions, or generally understand what the effect of a block of code
> > could be or the dependencies between instructions. Examples include
> > valgrind, qemu, gdb, etc.
> 
> luckily the register EXTRA information is in the SVP64 24-bit Prefix.
> otherwise we _would_ be in trouble, there.

So every instruction whose behaviour is modified by vectorization has a SVP64 prefix? I haven't seen a clear and unambiguous answer as to whether that is true or not. (You do seem to say it is true below, except that each such statement seems to have some sort of caveat on it.)

It did seem like a "bare" addi (without SVP64 prefix) in a vertical-first loop might be subject to register index modification, element-width overrides, saturation, etc., from the VF loop. Does that happen, or is it the case that an addi without SVP64 prefix is never subject to any modification (i.e. it only ever accesses the GPRs specified by RA and RT in the instruction word)?

> (btw heads-up, the concept of "streaming" utterly borks that. ARM SVE
> has "streaming" coming.  https://arxiv.org/pdf/2002.10143.pdf)
> 
> > [The exceptions are the lswx and stswx instructions, which use a byte count
> > in the XER. The byte count controls the number of GPRs read or written. But
> > modern compilers don't use lswx or stswx, and they always cause an alignment
> > interrupt in LE mode.]
> 
> they were great when CPUs were 130 mhz and single-issue.  multi-issue it
> all goes to hell-in-a-handbasket and (following comp.arch regularly)
> general consensus is LD/ST-Multi is history.
> 
> of course total irony then that Simple-V would *accidentally* re-introduce
> it:
> 
>     sv.ld/sm=r10/els *RT, 0(RA)
> 
> 
> > If a side effect of adopting Simple-V is that this property no longer holds,
> > then that is a serious problem in my view.
> 
> > If it is the case that a 32-bit
> > instruction without any prefix could in some cases access different
> > registers from those identified in the instruction word, depending on the
> > state of the CPU (for example, depending on what is in the SVSTATE SPR),
> > then you have broken this property.
> 
> well...
>  
> > I had thought there would be a clear and simple way to tell which
> > instructions would be affected by vectorization (i.e., the presence of a
> > SVP64 prefix).
> 
> all of them!  okok - everything-that-makes-sense.  mtmsr makes no sense.
> sync makes no sense.

I was concerned with the case where there is no SVP64 prefix before an instruction. In that case, is it correct to say that it is guaranteed to behave exactly in all respects as specified in the current architecture, regardless of any values in SVSTATE or any other SPR?

> > But it sounds like that is not true, unfortunately.
> 
> it is... but i had to encapsulate it in a program (i sure as hell wasn't
> going to do it by hand).

OK, that's good.

> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/
> sv_analysis.py;hb=HEAD
> 
> this program *is* reasonably-obvious, the key function being "create_key()"
> which analyses all instructions (remember i mentioned turning decode1.vhdl
> into CSV files?) and then creates a "Register Profile footprint" (aka key)
> that can be used to decide what bits in the 24-bit Prefix are to be used
> to extend registers RT RS RA RB RC BA BFA BB BT FRT ...
> 
> sv_analysis.py does actually generate markdown tables so you can see what
> it does.
> 
> https://libre-soc.org/openpower/opcode_regs_deduped/
> 
> anyway - back to the registers: a reasonable way to think of the RTL is
> that when Prefixed, all the registers have been "shifted" into a new
> namespace.
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/
> fixedarith.mdwn;hb=HEAD
>  616 * maddhd RT,RA,RB,RC
>  617 
>  618 Pseudo-code:
>  619 
>  620     prod[0:(XLEN*2)-1] <- MULS((RA), (RB))
>  621     sum[0:(XLEN*2)-1] <- prod + EXTS(RC)[0:XLEN*2]
>  622     RT <- sum[0:XLEN-1]
> 
> ===>
> 
> 
>  620     prod[0:(XLEN*2)-1] <- MULS((SVP64.RA), (SVP64.RB))
>  621     sum[0:(XLEN*2)-1] <- prod + EXTS(SVP64.RC)[0:XLEN*2]
>  622     SVP64.RT <- sum[0:XLEN-1]
> 
> which is how i can convince myself that "the instruction meaning
> did not change" - despite being Prefixed.

This is the converse concern to mine. This is about the prefixed case, I was concerned about the non-prefixed case.

> (will answer about Saturation and Predication in a followup)
Comment 42 Paul Mackerras 2023-06-01 00:58:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #40)

> but - and this is the key - i still *do not* consider "Saturation"
> to be an ACTUAL modification of the INSTRUCTION.  the Pseudocode
> does NOT change.

I disagree. A saturating addition is not the same as a normal addition. If you don't change the RTL then it would no longer accurately describe the operation being performed, since the meaning of "+" is defined in section 1.3.4 to be two's complement addition. You either need to replace the "+" in the RTL with some notation for a saturating addition, or you need to add a statement that calls some kind of saturate() function on the result before writing it to the destination GPR.

Look at the RTL for vaddsbs, for instance. It does si8_CLAMP(src1 + src2). (The saturation functions are called xxx_CLAMP() in the vector chapter.)
Comment 43 Paul Mackerras 2023-06-01 01:14:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> > (in LE, since that's always what svp64 uses for registers):
> 
> it's way more than that: it's what the *Power ISA* uses for
> registers (all of them).  it took several hours review of the

This is a misunderstanding of endianness. Endianness concerns the relative weighting of individually addressable pieces of a larger entity. GPRs and FPRs don't have individually addressable pieces and hence don't have endianness.

You may perhaps be confusing the direction that carries propagate in arithmetic with endianness. Carries always propagate from less significant to more significant digits, because that's how arithmetic works. That is nothing to do with endianness. The Power ISA for a very long time was exclusively big-endian, but carries always propagated from less significant to more significant bits.

> spec on VSX (pretty much every instruction) to *deduce* that
> arithmetic is LE-ordered.  the key instruction which helped
> determine it was a VSX shift instruction, which had a worked
> example.

VSX does have individually addressable pieces of registers, in that the various permute-class instructions can address individual bytes using an index. In those cases the bytes are usually numbered left-to-right, so if anything, VSX registers are big-endian. In v3.1 there have been some instructions added which are "right-indexed", for example vpermr, which do things like "31-index" in the RTL to get the effect of little-endian indexing of the bytes of the register. So in that sense VSX registers can be considered to be of either endianness.
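For illustration, the two indexing conventions side by side (a sketch only,
using a 32-byte source as vpermr does; function names are invented):

```python
def byte_left(src32, index):
    # left-to-right numbering: index 0 selects the leftmost byte
    return src32[index]

def byte_right(src32, index):
    # "right-indexed" (vpermr-style): "31 - index" flips the numbering
    return src32[31 - index]
```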

> (this is not actually spelled out clearly and explicitly in
> the Power v3.1 spec).

Saying "arithmetic is LE-ordered" is a meaningless statement, which would be why it isn't in the ISA.
Comment 44 Paul Mackerras 2023-06-01 02:12:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Paul Mackerras from comment #30)
> 
> > I think you mean sv.addi/elwidth=16 5,5,0x1122 (not 5,_0_,0x1122).
> 
> ah! yes
> 
> > I'll assume the 0 for RA is a typo caused by 3.27AM.
> > 
> > > * then inspect (verilator) GPR(5) and read its contents
> > > 
> > > is the answer you expect, regardless of LE/BE: 0x2356?
> > > or would it be 
> > > * 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
> > > * 0x0000_0000_0000_3456 due to addi being implicitly
> > >   reversed-byte-order from sv.addi under BE?
> > 
> > I would expect 0x1122_0000_0000_1234 in BE mode, since you have operated on
> > element 0 and elements are 16 bits wide.
> 
> ahhh now *that* makes it clear.  and is so far left-field of what i
> was modelling/expecting from the combinatorial explosion of possibilities
> that i couldn't possible guess it :)
> 
> now, here's the thing (walk through the implications).  where the LE
> element-access would be this:
> 
>      # assume everything LE-ordered and LSB-numbered
>      gpr_width = 8 # bytrs
>      num_gprs = 128 # in "upper" SV Compliancy Levels
>      GPR_sram = [0x00] * gpr_width * num_gprs
>      src_elbytes = src_elwidth // 8
>      for i in range(VL):
>          bytenum = i * src_elbytes # element offset in SRAM bytes
>          ra_element_start = RA*gpr_width     # vector start position
>          ra_element_start += bytenum # element offset
>          ra_element_end   = ra_element_end + (src_elbytes-1)
>          ra_src_operand = GPR_sram[ra_element_start thru ra_element_end]
> 
> a BE-reversal of the underlying SRAM-access would be:
> 
>      # *still* assume everything LE-ordered and LSB-numbered
>      gpr_width = 8 # bytrs
>      num_gprs = 128 # in "upper" SV Compliancy Levels
>      GPR_sram = [0x00] * gpr_width * num_gprs
>      src_elbytes = src_elwidth // 8
>      for i in range(VL):
>          offset = i * src_elbytes           # element offset in SRAM bytes
>          gpr_num = offset // gpr_width      # relative GPR number  
>          bytenum = offset %  gpr_width      # byte-start in GPR
> ---->    bytenum = ~bytenum & 0b1111_1111   # BE-inversion

No, this isn't right.  It should be

         bytenum = bytenum ^ (8 - src_elbytes)

>          # now finally we know the element-offset start pos
>          ra_element_start = (gpr_num * gpr_width) + bytenum
>          ra_element_start += RA*gpr_width     # add vector start position
>          ra_element_end   = ra_element_end + (src_elbytes-1)         
>          ra_src_operand = GPR_sram[ra_element_start thru ra_element_end]
> 
> 
> at which point i think you'd agree that trying to explain that to
> programmers, that this is the underlying model, would be a bit much :)
> 
> 
> > > now the same thing with *scalar* instructions:
> > > 
> > > * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> > > * perform "addi 5,0,0x1122"
> > > * then inspect (verilator) GPR(5) and read its contents
> > > 
> > > is it *still* 0x23567 regardless of LE/BE?
> > 
> > It's 0x2356 regardless of LE/BE.
> 
> and that discrepancy is a violation of (one of the) Orthogonality rule(s).
> when MAXVL=VL=1 the behaviour *has* to be the same (elwidth
> notwithstanding)

The behaviour clearly does depend on elwidth (even in LE mode), because the scalar instruction writes all 64 bits of the register but the vectorized instruction with VL=1 only writes elwidth bits.
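The `bytenum ^ (8 - src_elbytes)` correction above can be sanity-checked
with a small sketch (assuming aligned, power-of-two element sizes):

```python
def be_byte_start(offset, gpr_width, src_elbytes):
    # byte-start of an element within one GPR, BE element ordering,
    # expressed in LE/LSB0 byte numbering: XOR with (width - elbytes)
    bytenum = offset % gpr_width
    return bytenum ^ (gpr_width - src_elbytes)
```

With 16-bit elements, element 0 lands at bytes 6-7 (the most-significant
pair in LSB0 numbering), matching the 0x1122_0000_0000_1234 expectation.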
Comment 45 Paul Mackerras 2023-06-01 02:20:39 BST
On further reflection, it seems like the point being made above by Luke might be something along these lines:

If you have a vector of elements stored in a register, and you want to operate on the 0th element and you don't care about corrupting the others, then you can in many cases use a scalar instruction instead of the equivalent vector instruction with VL=1. This is a valuable property.

This would be true for things like add, subtract, multiply, left shift, and bitwise operations, but not divide, right shift, compare.

I can see that this is a useful property.
Comment 46 Luke Kenneth Casson Leighton 2023-06-02 01:11:35 BST
(In reply to Paul Mackerras from comment #4)

> The RFC needs to be based on v3.1B rather than v3.0B (particularly since it
> depends on prefixed instructions).

ahhh... took me a while to think this through: strictly, it doesn't.
Prefixed-Prefixed 96-bit instructions are flat-out insane, so are
excluded.  we can't have sv.paddi or sv.pld.  and the SVP64 format
itself is independent of the Defined Word-Instruction.
(except for the annoying need to know the instruction category)
although, obtusely, even v2.0 or v1.0 could be SVP64-Prefixed!

a reason to update is down to wording, and what v3.1B contains.
the scalar RFCs i'll leave alone as there are 10+ of them and
it makes no odds: none of them are EXT1xx.
Comment 47 Jacob Lifshay 2023-06-02 06:02:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #46)
> a reason to update is down to wording, and what v3.1B contains.
> the scalar RFCs i'll leave alone as there are 10+ of them and
> it makes no odds: none of them are EXT1xx.

I'll note that all RFCs I'm writing are based on v3.1B, though they haven't all been submitted yet.
Comment 48 Luke Kenneth Casson Leighton 2023-06-02 09:38:26 BST
(In reply to Jacob Lifshay from comment #47)

> I'll note that all RFCs I'm writing are based on v3.1B, though they haven't
> all been submitted yet.

ah then do make sure that the header (lsNNN) wiki page and page-refs
all say "v3.1B".

we really do need to have some machine-readable conventions on
referring to pages (creating a ref-tag in the pandoc conversion)
Comment 49 Jacob Lifshay 2023-06-02 09:46:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #48)
> (In reply to Jacob Lifshay from comment #47)
> 
> > I'll note that all RFCs I'm writing are based on v3.1B, though they haven't
> > all been submitted yet.
> 
> ah then do make sure that the header (lsNNN) wiki page and page-refs
> all say "v3.1B".

I did, everywhere i saw a `v3.0` or a section number from v3.0:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls006.fpintmv.mdwn;h=36a67d26302b9a2f6d61060f1dd266ab279b8218;hb=HEAD#l18
Comment 50 Luke Kenneth Casson Leighton 2023-06-02 15:06:17 BST
(In reply to Paul Mackerras from comment #8)

> You say "and the four REMAP SPRs if in use at the time". How is an interrupt
> handler to know whether the REMAP SPRs are in use?

they are non-zero and the bits in SVSTATE are also non-zero.
https://libre-soc.org/openpower/sv/remap/?#svstate_remap_area

```
    |32:33|34:35|36:37|38:39|40:41| 42:46 | 62     |
    | --  | --  | --  | --  | --  | ----- | ------ |
    |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
```

having to perform a shift-and-mask on those bits, right in a
context-switch or function call pre- or post- amble, will
be *really* costly.
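the shift-and-mask in question, sketched (bit positions per the table
above, MSB0 bit n at LSB0 position 63-n; illustrative only, not normative):

```python
def remap_in_use(svstate, svshapes):
    # SVme is SVSTATE[42:46] (MSB0): LSB0 shift is 63 - 46 = 17
    svme = (svstate >> 17) & 0b11111
    # REMAP "in use": SVme bits non-zero *and* the SPRs non-zero
    return svme != 0 and any(s != 0 for s in svshapes)
```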

the solution to that is to add an SPR which has a "window"
onto those bits. will crossref here as it is the same
solution used by RISC-V debug/trap-emulate
https://git.openpower.foundation/isa/PowerISA/issues/143
Comment 51 Luke Kenneth Casson Leighton 2023-06-02 16:34:57 BST
(In reply to Paul Mackerras from comment #41)

> So every instruction whose behaviour is modified by vectorization has a
> SVP64 prefix? 

has to, yes.  HOWEVER... and this is waaay into the future: due to
the startling similarity to ZOLC i have long-term plans to *SEPARATE*
the 24-bits into a SEPARATE (3rd) L1 Cache.

    https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf

that will be a huge research project on its own.



> I haven't seen a clear and unambiguous answer as to whether
> that is true or not. (You do seem to say it is true below, except that each
> such statement seems to have some sort of caveat on it.)

it is.  an Embedded Finite State Machine (and Libre-SOC's TestIssuer
does this) would:

* read the PO9-word
* cache the 24-bit RM area and prohibit interrupts
* read the next 32-bit word
* throw {24}{32} at decode+issue+execute and re-enable interrupts

and that is an important Micro-Architecture to have (minimum resources).
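sketched in python (illustrative only - not TestIssuer code; PO9 taken as
the top 6 bits of the 32-bit word, helpers invented):

```python
def fetch_one(mem, pc, execute):
    # mem: pc -> 32-bit word.  returns the next pc.
    insn = mem[pc]
    if insn >> 26 == 9:                # PO9: SVP64 prefix word
        rm24 = insn & 0xFFFFFF         # cache the 24-bit RM area
        # (interrupts prohibited here, between the two words)
        execute(rm24, mem[pc + 4])     # {24}{32} to decode+issue+execute
        return pc + 8
    execute(None, insn)                # plain Defined Word-Instruction
    return pc + 4
```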

i considered at some point having an actual SPR to store the state in
between the two (the PO9-word and Defined-word-instruction) but i feel
it is a tiny bit overkill.

attitudes on that vary, certainly the use of the same technique in
RISC-V does not make people happy (the 18-bit in one instruction being
concatenated with a 12-or-so bit immediate in the following instruction)



> It did seem like a "bare" addi (without SVP64 prefix) in a vertical-first
> loop might be subject to register index modification, 

no, absolutely not.  ok, i considered it, it is called "register tagging"
which historically has been left by the wayside but is making a
comeback in "Vector Streaming" in ARM SVE, Eth-Zurich Snitch, and
the European Processor Initiative.

the problem with tagging is that it becomes part of the Architectural
State (an SPR or in this case *group* of SPRs), which massively
complicates simulators, debuggers, etc.
but also context-switch becomes absolute hell.

there are 3 bits needed per QTY 32-of GPR, FPR, CR, and  QTY 64-of VR.
4 is better. that's something like... what.... 32 64-bit SPRs? (!!!)
jacob and i went through a LOT of compression schemes on that, but
they were barely workable and involved a high instruction overhead.

also, imagine me thinking ahead and going "what would the ISA WG
accept?" - i don't bother with things that would not pass that filter :)


> element-width
> overrides, saturation, etc., from the VF loop. Does that happen, or is it
> the case that an addi without SVP64 prefix is never subject to any
> modification (i.e. it only ever accesses the GPRs specified by RA and RT in
> the instruction word)?

correct. [future-SVP64-Single on the other hand is an entirely different
matter, best left for another time]

what that means - and this is really neat and innovative - you can
use an *entire chain* of temporary intermediate Scalar instructions
in what otherwise is a Vector Loop

(!?!?!)

the classic example is Complex Number FFT.  i implemented that as
a Vertical-First Loop

what *that* means is that you are not wasting massive amounts
of temporary Vector Registers just because all the Vector Arithmetic
is "Horizontal" (elements-first): you can *mix and match*.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;h=fceb6b38#l612

 612             "sv.fmuls 24, *0, *16",    # mul1_r = r*cos_r
 613             "sv.fmadds 24, *8, *20, 24",  # mul2_r = i*sin_i
 ...
 620             "sv.ffadds *0, 24, *0",    # vh/vl +/- tpre

here it is:

* reading *vector-indexed* sources but the destination r24 is scalar
* r24 the *scalar* goes into FP mul-add-sub producing r24 *scalar*
* r24 *scalar* goes into twin /+- butterfly taking
  *0 as one side of the input-output and
  *(0+MAXVL) as the other

and if the loop is small enough to fit into Multi-Issue Reservation
Stations then WaW register-renaming may AUTO-VECTORIZE r24 and place
it into the exact same massive wide SIMD back-end ALUs as the other
(explicit) Vector registers.

Any Horizontal-only ISA whether Vector or SIMD *has* to allocate
an entire *Vector* r24 because there is no other option but to
work EXPLICITLY at the width of the SIMD/Vector register,
element-for-element.


> I was concerned with the case where there is no SVP64 prefix before an
> instruction. In that case, is it correct to say that it is guaranteed to
> behave exactly in all respects as specified in the current architecture,
> regardless of any values in SVSTATE or any other SPR?

absolutely correct. can you imagine the freaking-out that would occur?
i can :)

(and save/restore context-switch would become a nightmare, you'd have
no idea if you could safely use even one GPR!)
Comment 52 Luke Kenneth Casson Leighton 2023-06-02 16:51:15 BST
(In reply to Paul Mackerras from comment #42)
> (In reply to Luke Kenneth Casson Leighton from comment #40)
> 
> > but - and this is the key - i still *do not* consider "Saturation"
> > to be an ACTUAL modification of the INSTRUCTION.  the Pseudocode
> > does NOT change.
> 
> I disagree. A saturating addition is not the same as a normal addition.

it becomes:

   temp <- [0]*(XLEN+1)
   ra   <-  0b0 || (RA)
   rb   <-  0b0 || (RB)
   temp <- ra + rb
   if temp[0] then RT <- [1]*XLEN
   else            RT <- temp[1:XLEN]

which is, right up to the test of temp[0], identical to add; and when
addeo is examined i think you'll agree it is "if XER.CA then saturate".
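the same thing executable, unsigned-only and XLEN=64 (a sketch: remember,
saturate is not yet in ISACaller, so this is "unconfirmed" territory):

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1

def add_sat_u(ra, rb):
    # one bit of headroom, exactly as the pseudocode above: if the
    # carry-out (temp[0] in MSB0 numbering) is set, saturate to all-1s
    temp = (ra & ALL_ONES) + (rb & ALL_ONES)
    return ALL_ONES if temp >> XLEN else temp
```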




> If
> you don't change the RTL then it would no longer accurately describe the
> operation being performed, since the meaning of "+" is defined in section
> 1.3.4 to be two's complement addition. 

the sane thing to do is override that.  it already is (sort-of).
XER.CA is *not* in the actual RTL, only described in english language.

(i.e. the precedent of doing "a little bit more than what the RTL says"
 is already set).


> You either need to replace the "+" in
> the RTL with some notation for a saturating addition, or you need to add a
> statement that calls some kind of saturate() function on the result before
> writing it to the destination GPR.

at which point we have identical-RTL-except-the-call-to-Saturate
to maintain, QTY hundreds of instructions.

that ain't happenin :)


> Look at the RTL for vaddsbs, for instance. It does si8_CLAMP(src1 + src2).
> (The saturation functions are called xxx_CLAMP() in the vector chapter.)

been through the options.  post-analysis *like Rc=1 and XER.SO/CA*
was the sane conclusion; checking that would be v. helpful.  bear in
mind i have not got round to implementing Saturate in ISACaller,
so everything surrounding saturate is "unconfirmed" at the moment.

(i do NOT allow things to go into the spec without also implementing
 them in the Simulator, ISACaller, including full unit tests.
 Saturate is the very last "big" feature that simply hasn't had time
 to be implemented yet)
Comment 53 Luke Kenneth Casson Leighton 2023-06-02 16:58:48 BST
(In reply to Paul Mackerras from comment #25)
> (In reply to Luke Kenneth Casson Leighton from comment #12)

> > it is *always* a Defined-word-instruction, where that DWI **MUST**
> > have the exact same definition as if it had no prefix at all
> > (caveat: bar the niggles on elwidth).
> 
> I think it's not just element width, it's also the possiblity of doing
> multiple operations, 

it *is* a sequentially-ordered loop, that is mandatory.

> and potentially not doing some or all of them, 

standard In/Out-Order Single/Multi Issue Hazard techniques
have that covered.  even on REMAP.

> leaving the corresponding parts of the destination register unchanged.

solved with byte-level write-enable lines


> > split-field si0-si1, it seems like i am freaking out, but i really
> > *really* don't want to hit the Power ISA Spec with 200+ changes
> > to the RTL and instruction definitions.
> 
> Yeah, neither do I. But the effects of vectorization do have to be
> completely and accurately described somewhere.

indeed.

> > if you really really are asking for split fields to be introduced
> > for RT, RA, RS, RB, BA, BFA, FRS, FRC, BT (basically everything)
> > then i feel the entire suite - over 200 {PO9-DWI}s - should be
> > autogenerated.
> 
> Sorry, I don't get why you're talking about split fields here. I don't
> recall mentioning split fields in this discussion.

si0-si1 in EXT1xx, the IMM field is a new definition

EXTRA bits in the 24-bit RM prefix area combined with e.g. RT=insn[6:10]:
technically/pedantically RT as a 7-bit field is a "split field".
Comment 54 Luke Kenneth Casson Leighton 2023-06-02 17:40:44 BST
(In reply to Paul Mackerras from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #14)

> >  https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/ )
> 
> ah, is that where the split fields thing came from...

yes, from PO1-Prefixing, consider the {24-bit}{32-bit} in the same
way.  i consider such an approach to be a mistake for SV.

jacob and i went to a LOT of trouble to ensure that SV is an
orthogonal consistent RISC paradigm.

therefore just saying RT moves to "a Vector Context" uniformly
across the *ENTIRE* spec is both possible and reasonable.
Comment 55 Luke Kenneth Casson Leighton 2023-06-02 17:58:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #51)
> (In reply to Paul Mackerras from comment #41)

> > element-width
> > overrides, saturation, etc., from the VF loop. Does that happen, or is it
> > the case that an addi without SVP64 prefix is never subject to any
> > modification (i.e. it only ever accesses the GPRs specified by RA and RT in
> > the instruction word)?
> 
> correct.

i just realised this example is not fully illustrative, let me add
some extra instructions ..

> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/test_caller_svp64_fft.py;h=fceb6b38#l612
 
 612             sv.fmuls 24, *0, *16       # mul1_r = r*cos_r
 613             sv.fmadds 24, *8, *20, 24  # mul2_r = i*sin_i

                 fadds 25,24,31  # here is definitely scalar
                 fmuls 25,25,30  # do as many as you like..
                 fsubs 24,24,31  # as long as fp24 is dest

 620             sv.ffadds *0, 24, *0       # vh/vl +/- tpre

the reason i added those extra instructions, they are definitely
scalar.  fp24 as a destination could be mistaken for being
"Vector" because it is used in an SVP64-prefix context; it isn't.

you can tell the difference by the "*" notation (from c pointer
notation)
Comment 56 Luke Kenneth Casson Leighton 2023-06-04 12:20:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> a rather annoying quirk showed up: LD/ST-update when changed to EXTRA3
> and used for a linked-list-walking test came across a problem where
> the "normal" check "assert RA!=RT" must now be
> 
>     "assert RA||EXTRA3_bits != RT||EXTRA3_bits"
> 
> with the usual method being to write out a prefix as ".long XXXXX"
> followed by its 32-bit Defined Word (suffix) it will be necessary
> as a temporary workaround to use pysvp64dis and SVP64Asm to bypass this
> check and perform its own.
> 
> this needs to have a corresponding spec update as well as binutils

Paul this comes down to "RA" and "RT" *not* changing "meaning" but
instead... dare i say it... "moving to a different namespace"
(the uniform orthogonal SVP64-RISC-paradigm-namespace)

if we had to write this up as explicit RTL it would be extremely
dumb, being only an "operand qualification".

    if insn == "sv.ldu" then
       rt <- EXTRA3_PREFIXED(RT, SVRM.EXTRA) # 7-bit
       ra <- EXTRA3_PREFIXED(RA, SVRM.EXTRA) # 7-bit
    else
       rt <- RT                              # 5-bit
       ra <- RA                              # 5-bit

then replace "if RT != RA EXCEPTION()"
with         "if rt != ra EXCEPTION()"

and it looks a pretty pointless exercise; worse, it is unnecessary
work after which *the spec is rendered unreadable*!

thus it is much better to have a chapter that defines the
format and operand-maps, and says "apply this namespace as a
uniform transformation"

in this way not even the English Language parts of the Scalar spec
change.
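for illustration, a sketch of that 7-bit operand qualification
(hypothetical EXTRA3 GPR decode - the normative algorithm is in the
spec, this is just the shape of it):

```python
def extra3_gpr(field5, extra3):
    # hypothetical EXTRA3 decode: top bit of the 3-bit EXTRA field
    # selects vector; returns (7-bit register number, is_vector)
    if extra3 & 0b100:                             # vector: extend downward
        return (field5 << 2) | (extra3 & 0b11), True
    return ((extra3 & 0b11) << 5) | field5, False  # scalar: extend upward
```

the RA != RT check then compares the two 7-bit results rather than the
raw 5-bit fields.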

now, would it be useful to have some auto-generated header information
that shows SVP64 Operand Mapping, in the instruction definition?
it is debatable.

* on one hand it will interfere with spec readability due to reducing
  compactness (estimated 1.5 inches of PDF lines *per instruction*)
* on the other it will "make clear" the context of what SVP64 is
  actually about

to that end can i suggest doing a trial-run on one single instruction?
(i would recommend addi precisely because it is EXT1xx prefixable)

edit: made a start
https://libre-soc.org/openpower/sv/rfc/ls010/trial_addi/
Comment 57 Paul Mackerras 2023-06-05 23:54:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)
> (In reply to Paul Mackerras from comment #8)
> 
> > You say "and the four REMAP SPRs if in use at the time". How is an interrupt
> > handler to know whether the REMAP SPRs are in use?
> 
> they are non-zero and the bits in SVSTATE are also non-zero.
> https://libre-soc.org/openpower/sv/remap/?#svstate_remap_area
> 
> ```
>     |32:33|34:35|36:37|38:39|40:41| 42:46 | 62     |
>     | --  | --  | --  | --  | --  | ----- | ------ |
>     |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
> ```

Hmmm, I think a user program will expect the REMAP SPRs not to change underneath it even if those SVSTATE bits are zero.

So it looks like the only way to avoid having to save/restore the REMAP SPRs is if there is a facility enable bit for SV (or something like that) that would cause an interrupt if the program accesses any SV SPR.
Comment 58 Paul Mackerras 2023-06-06 00:35:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #51)
> (In reply to Paul Mackerras from comment #41)
> 
> > I haven't seen a clear and unambiguous answer as to whether
> > that is true or not. (You do seem to say it is true below, except that each
> > such statement seems to have some sort of caveat on it.)
> 
> it is.  an Embedded Finite State Machine (and Libre-SOC's TestIssuer
> does this) would:
> 
> * read the PO9-word
> * cache the 24-bit RM area and prohibit interrupts
> * read the next 32-bit word
> * throw {24}{32} at decode+issue+execute and re-enable interrupts
> 
> and that is an important Micro-Architecture to have (minimum resources).
> 
> i considered at some point having an actual SPR to store the state in
> between the two (the PO9-word and Defined-word-instruction) but i feel
> it is a tiny bit overkill.

I think there is an important point here about how Simple-V is understood and explained, which I hope I can get you to reconsider. You (Luke) seem adamant that a vectorized instruction is not to be thought of as a prefixed instruction, whereas it seems to me that explaining it as a prefixed instruction has a lot of advantages and is the natural way to think about it.

The reasons for thinking of the instruction word containing the vectorization parameters as a prefix, and the following instruction word as a suffix, are:

* The meaning and interpretation of the first word depend on the second. So the first word is not an independent instruction in its own right. If the first word were an instruction in its own right then you would be able to tell me exactly what it does without reference to any following instruction.

* The meaning and interpretation of the second word do in fact get changed by the first word. The first word can change which registers are accessed, which parts of the registers, and other aspects such as what type of addition is performed (saturating vs. two's-complement).

* You can't allow any interrupt, not even machine check or system reset, between the execution of the first word and the execution of the second. In other words you can't really let the first word "instruction" complete until the second word has done its work.

* If you started to execute the function identified by the two words together and then wanted to take an interrupt before you were finished, you would need to set SRR0 to point to the first word, not the second. But if you think of the two words as two separate instructions, you would naturally set SRR0 to point to the second word, which would be wrong.

* If the two words are really two separate instructions, then you don't really have any grounds for prohibiting certain instructions as the second word.

All of that says to me that the pair of words looks like a prefixed instruction and acts like a prefixed instruction. So why not just call it one?

> attitudes on that vary, certainly the use of the same technique in
> RISC-V does not make people happy (the 18-bit in one instruction being
> concatenated with a 12-or-so bit immediate in the following instruction)
> 
> 
> 
> > It did seem like a "bare" addi (without SVP64 prefix) in a vertical-first
> > loop might be subject to register index modification, 
> 
> no absolutely not.  ok, i considered it, it is called "register tagging"
> which historically has been left by the wayside but is making a
> comeback in "Vector Streaming" in ARM SVE, Eth-Zurich Snitch, and
> the European Processor Initiative.
> 
> the problem with tagging is that it becomes part of the Architectural
> State (an SPR or in this case *group* of SPRs), which massively
> complexifies simulators debuggers etc.
> but also context-switch becomes absolute hell.

Right.

Thinking about this in a way that requires some kind of state to be set by the execution of the first word which then affects what the second word does, really does seem to me to add complexity and not aid comprehension. Any kind of hidden state tends to create difficulties (I think the only hidden state we have in the ISA at the moment is the reservation), and if you expose that state then you make it possible to be set by other means, which opens another whole can of worms.

Having an actual SPR to expose that intermediate state would be a bad idea in my opinion because it would destroy the property of being able to tell what registers are read and written just from the instruction word(s). I realize that SV already lacks that property for any vectorized instruction, but making that intermediate state explicit, and settable by other means (e.g. mtspr) would destroy that property for any instruction with register operands (i.e. almost all of them).

[snip]

> > I was concerned with the case where there is no SVP64 prefix before an
> > instruction. In that case, is it correct to say that it is guaranteed to
> > behave exactly in all respects as specified in the current architecture,
> > regardless of any values in SVSTATE or any other SPR?
> 
> absolutely correct. can you imagine the freaking-out that would occur?
> i can :)

I was starting to do that freaking out, so yes.

> (and save/restore context-switch would become a nightmare, you'd have
> no idea if you could safely use even one GPR!)
Comment 59 Paul Mackerras 2023-06-06 01:05:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #52)
> (In reply to Paul Mackerras from comment #42)
> > (In reply to Luke Kenneth Casson Leighton from comment #40)
> > 
> > > but - and this is the key - i still *do not* consider "Saturation"
> > > to be an ACTUAL modification of the INSTRUCTION.  the Pseudocode
> > > does NOT change.
> > 
> > I disagree. A saturating addition is not the same as a normal addition.
> 
> it becomes:
> 
>    temp <- [0]*(XLEN+1)
>    ra   <-  0b0 || (RA)
>    rb   <-  0b0 || (RB)
>    temp <- ra + rb
>    if temp[0] then RT <- [1]*XLEN
>    else            RT <- temp[1:XLEN]
> 
> which is, right up to the test temp[0], identical to add

The RTL is different from a non-saturating add, so you have changed the function of the instruction.
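
The saturating-add pseudocode quoted above could be sketched in Python
as follows (a sketch only, not spec text; the unsigned-saturation
reading, MSB-0 bit numbering for temp[0], and XLEN=64 are taken from
the quoted RTL):

```python
XLEN = 64

def sat_addu(ra, rb, xlen=XLEN):
    # identical to a plain add right up to the carry-out test,
    # exactly as the quoted pseudocode describes
    temp = ra + rb               # the (xlen+1)-bit intermediate result
    if temp >> xlen:             # temp[0] in MSB-0 numbering: carry out
        return (1 << xlen) - 1   # saturate to all-ones
    return temp & ((1 << xlen) - 1)
```

which makes the point of contention visible: the final carry-out test
and all-ones substitution are precisely the part that a plain two's
complement "+" does not express.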

> and when addeo
> is examined i think you'll agree it is "if XER.CA then saturate"

No, I don't agree. It could be "if addc would set XER.CA then saturate" but we could be doing add or addi.

> > If
> > you don't change the RTL then it would no longer accurately describe the
> > operation being performed, since the meaning of "+" is defined in section
> > 1.3.4 to be two's complement addition. 
> 
> the sane thing to do is override that.  it already is (sort-of).
> XER.CA is *not* in the actual RTL, only described in english language.
> 
> (i.e. the precedent of doing "a little bit more than what the RTL says"
>  is already set).

As far as things that aren't mentioned in the RTL, yes. Can you point to any place where the RTL says a particular value is stored in a register but in fact a different value is stored?

> > You either need to replace the "+" in
> > the RTL with some notation for a saturating addition, or you need to add a
> > statement that calls some kind of saturate() function on the result before
> > writing it to the destination GPR.
> 
> at which point we have identical-RTL-except-the-call-to-Saturate
> to maintain, QTY hundreds of instructions.
> 
> that ain't happenin :)

It seems to me that we don't yet have a good solution to this problem of how to describe what vectorization does to existing scalar instructions, and it may take a while to come up with something that satisfies all parties.
Comment 60 Paul Mackerras 2023-06-06 01:09:53 BST
(In reply to Luke Kenneth Casson Leighton from comment #53)
> (In reply to Paul Mackerras from comment #25)
> > (In reply to Luke Kenneth Casson Leighton from comment #12)
> 
> > > it is *always* a Defined-word-instruction, where that DWI **MUST**
> > > have the exact same definition as if it had no prefix at all
> > > (caveat: bar the niggles on elwidth).
> > 
> > I think it's not just element width, it's also the possiblity of doing
> > multiple operations, 
> 
> it *is* a sequentially-ordered loop, that is mandatory.
> 
> > and potentially not doing some or all of them, 
> 
> standard In/Out-Order Single/Multi Issue Hazard techniques
> have that covered.  even on REMAP.

That's implementation. I was talking about specification (which by assuming the sequential execution model avoids all of that).

> > leaving the corresponding parts of the destination register unchanged.
> 
> solved with byte-level write-enable lines

Once again, that's implementation, not specification. The specification does need to specify which parts of the destination register are modified and which are left unchanged.

> 
> > > split-field si0-si1, it seems like i am freaking out, but i really
> > > *really* don't want to hit the Power ISA Spec with 200+ changes
> > > to the RTL and instruction definitions.
> > 
> > Yeah, neither do I. But the effects of vectorization do have to be
> > completely and accurately described somewhere.
> 
> indeed.
> 
> > > if you really really are asking for split fields to be introduced
> > > for RT, RA, RS, RB, BA, BFA, FRS, FRC, BT (basically everything)
> > > then i feel the entire suite - over 200 {PO9-DWI}s - should be
> > > autogenerated.
> > 
> > Sorry, I don't get why you're talking about split fields here. I don't
> > recall mentioning split fields in this discussion.
> 
> si0-si1 in EXT1xx, the IMM field is a new definition
> 
> EXTRA bits in the 24-bit RM prefix area combined with e.g. RT=insn[6:10]
> technically/pedantically RT as a 7-bit is a "split field".

The ISA, being a specification, does tend to be technical and pedantic. :)
Comment 61 Paul Mackerras 2023-06-06 01:27:18 BST
(In reply to Luke Kenneth Casson Leighton from comment #54)
> (In reply to Paul Mackerras from comment #26)
> > (In reply to Luke Kenneth Casson Leighton from comment #14)
> 
> > >  https://libre-soc.org/openpower/sv/rfc/ls010/hypothetical_addi/ )
> > 
> > ah, is that where the split fields thing came from...
> 
> yes, from PO1-Prefixing, consider the {24-bit}{32-bit} in the same
> way.  i consider such an approach to be a mistake for SV.
> 
> jacob and i went to a LOT of trouble to ensure that SV is an
> orthogonal consistent RISC paradigm.

That's good, but it's orthogonal to whether the pair of words are considered a single instruction or a pair of instructions.

I am not saying that you would need to define a split field for each register operand. There may well be a way to define the register operands precisely without having to do that. But I definitely consider the SV vectorization word and the following instruction word (which would be a scalar instruction in the absence of the preceding word) to be a single instruction.

I understand that you may feel that defining the pair of words as a single instruction might open the gates to defining very different interpretations of the second word when vectorized compared to what it would be on its own as a scalar instruction. The appropriate ways to defend against that are (a) don't define any such very different interpretation in your proposal, (b) add an Architecture Note explaining the consequences of defining a very different interpretation, and (c) stay involved with the ISA TWG and vote against any future proposal that seeks to add any such very different interpretation.

> therefore just saying RT moves to "a Vector Context" uniformly
> across the *ENTIRE* spec is both possible and reasonable.

It may well be possible to do that, but we would need to see the actual words.

I think that the saturating mode is very much a complicating factor here. If the effect of vectorization was limited to changing operand sizes and locations, that would be much easier to fit within the concept of a "vector context". Maybe you should consider defining saturating arithmetic instructions as new scalar instructions rather than having a saturating mode as part of vectorization.
Comment 62 Paul Mackerras 2023-06-06 02:08:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #56)
> 
> to that end can i suggest doing a trial-run on one single instruction?
> (i would recommend addi precisely because it is EXT1xx prefixable)
> 
> edit: made a start
> https://libre-soc.org/openpower/sv/rfc/ls010/trial_addi/

This has a lot going for it. It doesn't express the looping aspect of sv.addi, but that may well be preferable, i.e. we probably want to define the looping in the SV chapter and say that the individual instruction descriptions just define one iteration.

(I think you need to change 'if "addi" then' in the RTL to 'if "addi" or "sv.addi" then'.)

On that page you say "a danger of even declaring the existence "sv.addi RT,RA,SI" is the assumption that it is different from addi RT,RA,SI". People don't get to make assumptions about what sv.addi is; you get to define it for them. If people have latitude to make assumptions about it then the spec is not precise enough.
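
The model suggested above — looping defined once in the SV chapter,
each instruction description defining a single iteration — could be
sketched like this (all names hypothetical; addi is used purely for
illustration, with sign-extension of SI omitted for brevity):

```python
def sv_loop(vl, scalar_iter, regs, rt, ra, si):
    # the SV chapter defines the loop; the instruction description
    # supplies only the body of one iteration
    for i in range(vl):
        scalar_iter(regs, rt + i, ra + i, si)

def addi_iter(regs, rt, ra, si):
    # one iteration of addi: RT <- (RA|0) + SI
    # (register 0 as RA reads as zero, per the scalar convention)
    regs[rt] = ((0 if ra == 0 else regs[ra]) + si) & ((1 << 64) - 1)
```
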
Comment 63 Paul Mackerras 2023-06-06 02:16:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #54)
> 
> jacob and i went to a LOT of trouble to ensure that SV is an
> orthogonal consistent RISC paradigm.

Just as a side note, orthogonality does have an engineering cost, particularly in terms of verification. Sometimes it is pragmatically necessary to limit orthogonality in order to keep the verification state space manageable. In this case, that might mean having a defined set of scalar instructions which can be vectorized, rather than saying that almost any scalar instruction can be vectorized. I know that seems sub-optimal conceptually, but it may be necessary for practical reasons, particularly for an initial implementation. The set of vectorizable instructions can always be expanded later.
Comment 64 Luke Kenneth Casson Leighton 2023-06-06 15:15:48 BST
(In reply to Paul Mackerras from comment #63)
> (In reply to Luke Kenneth Casson Leighton from comment #54)
> > 
> > jacob and i went to a LOT of trouble to ensure that SV is an
> > orthogonal consistent RISC paradigm.
> 
> Just as a side note, orthogonality does have an engineering cost,
> particularly in terms of verification. Sometimes it is pragmatically
> necessary to limit orthogonality in order to keep the verification state
> space manageable. In this case, that might mean having a defined set of
> scalar instructions which can be vectorized, rather than saying that almost
> any scalar instruction can be vectorized. I know that seems sub-optimal
> conceptually, but it may be necessary for practical reasons, particularly
> for an initial implementation. The set of vectorizable instructions can
> always be expanded later.

responding reverse-order, got to this point, needs a re-read a couple
of times more.

summary is: i agree with you but it cannot be a free-for-all
(hence the Compliancy Levels, which need review)

some design context first:

the bare minimum implementation is fetch-decode-{LOOP}-issue-execute.
the LOOP on register numbers goes directly into the exact same
Register-Hazard Management as if the looping did not exist.

in these naive implementations even elwidth overrides would be
single-issue

thus a simple naive first-implementation may extend by just one
pipeline stage, use byte-writeable regfiles, and call it a day.

the next advancement (ignoring REMAP entirely) is to do (sequential)
batching just like Multi-Issue. in fact exactly like Multi-Issue.
the sequential nature of the looping allows for extremely easy Hazard
Management as long as you convert binary reg#s into unary-encoding:

    rt=3, ra=8, VL=3
=>  rt=0b000111000, ra=0b00011100000000

then detecting Hazard overlaps involves simple AND gates not
massive multi-ported CAMs.
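
The unary-encoding trick above can be sketched in Python (function
names hypothetical, for illustration only):

```python
def unary_mask(reg, vl):
    # a contiguous run of VL set bits starting at bit 'reg':
    # each set bit marks one register touched by the vector loop
    return ((1 << vl) - 1) << reg

def hazard_overlap(mask_a, mask_b):
    # detecting a hazard is then a bitwise AND of the two masks —
    # one AND gate per register rather than a multi-ported CAM
    return (mask_a & mask_b) != 0

# the example above: rt=3, ra=8, VL=3
rt_mask = unary_mask(3, 3)   # 0b000111000
ra_mask = unary_mask(8, 3)   # 0b00011100000000
```
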

elwidth overrides also end up with Hazard Management down at byte-level
but even here unary-encoding comes to the rescue.

REMAP the next complication simply sits between decode and Hazard
Management, shuffling the offsets *before* dropping it into Hazard
Read/Write tables. [this helps explain why i say that it has to
be Deterministic as this is a critical gate-latency juncture, right
smack in Decode/Issue: if you look up Indexed REMAP you will see that
modifying the GPRs after the svindex instruction is UNDEFINED]

before Multi-Issue Hazard Management tables get so insanely large
(several million gates) that clock speeds above 500 MHz are unattainable
no matter the geometry, there are two things to the rescue:
Write-after-Write elimination (aka "Register Renaming") combined with
SVSTATE.hphint.

hphint allows *intra-* batch Hazards to be utterly disregarded
*within the batch only* not *inter-* batch, and the renamed
batch gets thrown in a nice sequential order at the available
Function Units.

so that is the gamut / gauntlet of all possible (sane) implementations
based on industry-standard pre-existing Micro-Architectures.


now with that context in mind we may evaluate the proposal.

* the first insight that occurred to me is that your perspective *might*
  be that of a standard SIMD or standard Cray-Vector ISA.
  can i check whether or not you are thinking in terms of passing
  the entire Vector operation *including VL* down into the pipelines?

  this is a perfectly legitimate implementation, to use e.g. a FSM
  (like Microwatt's FP unit) with an additional for-loop *actually in*
  the FSM itself, and to set up a communications protocol with the
  regfile that not only contains the Reg# RT RA BA BB FRS etc but
  *also the offset index*.  thus when reading/writing to the
  regfile the Function Unit *itself* sends multiple (sequential)
  read/write requests in succession.  even potentially implements its
  own miniature Vector Chaining.
  https://en.m.wikipedia.org/wiki/Chaining_(vector_processing)

  but the key is that Hazard Management *still has to be done* even
  before issuing {Instruction}+{0..VL} down into the Function Unit
  (or {Instruction}+{0..3} {Instruction}+{4..7} {Instruction}+{8..VL}
   to multiple Function Units)

* thus logically the most complex part (not in naive implementations)
  is the Hazard Management and that has to be done anyway

* therefore in order to comply with the spec you *had* to do the hard
  bit (Dependency Matrices) and once done *every* Function Unit
  can use that.

* if a given instruction for any reason is too complex to parallelise
  with the combined context of Multi-Issue *and* Looping then there is
  no problem at all, just fall back to "naive" (single-issue) looping.
  if *really* a problem then the absolute bare-minimum fallback
  is that of single-step (like in debug mode): only allow one live
  instruction at a time.

* good examples where single-issue fallback would be strongly
  advised would be tdi and twi (yes they get Vectorized! they have
  RA and RB as sources!)

* in this light, *stopping* specific instructions from being Vectorized
  actually requires more complex Decoding!  ok, some
  implementations may fire an Illegal Instruction Trap.

* and this brings us neatly onto the SV Compliancy Levels, in effect,
  because there will be certain minimum levels of expected implementation
  performance within the anticipated categories
  (A/V DSP, GPU/HPC). given that trap-and-emulate will suck pretty
  badly on SVP64, end-users are highly likely to complain.

* bottom line, even if it is logical and sane from a hardware
  implementation perspective to not Vectorize some instructions
  it cannot become a free-for-all just as SFS and SFFS and all
  non-Vectorized Compliancy Levels cannot be a free-for-all,
  they exist for a reason and the exact same logic applies to
  Vectorized space.

* and bear in mind, just like in the Vulkan Spec managed by the Khronos
  Group, speed is *not* made mandatory: that is for implementors to decide,
  and compete on.  the spec's mandatoriness is on *what* is implemented
  so that software developers do not go insane.

* thus the discussion becomes about the SV Compliancy Levels so that
  software (HWCAPS_SVP64_xxxxx) does not end up in total meltdown.

compliancy levels: happy to have constructive input on them
https://libre-soc.org/openpower/sv/compliancy_levels/

regarding Verification: we (RED Semiconductor Ltd) HAVE to have
Compliancy Suites, and they will be FOSS-Licensed (Libre-SOC).
the Test API allows plugging in alternative implementations,
including autogenerating standalone Makefiles for static build and
test.
Comment 65 Luke Kenneth Casson Leighton 2023-06-06 15:45:42 BST
(In reply to Paul Mackerras from comment #62)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > 
> > to that end can i suggest doing a trial-run on one single instruction?
> > (i would recommend addi precisely because it is EXT1xx prefixable)
> > 
> > edit: made a start
> > https://libre-soc.org/openpower/sv/rfc/ls010/trial_addi/
> 
> This has a lot going for it. It doesn't express the looping aspect of
> sv.addi, but that may well be preferable, i.e. we probably want to define
> the looping in the SV chapter and say that the individual instruction
> descriptions just define one iteration.
> 
> (I think you need to change 'if "addi" then' in the RTL to 'if "addi" or
> "sv.addi" then'.)

the logical implication then unfortunately being that even non-Vectorized
instructions need "if addeo or sv.addeo", which will get very tedious
and again give the impression that there exists a difference between the
two.

the only one(s) with actually different RTL are LD/ST-postinc and
Branch-Conditional, and even there bc is a subset of sv.bc,
with certain (new) immediate-operands set to defaults that
are back-compatible to give bc

which hm gives me the wording i need for clarifying "orthogonal
behaviour"


> On that page you say "a danger of even declaring the existence "sv.addi
> RT,RA,SI" is the assumption that it is different from addi RT,RA,SI". People
> don't get to make assumptions about what sv.addi is; you get to define it
> for them. If people have latitude to make assumptions about it then the spec
> is not precise enough.

interesting. this important insight needs to be in the "spec writer's guide".
Comment 66 Luke Kenneth Casson Leighton 2023-06-06 16:27:29 BST
(In reply to Paul Mackerras from comment #57)
> (In reply to Luke Kenneth Casson Leighton from comment #50)

> > https://libre-soc.org/openpower/sv/remap/?#svstate_remap_area
> > 
> > ```
> >     |32:33|34:35|36:37|38:39|40:41| 42:46 | 62     |
> >     | --  | --  | --  | --  | --  | ----- | ------ |
> >     |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
> > ```
> 
> Hmmm, I think a user program will expect the REMAP SPRs not to change
> underneath it even if those SVSTATE bits are zero.

turns out that "zero equals disabled" applies not just to SVme...
now that i think about it, *only* SVme=zero means "REMAP entirely off",
which is nice...

... if SVme=0 *or* if SVSHAPE0-3 are zero, REMAP is disabled.

this allows programs to set up SVSHAPEn confident that nothing bad
will happen, and only when SVme gets set *later* to nonzero does
REMAP activate.
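
the rule converged on above could be sketched as follows (a sketch of
the proposed semantics in Python, not spec wording; names hypothetical):

```python
def remap_active(svme, svshapes):
    # REMAP is disabled when SVme is zero *or* when all of the
    # SVSHAPE0-3 SPRs are zero: programs may therefore safely
    # establish SVSHAPEn in advance, and REMAP only activates
    # once SVme is later set to nonzero
    return svme != 0 and any(s != 0 for s in svshapes)
```
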

> So it looks like the only way to avoid having to save/restore the REMAP SPRs
> is if there is a facility enable bit for SV (or something like that) that
> would cause an interrupt if the program accesses any SV SPR.

ahhh i love the concept, let me think it through.

first thought: in some way those 5 bits of SVme *are* the "facility
enable" bits (of REMAP rather than the whole of SV).

but in this case you really *do* want to be able to "establish"
SVSHAPE0-3 well in advance: particularly for Indexed-REMAP high
performance implementations want as "advance notice" as they can
possibly get, even to the extent in the chacha20 implementation
the svindex instructions are done *once* as part of "setup".

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;h=0a0feac1e#l132


second thought: testing beforehand, rather than just "mfspr r4, SVSTATE"
and being done with it, may actually be more costly!  but five SPRs is a
bit much.

ahhh i know.  a Trap Offset.  instead of 0x700, jump to 0x720 if REMAP is
active (when SVme != 0b00000).  or, jump to an entirely new set:

     REMAP_BASE_SPR[0:48] || 0x0700

just like Raptor's KAIVP SPR. (must do that as an RFC btw)

hmmm it might have to be:

     REMAP_BASE_SPR[0:63] + (0x0700 or other Trap addr)

what do you think?
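
the two candidate address calculations above could be sketched as (a
sketch of the proposal only, not spec text; the concatenation form is
read loosely here as "replace the low 16 bits of the SPR with the trap
offset", and the function names are hypothetical):

```python
def trap_addr_concat(remap_base_spr, offset=0x700):
    # first form: upper bits of REMAP_BASE_SPR concatenated
    # with the 16-bit trap offset
    return (remap_base_spr & ~0xFFFF) | offset

def trap_addr_add(remap_base_spr, offset=0x700):
    # second form: plain 64-bit addition of base and trap offset
    return (remap_base_spr + offset) & ((1 << 64) - 1)
```
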
Comment 67 Paul Mackerras 2023-06-06 23:20:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #65)
> (In reply to Paul Mackerras from comment #62)
> > (I think you need to change 'if "addi" then' in the RTL to 'if "addi" or
> > "sv.addi" then'.)
> 
> the logical implication then unfortunately being that even non-Vectorized
> instructions need "if addeo or sv.addeo" which will get very tedious
> and again give the impression there exists a difference between the
> two.

No, the adde RTL (addeo is in the adde/adde./addeo/addeo. group) should be fine as it is. There is no "if" in its RTL currently. The reason that addi has the "if" is to distinguish between addi and paddi. In the addi case, if you don't want to mention sv.addi you could restructure the RTL as:

  if "paddi" then
    if R=0 then
      ...
    else
      ...
  else
    RT <- (RA|0) + EXTS64(SI)

It should be only those instructions where there is an existing prefixed variant that would need to mention sv.xxx in the RTL (or where sv.xxx is actually different from xxx, of course).
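
The non-prefixed leg of that RTL, RT <- (RA|0) + EXTS64(SI), could be
sketched in Python as (a sketch following the ISA pseudocode
conventions; not spec text):

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def exts64(si, width=16):
    # sign-extend a 'width'-bit immediate field to 64 bits
    if si & (1 << (width - 1)):
        si -= 1 << width
    return si & MASK

def addi(regs, rt, ra, si):
    # RT <- (RA|0) + EXTS64(SI): the (RA|0) convention means
    # register 0 used as RA reads as the value zero
    base = 0 if ra == 0 else regs[ra]
    regs[rt] = (base + exts64(si)) & MASK
```
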

> the only one(s) actually different RTL are LD/ST-postinc and
> Branch-Conditional, and even there bc is a subset of sv.bc,
> with certain (new) immediate-operands set to defaults that
> are back-compatible to give bc
Comment 68 Luke Kenneth Casson Leighton 2023-06-07 03:15:37 BST
(In reply to Paul Mackerras from comment #67)

> No, the adde RTL (addeo is in the adde/adde./addeo/addeo. group) should be
> fine as it is. There is no "if" in its RTL currently. The reason that addi
> has the "if" is to distinguish between addi and paddi. 

indeed.

> In the addi case, if
> you don't want to mention sv.addi you could restructure the RTL as:
> 
>   if "paddi" then
>     if R=0 then
>       ...
>     else
>       ...
>   else
>     RT <- (RA|0) + EXTS64(SI)

ahh that works really well.

> It should be only those instructions where there is an existing prefixed
> variant that would need to mention sv.xxx in the RTL

as they're mutually-exclusive (no PO9-PO1-WordInstr and no PO1-PO9-WordInstr)
that shouldn't occur

> (or where sv.xxx is actually different from xxx, of course).

the only one is sv.bc (ok and sv.bcctr etc.) and i would recommend
it be done as entirely separate RTL, otherwise it's going to get
very odd.
Comment 69 Luke Kenneth Casson Leighton 2023-06-08 05:24:21 BST
(In reply to Paul Mackerras from comment #58)

> All of that says to me that the pair of words looks like a prefixed
> instruction and acts like a prefixed instruction. So why not just call it
> one?

i was looking for the right phraseology, to protect the Decode
Phase from having arbitrary opcodes dropped into it, and also
to ensure Orthogonality (not the same as "high performance",
see comment #64)

basically: agreed, yes, it is much better to call it a prefix.

i managed to work it out: it is actually the RTL and
the operands that must not change (nor the Reserved areas).
instructions *have* to be added (even if performance sucks)
to both the prefixed and unprefixed area with the exact same
RTL and operands, or not at all. even Unvectorized ones.

if there are any exceptions to that they must go through the
same Compliancy Level optional groupings as the Scalar set,
for exactly the same reasons that the Compliancy Subsets exist:
to avoid software chaos.


> > the problem with tagging is that it becomes part of the Architectural
> > State (an SPR or in this case *group* of SPRs), which massively
> > complexifies simulators debuggers etc.
> > but also context-switch becomes absolute hell.
> 
> Right.
> 
> Thinking about this in a way that requires some kind of state to be set by
> the execution of the first word which then affects what the second word
> does, really does seem to me to add complexity and not aid comprehension.

it is a variant of ARM SVE's MOVPRFX
https://www.google.com/search?q=movprfx+sve

but there they use actual GPRs as the "intermediate state" that
may be elided in high-performance designs.  aka "macro-op fused",
the defining characteristic of which is that the output from
instruction (1) is both the source *and destination* of
instruction (2).

(in SVP64 it would be the 24 RM bits that is written by the
1st "instruction" - prefix - then read in the "second" - suffix -
and overwritten to zero again in the second. and to an SPR not
a GPR).


> Any kind of hidden state tends to create difficulties (I think the only
> hidden state we have in the ISA at the moment is the reservation), and if
> you expose that state then you make it possible to be set by other means,
> which opens another whole can of worms.

macro-op fusion is pretty common in RISC ISA implementations, because
of the inefficiency involved in running the individual instructions.
fusing makes one Write-after-Write, one Read-after-Write *and* one
Write-after-Read all disappear, and one (internal, micro-coded)
Function Unit and Reservation Station is needed instead of two.
even an in-order system benefits because there is less in the
pipelines (one op not 2).

in general having the state saveable greatly decreases implementation
complexity, and SVP64 is no different in that regard.

 
> Having an actual SPR to expose that intermediate state would be a bad idea
> in my opinion because it would destroy the property of being able to tell
> what registers are read and written just from the instruction word(s). I
> realize that SV already lacks that property for any vectorized instruction,
> but making that intermediate state explicit, and settable by other means
> (e.g. mtspr) would destroy that property for any instruction with register
> operands (i.e. almost all of them).

true, but what really clinched it for me - why i didn't add it - is that
it is yet another SPR to save/restore.  i wanted SV to be only one
SPR (okok REMAP aside).  when SVSTATE was 32-bit in an early draft
it would have been possible to save the RM 24-bits. by the time the
dust settled only 4 bits remain spare in SVSTATE.
Comment 70 Luke Kenneth Casson Leighton 2023-06-08 05:35:00 BST
(In reply to Paul Mackerras from comment #58)

> * If you started to execute the function identified by the two words
> together and then wanted to take an interrupt before you were finished, you
> would need to set SRR0 to point to the first word, not the second. But if
> you think of the two words as two separate instructions, you would naturally
> set SRR0 to point to the second word, which would be wrong.

remember i was previously expecting SVSTATE to be a peer of MSR
and PC, and that SVSTATE would be saved in SVSRR1 exactly like
SRR0 and SRR1. if joined by SVSRR2 and by a save-sv-rm-24-bits
SPR then this issue is solved.  but it is costly.

when we last discussed this (eek a year ago?) and you persuaded me
that SVSRR1 was not needed (just avoid using SVP64 instructions
in context-switches), i did not anticipate the issue you raise above.

which eliminates simple implementations (treating the two as separate
instructions), they *have* to prohibit interrupts if reading only
one 32-bit word per clock cycle. slightly annoying but not the end of
the world.
Comment 71 Jacob Lifshay 2023-06-08 05:53:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #70)
> which eliminates simple implementations (treating the two as separate
> instructions), they *have* to prohibit interrupts if reading only
> one 32-bit word per clock cycle. slightly annoying but not the end of
> the world.

they can still be treated as separate instructions; the cpu just has to disable all interrupts and single-stepping between them, and count them differently with performance counters.

x86 does a similar thing for the mov ss, reg instruction (move to stack segment selector):
https://www.felixcloutier.com/x86/mov

> Loading the SS register with a MOV instruction suppresses or inhibits some debug
> exceptions and inhibits interrupts on the following instruction boundary.
> (The inhibition ends after delivery of an exception or the execution of the
> next instruction.) This behavior allows a stack pointer to be loaded into the
> ESP register with the next instruction (MOV ESP, stack-pointer value) before an
> event can be delivered.
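
the mechanism can be illustrated with a minimal Python sketch (hypothetical, not the actual ISACaller or any real decoder): a fetch FSM that reads one 32-bit word per cycle, recognises the SVP64 prefix by its primary opcode (PO9, i.e. 9, taken from the SV-Form proposal later in this thread), and inhibits interrupts on the boundary between prefix and suffix, exactly as the x86 MOV SS text above describes:

```python
# Hypothetical model (not the real ISACaller) of a one-word-per-cycle
# fetch FSM that inhibits interrupts between an SVP64 prefix word and
# its suffix, analogous to x86 "MOV SS" interrupt inhibition.

PO9 = 0b001001  # SVP64 prefix primary opcode (9), per the SV-Form draft


def is_svp64_prefix(word):
    """True if the top 6 bits of the 32-bit word are the prefix opcode."""
    return (word >> 26) == PO9


class SimpleFetchFSM:
    def __init__(self, memory):
        self.memory = memory            # list of 32-bit instruction words
        self.pc = 0                     # in units of 32-bit words
        self.inhibit_interrupts = False

    def step(self, interrupt_pending=False):
        # an interrupt may only be taken on a boundary where no
        # half-decoded prefix is outstanding
        if interrupt_pending and not self.inhibit_interrupts:
            return ("take-interrupt", self.pc)
        word = self.memory[self.pc]
        self.pc += 1
        if is_svp64_prefix(word):
            # prefix seen: the *next* boundary must not take interrupts
            self.inhibit_interrupts = True
            return ("prefix", word)
        self.inhibit_interrupts = False
        return ("execute", word)
```

with this model an interrupt arriving between prefix and suffix is deferred until after the suffix completes, so SRR0 never has to point into the middle of a prefixed pair.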

(luke, have we defined how SVP64 instructions interact with performance counters around what counts as an instruction? e.g. when VL=5, does
sv.add *r3, *r30, r0
count as 1 instruction, 5 instructions, 6 instructions, implementation-specific number of instructions, or something else?)
Comment 72 Luke Kenneth Casson Leighton 2023-06-08 21:17:41 BST
(In reply to Jacob Lifshay from comment #71)

> x86 does a similar thing for the mov ss, reg instruction (move to stack
> segment selector):
> https://www.felixcloutier.com/x86/mov

oooOoo :}

"Loading the SS register with a MOV instruction suppresses or inhibits some debug
exceptions and inhibits interrupts on the following instruction boundary.
(The inhibition ends after delivery of an exception or the execution of the
next instruction.)"

oo this is good wording.

prompted me to look up rep again
https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz

and now i know what to look for: the rep prefix allows interrupts
in between, but does *precisely* the same kind of "register" decrementing
that SVSTATE.srcstep (etc.) does.
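
the REP analogy can be sketched in a few lines of Python (illustrative only, not the real ISACaller, and simplified to all-vector operands): srcstep persists across an interrupt, so on return from the handler the loop resumes at the element where it stopped, exactly like x86 REP re-executing with a partially-decremented rCX:

```python
# Sketch of REP-style resumable looping: SVSTATE.srcstep survives an
# interrupt, so re-executing the prefixed instruction continues from
# the element where the loop stopped.

class SVState:
    def __init__(self, vl):
        self.vl = vl        # vector length
        self.srcstep = 0    # element index, persists across interrupts


def sv_add(svstate, regs, rt, ra, rb, interrupt_at=None):
    """Loop a scalar 'add' over VL elements; may stop early to service
    an interrupt, leaving srcstep set up for resumption (SRR0 would
    point back at the prefix)."""
    while svstate.srcstep < svstate.vl:
        i = svstate.srcstep
        if interrupt_at == i:
            return "interrupted"
        regs[rt + i] = regs[ra + i] + regs[rb + i]
        svstate.srcstep += 1
    svstate.srcstep = 0               # loop complete: reset for next use
    return "done"
```

run it once with an interrupt at element 3 and srcstep is left at 3; re-issue the same instruction and only elements 3 and 4 execute.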

> (luke, have we defined how SVP64 instructions interact with performance
> counters around what counts as an instruction? e.g. when VL=5, does
> sv.add *r3, *r30, r0
> count as 1 instruction, 5 instructions, 6 instructions,
> implementation-specific number of instructions, or something else?)

sigh no, yet another thing on the todo list. good catch
Comment 73 Luke Kenneth Casson Leighton 2023-06-11 23:25:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #72)

> > (luke, have we defined how SVP64 instructions interact with performance
> > counters around what counts as an instruction? e.g. when VL=5, does
> > sv.add *r3, *r30, r0
> > count as 1 instruction, 5 instructions, 6 instructions,
> > implementation-specific number of instructions, or something else?)
> 
> sigh no, yet another thing on the todo list. good catch

thoughts: very simple.  the looped element operations *are* instructions,
therefore count them as instructions, just like MOV SS and REP.  there
technically *are* no Vector operations at all, therefore count the suffix
executions only.  bit weird, but consistent.
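
as a worked example of that counting rule (a hypothetical PMU model, not anything defined in the spec yet): scalar word-instructions retire 1 count each, and an SVP64-prefixed instruction retires VL counts - one per looped suffix execution - with the prefix word itself contributing nothing:

```python
# Hypothetical "count suffix executions only" PMU rule: with VL=5,
# sv.add *r3,*r30,r0 retires 5 instruction counts (the looped
# suffixes), and the prefix word itself counts zero.

def count_retired(program):
    """program: list of ('scalar',) or ('sv', vl) entries.
    Returns the instructions-retired total under the
    suffix-executions-only counting rule."""
    total = 0
    for insn in program:
        if insn[0] == 'sv':
            total += insn[1]   # VL suffix executions, prefix counts 0
        else:
            total += 1
    return total
```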
Comment 74 Luke Kenneth Casson Leighton 2023-06-17 01:03:43 BST
Ah.  how about not adding a "Vector" Extension at all, just adding
a Prefix instruction:

SVPrefix  SV-Form

    |  0    5   | 6     29 | 30 31 |
    |   PO9     | SVRM     | svmode|

The SVPrefix instruction uses the following
Defined Word-instruction as a template for repetition in a loop.
(edit: "For formats see separate sections")

The SVPrefix instruction suppresses or inhibits some debug
exceptions and inhibits interrupts on the following instruction
boundary.

Hardware Engineering note: Implementations are possible from the
simplest Embedded Finite State Machine all the way to full Multi-Issue
Out-of-Order.  Multi-Issue systems should pay attention to
`SVSTATE.hphint` (reference-link) in order to greatly reduce hardware
complexity on Hazard Management. See section on SVSTATE SPR.

Special registers altered:

    None
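
the proposed SV-Form layout above can be decoded with a few shifts (a sketch only; field names are taken from the draft layout, and bit numbering is IBM/Power convention, bit 0 being the MSB of the 32-bit word):

```python
# Decode sketch for the proposed SVPrefix SV-Form:
#   | 0..5 PO9 | 6..29 SVRM | 30..31 svmode |
# (IBM bit numbering: bit 0 is the MSB of the 32-bit word)

def decode_svprefix(word):
    po     = (word >> 26) & 0x3f      # bits 0-5:   primary opcode (6 bits)
    svrm   = (word >> 2)  & 0xffffff  # bits 6-29:  SVRM (24 bits)
    svmode =  word        & 0x3       # bits 30-31: svmode (2 bits)
    return po, svrm, svmode
```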
Comment 75 Luke Kenneth Casson Leighton 2023-10-09 20:55:06 BST
the result of this bugreport *is* the feedback and discussion in
the comments, above. some suggestions have already been implemented;
however, with the purpose *being* to get feedback, this bug is now
closed as resolved. a new "future" bug (implementing the near-total
rewrite) is bug #1179.