Bug 1116 - evaluate, spec, and implement Vector-Immediates in SVP64 Normal
Summary: evaluate, spec, and implement Vector-Immediates in SVP64 Normal
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other
OS: Linux
Importance: --- enhancement
Assignee: Andrey Miroshnikov
URL:
Depends on:
Blocks:
 
Reported: 2023-06-10 02:23 BST by Luke Kenneth Casson Leighton
Modified: 2023-08-30 14:15 BST (History)
4 users

See Also:
NLnet milestone: NLnet.2022-08-107.ongoing
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 1027
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
jacob=1000 lkcl=1000


Description Luke Kenneth Casson Leighton 2023-06-10 02:23:27 BST
https://bugs.libre-soc.org/show_bug.cgi?id=1092#c18

the idea is that any sequence of immediates may be repeated
*as a vector* in the instruction stream.  some serious caveats
are needed as this is right at the Decode Phase.

task list:

* DONE: basic concept (comment #6)
* TODO: specification writeup (Andrey to find if there is a suitable
        bug under the OPF-ISA-WG Grant, "stage 2" questions & feedback)
* TODO: add the relevant SVSTATE bit to openpower-isa
* TODO: add sv.bc to svbranch.mdwn (which implicitly sets
        the relevant SVSTATE bit), see link below
* TODO: add "redirect" similar to sv.bcctr 
* TODO: add read-cache of immediates when sv.bc is called,
        "special-case" in ISACaller to read all 16-bit blocks
        into a python list
* TODO: when the relevant SVSTATE bit is set, get the immediate inside
        the loop: if index=0 the Suffix-immediate is used, otherwise it
        comes from the python list. read BEFORE the REMAP engine is
        active, sort out REMAP later.
        WARNING THIS REQUIRES AN OVERRIDE PARAMETER IN PowerDecode2.
        an additional "const" source is needed where D DS SI UI all
        constant sources from DecodeAImm are OVERRIDDEN if a bitflag
        is set HIGH. ISAcaller needs to "yield eq" that flag from
        SVSTATE.
* TODO: when the Loop ends (PC moves on, in HF, reset in VF) then
        the relevant SVSTATE bit is CLEARED. see persistent SVSTATE
        bit for an inkling of how that works.
* TODO: Horizontal-First Unit test
* TODO: Vertical-First Unit test

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/svbranch.mdwn;hb=HEAD
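the per-element immediate selection in the TODO list above can be
sketched in python. all names here (select_immediate, imm_cache,
vector_imm_flag) are hypothetical, NOT actual openpower-isa
identifiers:

```python
def select_immediate(index, suffix_imm, imm_cache, vector_imm_flag):
    """pick the immediate for element `index` of the SVP64 loop.

    index 0 uses the immediate already encoded in the 32-bit suffix;
    later elements come from the read-cache of 16-bit blocks scooped
    out of the instruction stream when the SVSTATE bit was set.
    """
    if not vector_imm_flag or index == 0:
        return suffix_imm
    # imm_cache holds the trailing 16-bit blocks, one per remaining
    # element: no compression, unused bits IGNORED
    return imm_cache[index - 1]
```

(a sketch only: the real override has to happen in PowerDecode2, where
the D/DS/SI/UI constant sources from DecodeAImm get overridden by the
bitflag, as the WARNING above says.)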
Comment 1 Luke Kenneth Casson Leighton 2023-06-10 02:45:00 BST
(In reply to Luke Kenneth Casson Leighton from https://bugs.libre-soc.org/show_bug.cgi?id=1092#c18  )

> https://libre-soc.org/openpower/sv/normal/
> 
> | 0-1    |  2  |  3   4  |  description                     |
> | ------ | --- |---------|----------------------------------|
> | 0   0  |   0 |  dz  sz | simple mode                      |
> | 0   0  |   1 |  RG  0  | scalar reduce mode (mapreduce)   |
> | 0   0  |   1 |  /   1  | reserved                         |
> | 1   0  |   N | dz   sz |  sat mode: N=0/1 u/s             |
> | VLi 1  | inv | CR-bit  | Rc=1: ffirst CR sel              |
> | VLi 1  | inv | zz RC1  | Rc=0: ffirst z/nonz              |
> 
> there's room in that (just) for a bit that says
> "immediates are Vectorised".  ok: using mode[4]
> says "immediates are Vectorised".

and given that no immediates are greater than 16-bit, it is
possible to just ignore elwidth overrides here

> that still leaves mode[3] for some sort of decision.

or another mode in future.  best to have mode[3:4]=0b01
and reserve other combinations.

> the neat thing about this is that even sv.addi can load
> an array of immediates.  oris as well.

the *entire pattern* of 5 instructions to load 64-bit immediates
can be Vectorised:

    addi   rt, 0, #nnnn
    addis  rt, rt, #nnnn
    rldicr rt, rt, 32, 31
    ori    rt, rt, #nnnn
    oris   rt, rt, #nnnn

becomes:

    sv.addi/vi rt,0,#nnnn
    ...

for sv.fli/vi (see https://bugs.libre-soc.org/show_bug.cgi?id=1092#c19)
it is a simple matter of inlining multiple instructions.

i would strongly suggest though *not* trying to piss about
with binutils syntax, just have ".long 0xnnnnnnnn" after it.

> as we discussed yesterday it requires an "Unconditional
> Branch" effect, and i'd recommend it be on MAXVL not VL.
> also to round-up to the nearest 4-bytes.

MAXVL allows for dynamic code to *change the number of immediates loaded*
which is extremely important given that this is compile-time static.
 
> if RM."immediate-mode":
> 
>     NIA = CIA + CEIL(MAXVL * sizeof(immediate), 4)

forgot that of course the 1st immediate is already in the instruction,
and the immediate width is set hardcoded to 16:

  if RM.normal."vector-immediate-mode":
     NIA = CIA + CEIL((MAXVL-1) * 16, 4)

i think not having to read elwidth here will be *really* important,
otherwise the Decoder has a hell of a job.

it is going to be tough enough to identify that this is
"Unconditional Branch": not only does the suffix need identifying
(to find out if it is RM.normal) but the "vector-immediate-mode"
itself needs decoding...

... oh and *then* the new PC can be calculated.

to that end this is DEFINITELY something that goes into the
"Upper" Compliancy Levels.

> jacob you mentioned during the meeting that this would
> be "slow" i.e. dependent on Architectural State (SVSTATE),
> if someone modified SVSTATE with mtspr then things get
> slow: this is *already* in the spec.

it's that some implementations will have caches of where SVSTATE was,
but others will not.
Comment 2 Jacob Lifshay 2023-06-11 11:52:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
>      NIA = CIA + CEIL((MAXVL-1) * 16, 4)

that's wrong: if you want a vector of 16-bit immediates you want:
extra_immediates = MAXVL - 1
extra_bytes = extra_immediates * 2
extra_words = -((-extra_bytes) // 4)  # ceil div
NIA = CIA + 8 + 4 * extra_words
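as a runnable sketch (function name made up, not from openpower-isa),
the corrected computation above:

```python
def nia_16bit_imms(cia, maxvl):
    """next instruction address for fixed 16-bit vector immediates.

    the first 16-bit immediate is already inside the 32-bit suffix,
    so only MAXVL-1 extra immediates follow the 8-byte sv. insn.
    """
    extra_immediates = maxvl - 1
    extra_bytes = extra_immediates * 2
    extra_words = -((-extra_bytes) // 4)  # ceil division
    return cia + 8 + 4 * extra_words
```

e.g. at CIA=0x10 with MAXVL=1 there are no extra immediates and NIA is
simply CIA+8 = 0x18; MAXVL=5 needs 8 extra bytes (2 words), giving 0x20.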

that said, if we're going to have vector immediates at all, they should also account for subvl. it would also be very nice to account for elwid too since you have to decode a bunch of the prefix and suffix anyway (see note below):

XLEN = max(sw, dw)  # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit
bytes = (XLEN // 8) * subvl * MAXVL
bytes -= 2  # first 2 bytes potentially encoded in 32-bit insn
bytes += 8  # sv prefix + 32-bit insn
bytes = (bytes + 3) & ~3  # round up to words
NIA = CIA + bytes
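the elwidth-aware pseudocode above, restated as runnable python
(function name hypothetical; sw/dw are the source/dest element widths
in bits, and the f16/bf16 TODO from the pseudocode still applies):

```python
def nia_elwidth_imms(cia, sw, dw, subvl, maxvl):
    """next instruction address when immediates follow at elwidth size."""
    xlen = max(sw, dw)          # TODO: f16/bf16 are 16-bit, not 8/16-bit
    nbytes = (xlen // 8) * subvl * maxvl
    nbytes -= 2                 # first 2 bytes live in the 32-bit suffix
    nbytes += 8                 # sv prefix + 32-bit insn
    nbytes = (nbytes + 3) & ~3  # round up to words
    return cia + nbytes
```

this reproduces the demo program below: the sv.addi/w=32 at 0x10 with
MAXVL=2 needs 2*4-2+8 = 14 bytes, rounded up to 16, so NIA = 0x20.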

this allows trivially loading a vector of 64-bit immediates in one instruction -- better than any fli proposed so far.

decoding note: i expect cpus to generally treat a vector load immediate as an unconditional jump -- this means they don't try to read instructions after the load immediate in the same cycle as the load immediate. taking longer to decode the length is therefore perfectly fine: the instruction-start prefix-sum tree can just treat it as a 64-bit instruction and clear out all attempted instructions after it, leaving time for the full decoder to decode the correct length and redirect fetch to the correct location.

it can be treated like a jump so the next instruction address gets added to the branch target buffer and the next-pc logic will speculatively fetch from the correct location on the next cycle, even before decoding has started.

demo program:
0x08: ori r10, r10, 5
0x0c: and r10, r11, r10
0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0]  # vector immediate
0x20: sv.add/w=32 *r3, *r3, *r3
...

demo pipeline with 64-bit fetch width

| cycle | next-pc/BTB  | fetch    | len-decode/tree | decode                |
|-------|--------------|----------|-----------------|-----------------------|
| 0     | 0x08         |          |                 |                       |
|       |              |          |                 |                       |
| 1     | 0x10         | 0x08 ori |                 |                       |
|       | BTB has 0x20 | 0x0c and |                 |                       |
| 2     | 0x20         | 0x10 sv. | 0x08 ori len=4  |                       |
|       |              | 0x14 addi| 0x0c and len=4  |                       |
| 3     | 0x28         | 0x20 sv. | 0x10 sv. len=8  | 0x08 ori     NIA=0x0c |
|       |              | 0x24 add | 0x14 addi len=4 | 0x0c and     NIA=0x10 |
| 4     | ...          | ...      | 0x20 sv. len=8  | 0x10 sv.addi NIA=0x20 |
|       |              |          | 0x24 add len=4  |                       |
| 5     | ...          | ...      | ...             | 0x20 sv.add  NIA=0x28 |
|       |              |          |                 |                       |
Comment 3 Jacob Lifshay 2023-06-11 11:57:18 BST
(In reply to Jacob Lifshay from comment #2)
> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.

actually, that doesn't work since it needs to fetch all the immediate bytes too, not just skip over them.
Comment 4 Luke Kenneth Casson Leighton 2023-06-11 14:26:35 BST
(In reply to Jacob Lifshay from comment #2)

> that said, if we're going to have vector immediates at all, they should also
> account for subvl.

*click* yes of course.  hmmm is that in 2 bits that are not affected by
anything?  (as in: can it be picked up from the prefix and *guaranteed*
to be easy to get? yes it can!)

> it would also be very nice to account for elwid too since
> you have to decode a bunch of the prefix and suffix anyway (see note below):

that's exactly why i'm *not* recommending elwidth be part of it,
precisely because it requires the prefix-suffix combination.
the only "decode" needed is "is this instruction Arithmetic type",
and a special (small) PowerDecode can be used (in our implementation).
it's dead-easy to do: just put filters onto a PowerDecode instance
(in this case "unit" from the CSV files) and voila, "if unit == ALU/LOGICAL"
gives you the information needed (right at the critical point)

at which point some *further* decode of elwidth is needed.

subvl on the other hand is dead-easy.

in the interests of not hampering max CPU speed i'm quite happy for
space to be "wasted" here.

which would make it:

   extra_immediates = MAXVL - 1
   extra_bytes = extra_immediates * 2 * subvl
   extra_words = -((-extra_bytes) // 4)  # ceil div
   NIA = CIA + 8 + 4 * extra_words
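as a runnable sketch (hypothetical function name) of the subvl-adjusted
formula above, keeping the fixed 16-bit immediate width:

```python
def nia_subvl_imms(cia, maxvl, subvl):
    """next instruction address: 16-bit immediates, scaled by subvl."""
    extra_immediates = maxvl - 1
    extra_bytes = extra_immediates * 2 * subvl  # 16-bit per sub-element
    extra_words = -((-extra_bytes) // 4)        # ceil division
    return cia + 8 + 4 * extra_words
```

with subvl=1 this degenerates to the earlier 16-bit-only formula, e.g.
CIA=0x10 MAXVL=2 gives 0x1c; subvl=3 on the same MAXVL needs 6 extra
bytes (2 words), giving 0x20.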

> XLEN = max(sw, dw)  # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit

exactly the kind of nightmare that will punish multi-issue :)
that would need *even more* decoding - now detecting FP-Arithmetic
as separate from Logical/ALU - just to work out how to get the elwidth

there is enough dependency already between prefix and suffix,
making both me and the ISA WG jittery.

> this allows trivially loading a vector of 64-bit immediates in one
> instruction -- better than any fli proposed so far.

remember that it is absolutely critical that the scalar instructions
remain orthogonal to "when Vectorised".

we *cannot* have "if Scalar then instruction means something else
if Vector it's different".

this is a HARD inviolate rule (where sv.bc is seriously pushing our luck
on that one, and the only way i can think to sell it is that bc is
"subset" behaviour of sv.bc)

what you are suggesting would involve *different* pseudocode for
*all* impacted instructions:

   if "sv.addi" then
       do something different from addi because the immediate is bigger
   else
       the v3.0/v3.1 addi pseudocode here

i just went through that with Paul, took ages to work out what i meant
https://bugs.libre-soc.org/show_bug.cgi?id=1056#c69

changes to the "meaning" of an instruction - requiring "if sv.xxx else"
i am putting my foot down HARD on that.

the consequence is that, with neither operands nor scalar-instruction
semantics being any different, the *same Decode* may be used for both
scalar and vector, and that's absolutely critical when it comes to
high-performance (speculative) parallel decode.


> decoding note: i expect cpus to generally treat a vector load immediate as a
> unconditional jump --

yyep.  which means it has to be REALLY simple.

> this means they don't try to read instructions after
> the load immediate in the same cycle as the load immediate so taking longer
> to decode the length is perfectly fine since the instruction start
> prefix-sum tree can just treat it as a 64-bit instruction and clear out all
> attempted instructions after it, leaving time for the full decoder to decode
> the correct length and redirect fetch to the correct location.

a neat trick:

parallel speculative decode can be carried out, and if constants are
misinterpreted as "instructions" they are skipped over once they are
identified.

everything can be done in parallel and the actual decision deferred.
if fetch is in 64-byte aligned chunks and performs some parallel
decode, then we have to be careful crossing that boundary:

Power v3.1 public Book I Section 1.6 p11 :

    Prefixed instructions do not cross 64-byte instruction
    address boundaries. When a prefixed instruction
    crosses a 64-byte boundary, the system alignment
    error handler is invoked.

so, assuming that the vector-immediate instruction is within such blocks,
if VL is ever greater than 31 we're "in trouble" and at *that* point
the scheme you describe would be activated, but otherwise some speculative
decode is perfectly fine.
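the "in trouble" threshold above can be sketched as a check of whether
the sv-prefixed instruction plus its trailing 16-bit constants stays
inside one 64-byte aligned fetch block (function name hypothetical):

```python
def fits_in_fetch_block(cia, maxvl, block=64):
    """does an sv.-prefixed op at `cia`, plus its MAXVL-1 trailing
    16-bit constants (rounded up to a word), stay inside one
    `block`-byte aligned chunk?"""
    start = cia % block
    extra_words = -((-(maxvl - 1) * 2) // 4)  # ceil division to words
    end = start + 8 + 4 * extra_words         # 8-byte prefixed insn
    return end <= block
```

with the instruction right at the start of a block, MAXVL up to 29
fits (8 + 56 = 64 bytes); anything more spills over, and the later
the instruction sits within the block, the lower that limit gets.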


> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.

awesome :)

> demo program:
> 0x08: ori r10, r10, 5
> 0x0c: and r10, r11, r10
> 0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0]  # vector immediate
        sv.addi/vi/w=32 ...
> 0x20: sv.add/w=32 *r3, *r3, *r3
> ...

this is a really nice illustrative example. needs expanding so that
it's clear that the 2nd immediate is in 0x18.  and setvl MAXVL=4? 8?

    0x10: PO9 sv prefix
    0x14:     addi (prefixed, contains SI=0x12345678)
    0x18: 0x0000_0000 0x0000_0000 0x0000_0000 0x9abcdef0
    0x1c: 0x0000_0000 0x0000_0000 0x0000_0000 0x0000_0000
    0x20: PO9 sv prefix
    0x24:     add *r3, *r3, *r3

ok so there's room there for up to 8 additional constants.
so MAXVL=9 is perfectly fine (on this example).

> demo pipeline with 64-bit fetch width

IBM has been doing 64 *byte* wide decode!!
(likely a clean aligned chunk of a L1 cache line)
fetch-width will be mad: the POWER9 and POWER10
have those OpenCAPI 25 gigabit SERDES (quantity lots!)
Comment 5 Luke Kenneth Casson Leighton 2023-06-11 14:37:39 BST
(In reply to Jacob Lifshay from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > it can be treated like a jump so the next instruction address gets added to
> > the branch target buffer and the next-pc logic will speculatively fetch from
> > the correct location on the next cycle, even before decoding has started.
> 
> actually, that doesn't work since it needs to fetch all the immediate bytes
> too, not just skip over them.

hence why i'm scared of introducing extra dependencies, increasing
gate-latency, in parallel-decode (such as elwidth, *especially* BF8/FP16
differences).

and why i said that this should only be in "advanced" implementations,
and/or at lower-clock-rate machines such as 3D GPUs, where ~1.5 ghz max
clock rate will not be a problem.
Comment 6 Luke Kenneth Casson Leighton 2023-06-22 12:08:01 BST
i have some ideas on this to reduce decode complexity.

1. overload sv.b (yes, really, sv.b) so that it sets a bit in
   SVSTATE "if the next instruction is prefixed then get
   the immediate-constants from CIA-MAXVL//2"
2. if that bit is set (which should be transient) then
   next instruction clears it.

a small unconditional relative-branch would leave a hole
in the instruction stream, and the nice thing is, it's an
*actual* branch not an implicit one.

widths of immediates are fixed at 16-bit and any unused bits
are IGNORED. no compression or attempts at variable-length
are made.
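the transient-bit lifecycle above, as a toy python model (class and
method names are illustrative only, not actual openpower-isa code):

```python
class VectorImmState:
    """toy model of the transient SVSTATE bit: sv.b arms it, and the
    very next instruction consumes and clears it."""

    def __init__(self):
        self.vec_imm_pending = False

    def exec_sv_b(self):
        # the small unconditional branch that hops over the hole of
        # immediate-constants also arms the transient bit
        self.vec_imm_pending = True

    def exec_next_insn(self):
        # returns whether this instruction should take its immediates
        # from the instruction stream; the bit is ALWAYS cleared here
        use_vector_imms = self.vec_imm_pending
        self.vec_imm_pending = False
        return use_vector_imms
```

the nice property this models is that only the one instruction
immediately following sv.b ever sees the bit set.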
Comment 7 Luke Kenneth Casson Leighton 2023-08-30 11:56:49 BST
   jacob=1000
   markos=300
   vantosh=300
   andrey=200
   lkcl=200

these budgets are not balanced given the amount of input i will
likely need into this task.  the budget was set at EUR 2,000 at the time
for one person (me) to complete this task relatively quickly.

there is to be no further discussion of potential ideas or
alternatives: this is an *implementation and delivery* task
best done by one person and reviewed by others.

i have updated the TODO list in comment #0, you can see how much
is involved.  best done as a branch.

do NOT attempt to suggest how REMAP should work here. that is a BIG
task on its own that needs significant thought AT THE RIGHT TIME
which is NOT NOW.
Comment 8 Luke Kenneth Casson Leighton 2023-08-30 12:04:15 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)

>     sv.addi/vi rt,0,#nnnn

NOT to be implemented this way.  see comment #6 which **IS** the
implementation.  the bit that is added to SVSTATE is to be added
similar to the "persistence" bit.  the FOLLOWING INSTRUCTION
and the following instruction ONLY shall have its immediates
extended to a Vector, implicitly. there shall be **NO** changes
to Normal/LDST/CRops Modes.

application of REMAP to Vector-Immediates shall be prohibited at
this time (but allowed as usual on registers used)
Comment 9 Luke Kenneth Casson Leighton 2023-08-30 14:15:34 BST
on second thoughts there *may* be a place to squeeze in documentation
budget from the OPF-ISA-WG Grant and leave the "implementation" side
here as outlined at the top of comment #17.