Bug 1092 - OPF RFC ISA WG questions feedback on ls002 float-load-immediate
Summary: OPF RFC ISA WG questions feedback on ls002 float-load-immediate
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/rf...
Depends on: 944
Blocks:
  Show dependency treegraph
 
Reported: 2023-05-26 21:53 BST by Luke Kenneth Casson Leighton
Modified: 2024-03-01 00:16 GMT (History)
6 users (show)

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 1012
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
donated = { amount = 750, submitted = 2024-01-16, paid = 2024-01-25 } lkcl = { amount = 750, submitted = 2024-01-11, paid = 2024-01-29 } [jacob] amount = 500 submitted = 2024-02-09 paid = 2024-02-29


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2023-05-26 21:53:59 BST
questions and feedback needed on ls002

actions:

* TODO: remove "mv" replace with "m"
* TODO: add table to RFC https://bugs.libre-soc.org/show_bug.cgi?id=944#c4
* TODO: fmis fmi2 fmi3 fmi4 https://bugs.libre-soc.org/show_bug.cgi?id=944#c5
* TODO: add note that fmi2/fishmv will change non-f32-in-f64 values, just like stfs
* TODO: add note that fmi2 is a move-type instruction so never causes exceptions and is intended (along with fmis) as a replacement for f32 load from constant so therefore is capable of producing sNaN results.
* TODO: redefine in terms of xxspltidp, see comment #20
  Public-v3.1 Chap 7 p991
Comment 1 Luke Kenneth Casson Leighton 2023-05-26 21:59:13 BST
comments carried over:

* m not mv - https://bugs.libre-soc.org/show_bug.cgi?id=944#c3
* requested twice "fli*" - https://bugs.libre-soc.org/show_bug.cgi?id=944#c5
Comment 2 Luke Kenneth Casson Leighton 2023-05-26 22:48:29 BST
original comment https://bugs.libre-soc.org/show_bug.cgi?id=944#c5
please reply here

(In reply to Jacob Lifshay from https://bugs.libre-soc.org/show_bug.cgi?id=944#c5)

> i think it would be appropriate to rename the instructions like so:
> fmvis -> fmis (floating move immediate shifted, like addis)

that would come with a corresponding expectation of... oh wait
it does indeed shift the lower half up, doesn't it? that's a
hilarious quirk of BF16.

> fishmv -> either fmil (floating move immediate lower) or fmi2 (floating move
> immediate 2nd-half/part-2 -- allows future extension for fmi3/fmi4 for full
> bfp64 if desired)

hmm hmm i thought initially "yep you don't wanna do that because of
pfmis and pfmil", but i just realised that SVP64 would be punished
for it [no 96-bit instructions allowed].

so yes i like it.
Comment 3 Paul Mackerras 2023-05-31 09:13:32 BST
Has thought been given about what fishmv should do in exception conditions? What if FRS contains a Signalling NaN initially? Should fishmv generate a VXSNAN exception in that case? Or do you consider this to be a move-type instruction (like fmr) which never generates exceptions?

Using SINGLE() and DOUBLE() in the RTL for fishmv means that overall the effect is like doing stfs, sth (to the lower 16 bits), followed by lfs. One of the things about stfs is that if the source FPR doesn't contain a valid single-precision value (for example, a finite value that is too large to be represented in single precision), the value stored to memory is essentially garbage (the upper bits of the exponent and lower bits of the mantissa are discarded).

Are you OK with fishmv generating some result which is very different from the input value in this kind of case? (If so, a programming note warning about the possibility is probably appropriate.)
Comment 4 Konstantinos Margaritis (markos) 2023-05-31 09:21:56 BST
> Are you OK with fishmv generating some result which is very different from the input value
> in this kind of case? (If so, a programming note warning about the possibility is probably
> appropriate.)

IMHO, the result should not be different from the input value. The idea behind the two instructions is to be able to load constants in a fast way. As you say it's a move-type instruction, if the user loads NaN/infinity values then we have to assume it's on purpose. Any exception should be thrown later, when an actual operation is involved. For that matter, if there was a unit test using NaN/infinity inputs for other functions (eg trig/exp/log) involving fmvis/fishmv to load the constants, I would expect the input values to remain unchanged.
Comment 5 Luke Kenneth Casson Leighton 2023-05-31 21:40:18 BST
i have a vague recollection of javascript engines in browsers actually using explicit NaNs for some purpose.

basically yes direct explicit storage.
Comment 6 Jacob Lifshay 2023-05-31 22:15:50 BST
(In reply to Konstantinos Margaritis (markos) from comment #4)
> > Are you OK with fishmv generating some result which is very different from the input value
> > in this kind of case? (If so, a programming note warning about the possibility is probably
> > appropriate.)

yes, fishmv is intended to be used only with f32 values (or more specifically, all possible results of fmvis which is a subset of f32 values), if it isn't a valid f32-in-f64 then i'm fine with values changing because behaving like a bfp32 store/write 16-bits/bfp32 load is exactly what's intended.

> IMHO, the result should not be different from the input value.

i explained the misunderstanding in the meeting today, konstantinos was worried it would give different values than encoded in the immediates of a fmvis/fishmv pair, but it doesn't do that, it correctly handles all possible f32 bitpatterns.
Comment 7 Jacob Lifshay 2023-05-31 22:32:44 BST
(In reply to Paul Mackerras from comment #3)
> Has thought been given about what fishmv should do in exception conditions?
> What if FRS contains a Signalling NaN initially? Should fishmv generate a
> VXSNAN exception in that case? Or do you consider this to be a move-type
> instruction (like fmr) which never generates exceptions?

this is a move-type instruction therefore never generates exceptions. this is specifically necessary because otherwise fmvis/fishmv wouldn't be able to encode all possible f32 sNaNs and end up with the exact same sNaN in the register.

fmvis/fishmv is intended as a replacement for doing a f32 load from a constant, so being able to express sNaNs is necessary
Comment 8 Paul Mackerras 2023-05-31 23:36:40 BST
(In reply to Jacob Lifshay from comment #7)
> (In reply to Paul Mackerras from comment #3)
> > Has thought been given about what fishmv should do in exception conditions?
> > What if FRS contains a Signalling NaN initially? Should fishmv generate a
> > VXSNAN exception in that case? Or do you consider this to be a move-type
> > instruction (like fmr) which never generates exceptions?
> 
> this is a move-type instruction therefore never generates exceptions. this
> is specifically necessary because otherwise fmvis/fishmv wouldn't be able to
> encode all possible f32 sNaNs and end up with the exact same sNaN in the
> register.
> 
> fmvis/fishmv is intended as a replacement for doing a f32 load from a
> constant, so being able to express sNaNs is necessary

Sounds reasonable. An explicit statement about that in the instruction description would be useful.
Comment 9 Paul Mackerras 2023-05-31 23:40:16 BST
(In reply to Jacob Lifshay from comment #6)
> (In reply to Konstantinos Margaritis (markos) from comment #4)
> > > Are you OK with fishmv generating some result which is very different from the input value
> > > in this kind of case? (If so, a programming note warning about the possibility is probably
> > > appropriate.)
> 
> yes, fishmv is intended to be used only with f32 values (or more
> specifically, all possible results of fmvis which is a subset of f32
> values), if it isn't a valid f32-in-f64 then i'm fine with values changing
> because behaving like a bfp32 store/write 16-bits/bfp32 load is exactly
> what's intended.

Right. Perhaps a programming note warning about the possibility could be added. There is nothing to stop people doing fishmv on any arbitrary double-precision value (nothing says you can only use it on the result of fmvis), and getting a value very different from the input value might be surprising.

> > IMHO, the result should not be different from the input value.
> 
> i explained the misunderstanding in the meeting today, konstantinos was
> worried it would give different values than encoded in the immediates of a
> fmvis/fishmv pair, but it doesn't do that, it correctly handles all possible
> f32 bitpatterns.

Right.
Comment 10 Jacob Lifshay 2023-05-31 23:47:35 BST
(In reply to Paul Mackerras from comment #8)
> Sounds reasonable. An explicit statement about that in the instruction
> description would be useful.

(In reply to Paul Mackerras from comment #9)
> Right. Perhaps a programming note warning about the possibility could be
> added. There is nothing to stop people doing fishmv on any arbitrary
> double-precision value (nothing says you can only use it on the result of
> fmvis), and getting a value very different from the input value might be
> surprising.

added todos in the top comment for both of these
Comment 11 Jacob Lifshay 2023-06-06 21:05:53 BST
notes from meeting, in order to fit in po9 i think fmis/fmi2 should be replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr

po9 frt xo imm32

frt <- DOUBLE(imm32)
Comment 12 Konstantinos Margaritis (markos) 2023-06-06 21:09:23 BST
(In reply to Jacob Lifshay from comment #11)
> notes from meeting, in order to fit in po9 i think fmis/fmi2 should be
> replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr
> 
> po9 frt xo imm32
> 
> frt <- DOUBLE(imm32)

I totally agree with this suggestion.
Currently in order to load a 64-bit constant, using fmis/fmi2, I have to load 4x16-bit immedates, do 2x shifts and 2x OR to produce the intermediate 32-bit values, then do another shift and OR to produce the final constant. Such an fmvi instruction will reduce this to just a single shift and OR.
Comment 13 Luke Kenneth Casson Leighton 2023-06-07 03:18:12 BST
(In reply to Jacob Lifshay from comment #11)
> notes from meeting, in order to fit in po9 i think fmis/fmi2 should be
> replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr

elwidth=FP16/FB16 has to be taken into consideration.
Comment 14 Jacob Lifshay 2023-06-07 03:28:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #11)
> > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be
> > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr
> 
> elwidth=FP16/FB16 has to be taken into consideration.

that can be done with:
addi r3, 0, 0x1234
sv.fmtg/w=f16 *f5, r3
Comment 15 Jacob Lifshay 2023-06-07 03:41:52 BST
(In reply to Jacob Lifshay from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > (In reply to Jacob Lifshay from comment #11)
> > > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be
> > > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr
> > 
> > elwidth=FP16/FB16 has to be taken into consideration.
> 
> that can be done with:
> addi r3, 0, 0x1234
> sv.fmtg/w=f16 *f5, r3

alternatively fmi could be a 64-bit scalar instruction with a 32-bit immediate and a 2-bit mode field:
fmi frt, mode, imm32

if mode = 0b00 then
    FRT <- DOUBLE(imm32) # load f32 as f64
else if mode = 0b01 then
    # load full f64
    FRT <- SINGLE((FRT)) || imm32
else if mode = 0b10 then
    # load f32/f16/bf16 in low half
    FRT <- [0] * 32 || imm32

load f32 as f64:
fmi f3, 0, 0x12345678  # loads DOUBLE(0x12345678)

load f64:
fmi f3, 0, 0x12345678
fmi f3, 1, 0x9abcdef0  # loads 0x123456789abcdef0

load f32/f16/bf16:
fmi f3, 2, 0x12345678  # loads 0x0000000012345678
Comment 16 Jacob Lifshay 2023-06-07 03:45:20 BST
(In reply to Jacob Lifshay from comment #15)
> if mode = 0b00 then
>     FRT <- DOUBLE(imm32) # load f32 as f64
> else if mode = 0b01 then
>     # load full f64
>     FRT <- SINGLE((FRT)) || imm32
> else if mode = 0b10 then
>     # load f32/f16/bf16 in low half
>     FRT <- [0] * 32 || imm32
else if mode = 0b11 then
    # load repeating pattern -- useful for subvectors or other misc. stuff.
    FRT <- imm32 || imm32
Comment 17 Luke Kenneth Casson Leighton 2023-06-07 18:22:46 BST
(In reply to Jacob Lifshay from comment #14)

> > elwidth=FP16/FB16 has to be taken into consideration.
> 
> that can be done with:
> addi r3, 0, 0x1234
> sv.fmtg/w=f16 *f5, r3

that's 3x 32-bit words instead of 2x 32-bit words,
for both SVP64 and SVP64Single (and it wastes a GPR).

how can elwidth=FP16/BF16 be supported *by fmi*.
(including SVP64Single/fmi)
Comment 18 Luke Kenneth Casson Leighton 2023-06-07 18:37:22 BST
(EDIT: continue in bug #1116)

https://libre-soc.org/openpower/sv/normal/

| 0-1    |  2  |  3   4  |  description                     |
| ------ | --- |---------|----------------------------------|
| 0   0  |   0 |  dz  sz | simple mode                      |
| 0   0  |   1 |  RG  0  | scalar reduce mode (mapreduce)   |
| 0   0  |   1 |  /   1  | reserved                         |
| 1   0  |   N | dz   sz |  sat mode: N=0/1 u/s             |
| VLi 1  | inv | CR-bit  | Rc=1: ffirst CR sel              |
| VLi 1  | inv | zz RC1  | Rc=0: ffirst z/nonz              |

there's room in that (just) for a bit that says
"immediates are Vectorised".  ok: using mode[4]
says "immediates are Vectorised".

that still leaves mode[3] for some sort of decision.

the neat thing about this is that even sv.addi can load
an array of immediates.  oris as well.

as we discussed yesterday it requires an "Unconditional
Branch" effect, and i'd recommend it be on MAXVL not VL.
also to round-up to the nearest 4-bytes.

if RM."immediate-mode":

    NIA = CIA + CEIL(MAXVL * sizeof(immediate), 4)

jacob you mentioned during the meeting that this would
be "slow" i.e. dependent on Architectural State (SVSTATE),
if someone modified SVSTATE with mtspr then things get
slow: this is *already* in the spec.
Comment 19 Luke Kenneth Casson Leighton 2023-06-10 02:06:43 BST
(In reply to Jacob Lifshay from comment #15)

> alternatively fmi could be a 64-bit scalar instruction with a 32-bit
> immediate

given that:

* the process for justifying 64-bit instructions (EXT1xx) is very hard
* to add this proposal it would be necessary to squeeze scalar
  (Unvectorized) into PO9
* that is more than PO9 can bear (already close to the encoding limit)
* multi-issue, remember: the Apple M1 sustains 100% because all
  instructions are 32-bit.

we have to find another way.

i would be fine with fli2 being a "shift and append", and even reducing
down to just the one instruction fli where if FRA=0 it causes a
reset (clearing) of FRT.

    prevstuff <- (FRA|0)
    FRT <- prevstuff << 16 || imm

and be done with it.

repeated calls DoTheRightThing.  does this sound sane?
Comment 20 Paul Mackerras 2023-06-20 02:32:35 BST
Here is a suggestion which you might like to consider, and which might be more easily accepted by the IBM architects (no guarantees, I don't know what they will think of it).

If you look at xxspltidp, it is actually quite close to what you want. It is a prefixed instruction (using a PO1 prefix) which has 32 bits of immediate field containing a single-precision value, which is converted to double precision and stored into the two halves of a VSR.

My suggestion is to define a "floating load immediate single-precision" (flis) instruction which differs from xxspltidp in only one bit in its instruction encoding, specifically that bit 12 of the suffix would be 1 in flis vs. 0 in xxspltidp. The difference in behaviour would be that flis would test MSR.FP rather than MSR.VSX, it would only address the FPRs (so bit 15 of the suffix is always 0), and it would only write one value (vs. two copies for xxspltidp).

This should add minimal extra work for implementations which have VSX already, and it wouldn't use up any PO9 opcode space.
Comment 21 Jacob Lifshay 2023-06-20 02:49:18 BST
(In reply to Paul Mackerras from comment #20)

you accidentally removed payment info, i restored it.
Comment 22 Jacob Lifshay 2023-06-20 03:10:09 BST
(In reply to Paul Mackerras from comment #20)
> My suggestion is to define a "floating load immediate single-precision"
> (flis) instruction which differs from xxspltidp in only one bit in its
> instruction encoding,

> This should add minimal extra work for implementations which have VSX
> already, and it wouldn't use up any PO9 opcode space.

that sounds like a good idea to me! though, one minor issue I have is the lack of support for subnormal values implied by being based on xxspltidp which declares subnormals to be UB.
Comment 23 Luke Kenneth Casson Leighton 2023-06-22 11:43:28 BST
(In reply to Paul Mackerras from comment #20)
> Here is a suggestion which you might like to consider, and which might be
> more easily accepted by the IBM architects (no guarantees, I don't know what
> they will think of it).

appreciated

> If you look at xxspltidp, it is actually quite close to what you want.

this is a good goal to have, to make SFFS stand-alone independent,
but also as you suggest bear in mind that the intention was to
fit VRs on top of FP and "merge" the two groups of instructions,
there will have been a lot of thought gone into them.

we need to "extract" that merging with the least disruption, this is
a valuable insight, thank you Paul.

> It is
> a prefixed instruction (using a PO1 prefix) which has 32 bits of immediate
> field containing a single-precision value, which is converted to double
> precision and stored into the two halves of a VSR.
> 
> My suggestion is to define a "floating load immediate single-precision"
> (flis) instruction which differs from xxspltidp in only one bit in its
> instruction encoding, specifically that bit 12 of the suffix would be 1 in
> flis vs. 0 in xxspltidp.

ahhh. minimise gates.

> The difference in behaviour would be that flis
> would test MSR.FP rather than MSR.VSX, it would only address the FPRs (so
> bit 15 of the suffix is always 0), and it would only write one value (vs.
> two copies for xxspltidp).
> 
> This should add minimal extra work for implementations which have VSX
> already,

agreed reducing Decode gates and also implementation work is really
important.

> and it wouldn't use up any PO9 opcode space.

this is ironically the one instruction that cannot go into PO9 because
it is so crucial to SFFS

it is always always crucial to treat PO9 as if it was an entirely
independent 32-bit instruction.
Comment 24 Luke Kenneth Casson Leighton 2023-06-22 12:02:20 BST
quick note: see bug #1116, the idea there is to make it possible
for ALL immediate instructions to load a Vector of immediates
(*from I-Cache* not D-Cache).

(In reply to Luke Kenneth Casson Leighton from comment #19)

> i would be fine with fli2 being a "shift and append", and even reducing
> down to just the one instruction fli where if FRA=0 it causes a
> reset (clearing) of FRT.
> 
>     prevstuff <- (FRA|0)
>     FRT <- prevstuff << 16 || imm

i just realised it needs to be....

     prevstuff <- (FRA|0)
     FRT <- imm || prevstuff >> 16

because the immediates are BF16 encoding on first use.  is that right?
damn no it isn't.  it'll have to be:

     prevstuff <- (FRA|0)
     prevstuff <- SINGLE2DOUBLE(prevstuff)        # optionally
     result <- imm || prevstuff >> 16
     result <- SINGLE2DOUBLE(result)              # optionally
     FRT <- result

that may not be accurate, is it at least clear what i am trying to
communicate?  that:

* BF16 with one flis should get loaded in with just the one instruction
  and *still be useful*
* FP32 should be two flis instructions
* FP64 with the low mantissa bits all zero should be 3 flis ops
* full FP64 should be with four flis ops.

no additional instructions, no exceptions raised or required.

i have a sneaking suspicion that one extra bit is needed to get the
last two steps, i.e. to stop the DOUBLE2SINGLE and SINGLE2DOUBLE
conversion.
Comment 25 Luke Kenneth Casson Leighton 2023-10-17 13:26:21 BST
like all the other bugs in this series (bug #1012) the results of the milestone
is the discussion itself (on this bugreport). as this has been done, and
feedback received, this bug is being closed and a new bug #1191 opened for
continuation of the work.

with thanks to Paul Mackerras for coordinating the feedback from IBM
and for your insights.  our conclusion is that these instructions need
to be merged into one which can be repeated, and also extended by the
vector-immediate SBP64 extension.