questions and feedback needed on ls002 actions: * TODO: remove "mv" replace with "m" * TODO: add table to RFC https://bugs.libre-soc.org/show_bug.cgi?id=944#c4 * TODO: fmis fmi2 fmi3 fmi4 https://bugs.libre-soc.org/show_bug.cgi?id=944#c5 * TODO: add note that fmi2/fishmv will change non-f32-in-f64 values, just like stfs * TODO: add note that fmi2 is a move-type instruction so never causes exceptions and is intended (along with fmis) as a replacement for f32 load from constant so therefore is capable of producing sNaN results. * TODO: redefine in terms of xxspltidp, see comment #20 Public-v3.1 Chap 7 p991
comments carried over: * m not mv - https://bugs.libre-soc.org/show_bug.cgi?id=944#c3 * requested twice "fli*" - https://bugs.libre-soc.org/show_bug.cgi?id=944#c5
original comment https://bugs.libre-soc.org/show_bug.cgi?id=944#c5 please reply here (In reply to Jacob Lifshay from https://bugs.libre-soc.org/show_bug.cgi?id=944#c5) > i think it would be appropriate to rename the instructions like so: > fmvis -> fmis (floating move immediate shifted, like addis) that would come with a corresponding expectation of... oh wait it does indeed shift the lower half up, doesn't it? that's a hilarious quirk of BF16. > fishmv -> either fmil (floating move immediate lower) or fmi2 (floating move > immediate 2nd-half/part-2 -- allows future extension for fmi3/fmi4 for full > bfp64 if desired) hmm hmm i thought initially "yep you don't wanna do that because of pfmis and pfmil", but i just realised that SVP64 would be punished for it [no 96-bit instructions allowed]. so yes i like it.
Has thought been given about what fishmv should do in exception conditions? What if FRS contains a Signalling NaN initially? Should fishmv generate a VXSNAN exception in that case? Or do you consider this to be a move-type instruction (like fmr) which never generates exceptions? Using SINGLE() and DOUBLE() in the RTL for fishmv means that overall the effect is like doing stfs, sth (to the lower 16 bits), followed by lfs. One of the things about stfs is that if the source FPR doesn't contain a valid single-precision value (for example, a finite value that is too large to be represented in single precision), the value stored to memory is essentially garbage (the upper bits of the exponent and lower bits of the mantissa are discarded). Are you OK with fishmv generating some result which is very different from the input value in this kind of case? (If so, a programming note warning about the possibility is probably appropriate.)
> Are you OK with fishmv generating some result which is very different from the input value > in this kind of case? (If so, a programming note warning about the possibility is probably > appropriate.) IMHO, the result should not be different from the input value. The idea behind the two instructions is to be able to load constants in a fast way. As you say it's a move-type instruction, if the user loads NaN/infinity values then we have to assume it's on purpose. Any exception should be thrown later, when an actual operation is involved. For that matter, if there was a unit test using NaN/infinity inputs for other functions (eg trig/exp/log) involving fmvis/fishmv to load the constants, I would expect the input values to remain unchanged.
i have a vague recollection of javascript engines in browsers actually using explicit NaNs for some purpose. basically yes direct explicit storage.
(In reply to Konstantinos Margaritis (markos) from comment #4) > > Are you OK with fishmv generating some result which is very different from the input value > > in this kind of case? (If so, a programming note warning about the possibility is probably > > appropriate.) yes, fishmv is intended to be used only with f32 values (or more specifically, all possible results of fmvis which is a subset of f32 values), if it isn't a valid f32-in-f64 then i'm fine with values changing because behaving like a bfp32 store/write 16-bits/bfp32 load is exactly what's intended. > IMHO, the result should not be different from the input value. i explained the misunderstanding in the meeting today, konstantinos was worried it would give different values than encoded in the immediates of a fmvis/fishmv pair, but it doesn't do that, it correctly handles all possible f32 bitpatterns.
(In reply to Paul Mackerras from comment #3) > Has thought been given about what fishmv should do in exception conditions? > What if FRS contains a Signalling NaN initially? Should fishmv generate a > VXSNAN exception in that case? Or do you consider this to be a move-type > instruction (like fmr) which never generates exceptions? this is a move-type instruction therefore never generates exceptions. this is specifically necessary because otherwise fmvis/fishmv wouldn't be able to encode all possible f32 sNaNs and end up with the exact same sNaN in the register. fmvis/fishmv is intended as a replacement for doing a f32 load from a constant, so being able to express sNaNs is necessary
(In reply to Jacob Lifshay from comment #7) > (In reply to Paul Mackerras from comment #3) > > Has thought been given about what fishmv should do in exception conditions? > > What if FRS contains a Signalling NaN initially? Should fishmv generate a > > VXSNAN exception in that case? Or do you consider this to be a move-type > > instruction (like fmr) which never generates exceptions? > > this is a move-type instruction therefore never generates exceptions. this > is specifically necessary because otherwise fmvis/fishmv wouldn't be able to > encode all possible f32 sNaNs and end up with the exact same sNaN in the > register. > > fmvis/fishmv is intended as a replacement for doing a f32 load from a > constant, so being able to express sNaNs is necessary Sounds reasonable. An explicit statement about that in the instruction description would be useful.
(In reply to Jacob Lifshay from comment #6) > (In reply to Konstantinos Margaritis (markos) from comment #4) > > > Are you OK with fishmv generating some result which is very different from the input value > > > in this kind of case? (If so, a programming note warning about the possibility is probably > > > appropriate.) > > yes, fishmv is intended to be used only with f32 values (or more > specifically, all possible results of fmvis which is a subset of f32 > values), if it isn't a valid f32-in-f64 then i'm fine with values changing > because behaving like a bfp32 store/write 16-bits/bfp32 load is exactly > what's intended. Right. Perhaps a programming note warning about the possibility could be added. There is nothing to stop people doing fishmv on any arbitrary double-precision value (nothing says you can only use it on the result of fmvis), and getting a value very different from the input value might be surprising. > > IMHO, the result should not be different from the input value. > > i explained the misunderstanding in the meeting today, konstantinos was > worried it would give different values than encoded in the immediates of a > fmvis/fishmv pair, but it doesn't do that, it correctly handles all possible > f32 bitpatterns. Right.
(In reply to Paul Mackerras from comment #8) > Sounds reasonable. An explicit statement about that in the instruction > description would be useful. (In reply to Paul Mackerras from comment #9) > Right. Perhaps a programming note warning about the possibility could be > added. There is nothing to stop people doing fishmv on any arbitrary > double-precision value (nothing says you can only use it on the result of > fmvis), and getting a value very different from the input value might be > surprising. added todos in the top comment for both of these
notes from meeting, in order to fit in po9 i think fmis/fmi2 should be replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr po9 frt xo imm32 frt <- DOUBLE(imm32)
(In reply to Jacob Lifshay from comment #11) > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr > > po9 frt xo imm32 > > frt <- DOUBLE(imm32) I totally agree with this suggestion. Currently in order to load a 64-bit constant, using fmis/fmi2, I have to load 4x16-bit immedates, do 2x shifts and 2x OR to produce the intermediate 32-bit values, then do another shift and OR to produce the final constant. Such an fmvi instruction will reduce this to just a single shift and OR.
(In reply to Jacob Lifshay from comment #11) > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr elwidth=FP16/FB16 has to be taken into consideration.
(In reply to Luke Kenneth Casson Leighton from comment #13) > (In reply to Jacob Lifshay from comment #11) > > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be > > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr > > elwidth=FP16/FB16 has to be taken into consideration. that can be done with: addi r3, 0, 0x1234 sv.fmtg/w=f16 *f5, r3
(In reply to Jacob Lifshay from comment #14) > (In reply to Luke Kenneth Casson Leighton from comment #13) > > (In reply to Jacob Lifshay from comment #11) > > > notes from meeting, in order to fit in po9 i think fmis/fmi2 should be > > > replaced with 1 64-bit non-vectorizable instruction that loads a f32 to fpr > > > > elwidth=FP16/FB16 has to be taken into consideration. > > that can be done with: > addi r3, 0, 0x1234 > sv.fmtg/w=f16 *f5, r3 alternatively fmi could be a 64-bit scalar instruction with a 32-bit immediate and a 2-bit mode field: fmi frt, mode, imm32 if mode = 0b00 then FRT <- DOUBLE(imm32) # load f32 as f64 else if mode = 0b01 then # load full f64 FRT <- SINGLE((FRT)) || imm32 else if mode = 0b10 then # load f32/f16/bf16 in low half FRT <- [0] * 32 || imm32 load f32 as f64: fmi f3, 0, 0x12345678 # loads DOUBLE(0x12345678) load f64: fmi f3, 0, 0x12345678 fmi f3, 1, 0x9abcdef0 # loads 0x123456789abcdef0 load f32/f16/bf16: fmi f3, 2, 0x12345678 # loads 0x0000000012345678
(In reply to Jacob Lifshay from comment #15) > if mode = 0b00 then > FRT <- DOUBLE(imm32) # load f32 as f64 > else if mode = 0b01 then > # load full f64 > FRT <- SINGLE((FRT)) || imm32 > else if mode = 0b10 then > # load f32/f16/bf16 in low half > FRT <- [0] * 32 || imm32 else if mode = 0b11 then # load repeating pattern -- useful for subvectors or other misc. stuff. FRT <- imm32 || imm32
(In reply to Jacob Lifshay from comment #14) > > elwidth=FP16/FB16 has to be taken into consideration. > > that can be done with: > addi r3, 0, 0x1234 > sv.fmtg/w=f16 *f5, r3 that's 3x 32-bit words instead of 2x 32-bit words, for both SVP64 and SVP64Single (and it wastes a GPR). how can elwidth=FP16/BF16 be supported *by fmi*. (including SVP64Single/fmi)
(EDIT: continue in bug #1116) https://libre-soc.org/openpower/sv/normal/ | 0-1 | 2 | 3 4 | description | | ------ | --- |---------|----------------------------------| | 0 0 | 0 | dz sz | simple mode | | 0 0 | 1 | RG 0 | scalar reduce mode (mapreduce) | | 0 0 | 1 | / 1 | reserved | | 1 0 | N | dz sz | sat mode: N=0/1 u/s | | VLi 1 | inv | CR-bit | Rc=1: ffirst CR sel | | VLi 1 | inv | zz RC1 | Rc=0: ffirst z/nonz | there's room in that (just) for a bit that says "immediates are Vectorised". ok: using mode[4] says "immediates are Vectorised". that still leaves mode[3] for some sort of decision. the neat thing about this is that even sv.addi can load an array of immediates. oris as well. as we discussed yesterday it requires an "Unconditional Branch" effect, and i'd recommend it be on MAXVL not VL. also to round-up to the nearest 4-bytes. if RM."immediate-mode": NIA = CIA + CEIL(MAXVL * sizeof(immediate), 4) jacob you mentioned during the meeting that this would be "slow" i.e. dependent on Architectural State (SVSTATE), if someone modified SVSTATE with mtspr then things get slow: this is *already* in the spec.
(In reply to Jacob Lifshay from comment #15) > alternatively fmi could be a 64-bit scalar instruction with a 32-bit > immediate given that: * the process for justifying 64-bit instructions (EXT1xx) is very hard * to add this proposal it would be necessary to squeeze scalar (Unvectorized) into PO9 * that is more than PO9 can bear (already close to the encoding limit) * multi-issue, remember: the Apple M1 sustains 100% because all instructions are 32-bit. we have to find another way. i would be fine with fli2 being a "shift and append", and even reducing down to just the one instruction fli where if FRA=0 it causes a reset (clearing) of FRT. prevstuff <- (FRA|0) FRT <- prevstuff << 16 || imm and be done with it. repeated calls DoTheRightThing. does this sound sane?
Here is a suggestion which you might like to consider, and which might be more easily accepted by the IBM architects (no guarantees, I don't know what they will think of it). If you look at xxspltidp, it is actually quite close to what you want. It is a prefixed instruction (using a PO1 prefix) which has 32 bits of immediate field containing a single-precision value, which is converted to double precision and stored into the two halves of a VSR. My suggestion is to define a "floating load immediate single-precision" (flis) instruction which differs from xxspltidp in only one bit in its instruction encoding, specifically that bit 12 of the suffix would be 1 in flis vs. 0 in xxspltidp. The difference in behaviour would be that flis would test MSR.FP rather than MSR.VSX, it would only address the FPRs (so bit 15 of the suffix is always 0), and it would only write one value (vs. two copies for xxspltidp). This should add minimal extra work for implementations which have VSX already, and it wouldn't use up any PO9 opcode space.
(In reply to Paul Mackerras from comment #20) you accidentally removed payment info, i restored it.
(In reply to Paul Mackerras from comment #20) > My suggestion is to define a "floating load immediate single-precision" > (flis) instruction which differs from xxspltidp in only one bit in its > instruction encoding, > This should add minimal extra work for implementations which have VSX > already, and it wouldn't use up any PO9 opcode space. that sounds like a good idea to me! though, one minor issue I have is the lack of support for subnormal values implied by being based on xxspltidp which declares subnormals to be UB.
(In reply to Paul Mackerras from comment #20) > Here is a suggestion which you might like to consider, and which might be > more easily accepted by the IBM architects (no guarantees, I don't know what > they will think of it). appreciated > If you look at xxspltidp, it is actually quite close to what you want. this is a good goal to have, to make SFFS stand-alone independent, but also as you suggest bear in mind that the intention was to fit VRs on top of FP and "merge" the two groups of instructions, there will have been a lot of thought gone into them. we need to "extract" that merging with the least disruption, this is a valuable insight, thank you Paul. > It is > a prefixed instruction (using a PO1 prefix) which has 32 bits of immediate > field containing a single-precision value, which is converted to double > precision and stored into the two halves of a VSR. > > My suggestion is to define a "floating load immediate single-precision" > (flis) instruction which differs from xxspltidp in only one bit in its > instruction encoding, specifically that bit 12 of the suffix would be 1 in > flis vs. 0 in xxspltidp. ahhh. minimise gates. > The difference in behaviour would be that flis > would test MSR.FP rather than MSR.VSX, it would only address the FPRs (so > bit 15 of the suffix is always 0), and it would only write one value (vs. > two copies for xxspltidp). > > This should add minimal extra work for implementations which have VSX > already, agreed reducing Decode gates and also implementation work is really important. > and it wouldn't use up any PO9 opcode space. this is ironically the one instruction that cannot go into PO9 because it is so crucial to SFFS it is always always crucial to treat PO9 as if it was an entirely independent 32-bit instruction.
quick note: see bug #1116, the idea there is to make it possible for ALL immediate instructions to load a Vector of immediates (*from I-Cache* not D-Cache). (In reply to Luke Kenneth Casson Leighton from comment #19) > i would be fine with fli2 being a "shift and append", and even reducing > down to just the one instruction fli where if FRA=0 it causes a > reset (clearing) of FRT. > > prevstuff <- (FRA|0) > FRT <- prevstuff << 16 || imm i just realised it needs to be.... prevstuff <- (FRA|0) FRT <- imm || prevstuff >> 16 because the immediates are BF16 encoding on first use. is that right? damn no it isn't. it'll have to be: prevstuff <- (FRA|0) prevstuff <- SINGLE2DOUBLE(prevstuff) # optionally result <- imm || prevstuff >> 16 result <- SINGLE2DOUBLE(result) # optionally FRT <- result that may not be accurate, is it at least clear what i am trying to communicate? that: * BF16 with one flis should get loaded in with just the one instruction and *still be useful* * FP32 should be two flis instructions * FP64 with the low mantissa bits all zero should be 3 flis ops * full FP64 should be with four flis ops. no additional instructions, no exceptions raised or required. i have a sneaking suspicion that one extra bit is needed to get the last two steps, i.e. to stop the DOUBLE2SINGLE and SINGLE2DOUBLE conversion.
like all the other bugs in this series (bug #1012) the results of the milestone is the discussion itself (on this bugreport). as this has been done, and feedback received, this bug is being closed and a new bug #1191 opened for continuation of the work. with thanks to Paul Mackerras for coordinating the feedback from IBM and for your insights. our conclusion is that these instructions need to be merged into one which can be repeated, and also extended by the vector-immediate SBP64 extension.