external RFC for bigint instructions. report needed on completion. WARNING, SOME OF THIS WORK PRE-DATES 25 OCT 2022 AND CANNOT RECEIVE FUNDING UNDER 2022-08-051. unfortunately. work dated after 25oct2022 however can.
https://libre-soc.org/irclog/%23libre-soc.2022-10-20.log.html#t2022-10-20T20:12:02 lkcl octavius, the language of the RFC needs to be more "x is added to y and placed into z" 20:12 lkcl the wording of maddld (etc) can be used as a template 20:12 lkcl btw we need a bugreport for it 20:13 lkcl https://bugs.libre-soc.org/show_bug.cgi?id=944 is for ls002 20:13 lkcl so, maddhd says: 20:14 lkcl "The 64-bit operands are (RA), (RB), and (RC). The 20:14 lkcl 128-bit product of the operands (RA) and (RB) is 20:14 lkcl added to (RC). The high-order 64 bits of the 128-bit 20:14 lkcl sum are placed into register RT. 20:14 lkcl " 20:14 lkcl therefore, we literally cut-and-paste that text 20:14 lkcl and 20:14 lkcl add 20:14 lkcl "The low-order 64-bits of the 128-bit sum are placed into register RC" 20:15 lkcl also, you removed the tag 20:17 lkcl which ensures that ls003 is missing from this auto-generated page: https://libre-soc.org/openpower/sv/rfc/ 20:17
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls003.mdwn;h=1f49463ff167180354770ceb5ddc100de7e41edd;hb=6e6902d0bfa700ccab5d640cb55d6867eff52547#l129 129 RS is implictly defined as the same register as RC. 130 131 *Programmer's Note: 132 As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1. RS being defined as the same register as RC when non-svp64-prefixed is *not* how RS is defined currently, currently RS is defined to be RT+1. https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger.mdwn;h=ca6065eb035f370b0d45c1e8e62bf783eb33023d;hb=6e6902d0bfa700ccab5d640cb55d6867eff52547#l163 If we want to change the definition where RS=RC even without svp64, that's fine with me, but that should be done everywhere, rather than just in the rfc. Also, that should have been discussed on the mailing list or bug-tracker (if it was, sorry i forgot). If we don't want to change RS=RT+1, then all mentions of RS=RC need to be deleted from the rfc, except when describing svp64's behavior or if describing an alternative to what we currently have.
(In reply to Luke Kenneth Casson Leighton from comment #1) > lkcl octavius, the language of the RFC needs to be more "x is added to y and > placed into z" 20:12 Thanks for the comment, added the descriptions based on the PowerISA spec. I left the description of the 'divmod2du' as well, as it has additional explanation (somewhat redundant perhaps, but has the additional background). https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=d27c0ce4fb5360c6893ae982ca84e706c63ec3bf https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5a3a995ab07354c6524ed6a880c97e5205e0046f > lkcl also, you removed the tag 20:17 > lkcl which ensures that ls003 is missing from this auto-generated page: > https://libre-soc.org/openpower/sv/rfc/ 20:17 Yeah, added that back in, didn't know what it was for before. Do I need to run the Makefile in sv/rfc? My pandoc (version 2.2.1) threw an error when trying to generate: --normalize has been removed. Normalization is now automatic. --smart/-S has been removed. Use +smart or -smart extension instead. For example: pandoc -f markdown+smart -t markdown-smart. Try pandoc --help for more information. make: *** [Makefile:4: ls001.pdf] Error 2 (In reply to Jacob Lifshay from comment #2) > If we don't want to change RS=RT+1, then all mentions of RS=RC need to be > deleted from the rfc, except when describing svp64's behavior or if > describing an alternative to what we currently have. This change also threw me off. Didn't see such a definition before.
(In reply to Andrey Miroshnikov from comment #3) > For example: pandoc -f markdown+smart -t markdown-smart. > Try pandoc --help for more information. > make: *** [Makefile:4: ls001.pdf] Error 2 Just figured it out, I was trying to use "-smart" as a separate option. Using "-f markdown-smart" and removing --smart and --normalize (default in this pandoc version) fixed it for me.
oh, also, i just noticed: do we wwnt to have xlen in the rfc? if not, it needs to be replaced with 64. iirc we haven't proposed xlen yet, so imho we probably shouldn't start with this rfc.
(In reply to Jacob Lifshay from comment #5) > oh, also, i just noticed: do we wwnt to have xlen in the rfc? if not, it > needs to be replaced with 64. iirc we haven't proposed xlen yet, so imho we > probably shouldn't start with this rfc. yep good point, that needs to be a MASSIVE one-hit RFC as a patch when the git repo is made public (eventually)
if ls003 isn't submitted yet, we should add maddedus (unsigned bigint * signed word). https://bugs.libre-soc.org/show_bug.cgi?id=817#c56 and also potentially both muledsbu (signed bigint * unsigned word) and muledsbs (signed bigint * signed word) if it can be made to work with signed bigints -- i have an idea for that...more about that below: i was thinking about what I said in https://bugs.libre-soc.org/show_bug.cgi?id=817#c57 and think I have a solution -- basically corrections could be applied to RC to account for the previous multiplication being signed when it shouldn't -- not just adding RC, hence not naming madd*. if encoding space is an issue, imho Rc should be removed from dsld/dsrd as that is imho much less useful. all 4 sign combinations of mule/madde are needed because only the msb word in a signed bigint is signed, lower words are always unsigned. signed a * signed b: a lsb word 0 word 1 word 2 msb word 0 u*u u*u s*u sv.muledsbu b word 1 u*u u*u s*u sv.muledsbu word 2 u*s u*s s*s sv.muledsbs msb unsigned a * signed b: a lsb word 0 word 1 word 2 msb word 0 u*u u*u u*u sv.maddedu b word 1 u*u u*u u*u sv.maddedu word 2 u*s u*s u*s sv.maddedus msb unsigned a * unsigned b: a lsb word 0 word 1 word 2 msb word 0 u*u u*u u*u sv.maddedu b word 1 u*u u*u u*u sv.maddedu word 2 u*u u*u u*u sv.maddedu msb
(In reply to Jacob Lifshay from comment #7) > if ls003 isn't submitted yet, we should add maddedus (unsigned bigint * > signed word). > https://bugs.libre-soc.org/show_bug.cgi?id=817#c56 > and also potentially both muledsbu (signed bigint * unsigned word) and > muledsbs (signed bigint * signed word) if it can be made to work with signed > bigints -- i have an idea for that...more about that below: > > i was thinking about what I said in > https://bugs.libre-soc.org/show_bug.cgi?id=817#c57 and think I have a > solution -- basically corrections could be applied to RC to account for the > previous multiplication being signed when it shouldn't -- not just adding > RC, hence not naming madd*. i figured out the correct code for muledsbu: muledsbu RT, RA, RB, RC # correct for RC being high half of signed-unsigned mul # when we need high half of unsigned mul v[0:XLEN*2-1] <- [0] * XLEN || (RC) if (RC)[0] != 0 then v[XLEN:XLEN*2-1] <- (RB) + (RC) # note only lower XLEN bits <!-- no MULUS, so do it manually --> prod[0:XLEN*2-1] <- [0] * (XLEN * 2) if (RA)[0] != 0 then prod[0:XLEN*2-1] <- -(-(RA) * (RB)) else prod[0:XLEN*2-1] <- (RA) * (RB) v <- v + prod RT <- v[XLEN:XLEN*2-1] RS <- v[0:XLEN-1] i'll figure out muledsbs sunday or monday.
table 13 p1353 1357 1358 Public v3.1, there are 12 spaces available. using a large percentage of them would be unwise.
(In reply to Luke Kenneth Casson Leighton from comment #9) > table 13 p1353 1357 1358 Public v3.1, there are 12 spaces available. > using a large percentage of them would be unwise. yes, hence why I said: (In reply to Jacob Lifshay from comment #7) > if encoding space is an issue, imho Rc should be removed from dsld/dsrd as > that is imho much less useful. one other option is using a spr (ctr? maybe a new 64-bit spr named carry?) instead of rs/rc for those operations that aren't very useful without carrying, such as the muled* and maddedu operations, since the existing maddld/maddhd[u] can just be used instead. maddedus (because it's signed/unsigned mul which doesn't currently exist in PowerISA), divmod2du, dsld/dsrd, and pcdec are quite useful for 128-bit arithmetic without VSX, prefix-code dec/enc, etc. and should still be 4-arg operations.
(In reply to Jacob Lifshay from comment #10) > one other option is using a spr nice idea. greatly reduces opcode pressure but throws RS=R5+MAXVL under a bus > ctr? no. > maybe a new 64-bit spr named carry?) it becomes part of mandatory context-switch state. cannot have too many. share with GF polynomial? and i just realised, for parallel mode (RS=RT+MAXVL) a whopping 127 SPRs are needed. drat.
(In reply to Luke Kenneth Casson Leighton from comment #11) > and i just realised, for parallel mode (RS=RT+MAXVL) > a whopping 127 SPRs are needed. drat. imho that isn't a problem...the instructions just won't support RS=RT+MAXVL mode as it isn't very useful for those specific instructions, hence why I limited it to just those instructions that aren't useful except for bigint stuff, where you always want RS=RC.
(In reply to Jacob Lifshay from comment #12) > (In reply to Luke Kenneth Casson Leighton from comment #11) > > and i just realised, for parallel mode (RS=RT+MAXVL) > > a whopping 127 SPRs are needed. drat. > > the instructions just won't support RS=RT+MAXVL that's not acceptable.
(In reply to Luke Kenneth Casson Leighton from comment #13) > (In reply to Jacob Lifshay from comment #12) > > (In reply to Luke Kenneth Casson Leighton from comment #11) > > > and i just realised, for parallel mode (RS=RT+MAXVL) > > > a whopping 127 SPRs are needed. drat. > > > > the instructions just won't support RS=RT+MAXVL > > that's not acceptable. why not? if they have a spr output instead, then they don't even have RS or RC at all. why force them to implement a mode that isn't very useful for software (it isn't useful because we specifically limited the spr-versions to those ops where users always want that would be the RS=RC mode) when the motivating reason for having that RS=RT+MAXVL mode ceases to be valid?
(In reply to Jacob Lifshay from comment #14) > why not? parallel processing of independent inputs and production of independent parallel results. one single SPR prohibits such programming constructs. the loss of this option is not acceptable. this is non-negotiable and this will be the first and only time that i say that. i cannot cope any longer with the repeated insistent and persistent ignoring of decisions over the past three years, i have had enough.
(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Jacob Lifshay from comment #14) > > > why not? > > parallel processing of independent inputs and production of independent > parallel results. > > one single SPR prohibits such programming constructs. my point was that, for that particular subset of instructions (mostly just signed bigint * [un]signed scalar), support for such programming constructs aren't needed, because software will always want to either use it in dependency-chain mode (where speedups come from wide mul e.g. 256x64->320-bit mul) or can easily use instructions not in that subset to achieve the desired result, hence the restriction isn't preventing any usecases. > > the loss of this option is not acceptable. ok. do note that adde/subfe has the exact same issue, and you're apparently fine with there not being a vector of CA flags... we can just remove Rc from dsld/dsrd if needed to make space. imho signed bigint ops are waay more necessary than Rc on dsld/dsrd since they save a bunch of instructions. for muledsbs (signed bigint * signed word), we may be stuck with the non-independently-parallelizability (sorta, OoO can run multiple vectors in parallel if the queues are big enough) because it's looking like 64 bits isn't enough carry bits, we may need it to also read/write CA: working pseudocode so far: def muledsbs(ra, rb, rc, ca): if not ca: v = rc else: v = zext(rb + rc, xlen) v = sext(ra, xlen) * sext(rb, xlen) + sext(v, xlen) rt = zext(v, xlen) rs = zext(v >> xlen, xlen) ca = sext(ra, xlen) < 0 return rt, rs, ca
(In reply to Jacob Lifshay from comment #16) > for muledsbs (signed bigint * signed word), we may be stuck with the > non-independently-parallelizability (sorta, OoO can run multiple vectors in > parallel if the queues are big enough) because it's looking like 64 bits > isn't enough carry bits, we may need it to also read/write CA: > working pseudocode so far: > > def muledsbs(ra, rb, rc, ca): note that my current plan for how muledsbs is used is: # mul little-endian signed bigint in r4-5 by signed word in r3, result in r4-6 setvl VL=2 addic 0, 0, 0 # clear ca li 6, 0 # clear r6 sv.muledsbs *4, *4, 3, 6 # product now in r4-6 if you can come up with a way to implement muledsbs without needing ca and without instead needing to compute rc // rb, that'd be awesome!
(In reply to Jacob Lifshay from comment #16) > ok. do note that adde/subfe has the exact same issue, i am planning a CR-field-base add for exactly that reason. > for muledsbs (signed bigint * signed word), we may be stuck with the > non-independently-parallelizability (sorta, OoO can run multiple vectors in > parallel if the queues are big enough) because it's looking like 64 bits > isn't enough carry bits, we may need it to also read/write CA: absolutely not. that is 4-in 3-out including XER. please stop spending time on that instruction, i will never put it forward to the OPF ISA WG. for future reference, the hard limit is 3-in 2-out. i will never put forward to the ISA WG anything over that. (the GF polynomial ops are a special case as the entire ALU has a global polynomial which, if changed, causes a multi-cycle stall)
75 # [DRAFT] Double-width Shift Left Doubleword 76 77 VA2-Form 78 80 * dsld. RT,RA,RB,RC (Rc=1) 81 damn damn damn these are also 4-in 3-out which will be such a high cost in the Dependency Hazard Management that it will be impractical to implement. Rc=1 has to be eliminated. this is what independent parallel (RS=RT+MAXVL) result production helps solve: the result-pairs may be analysed (cmpi) and a third instruction (sv.or) used to merge the partial results together. a similar trick will have to be found for the other operations (using sv.subfe)
(In reply to Luke Kenneth Casson Leighton from comment #19) > a similar trick will have to be found for the other operations (using > sv.subfe) sigh hit send then realised that the [as-yet-still-being-thought-about] add/sub-by-taking-carry-from-a-CR-field-and-storing-the-carry-out-in-a-CR-as-well instruction could likely also help
discussion suggests adding signed and unsigned variants
(In reply to Luke Kenneth Casson Leighton from comment #21) > discussion suggests adding signed and unsigned variants as I understand it, the suggestion to add both signed and unsigned variants in the meeting today was for shadd, not for dsld/dsrd (though adding signed variants there too might be a good idea).
under bug #1010 the first Draft of ls003 has been submitted.