**** PLEASE NOTE ***** the RFP for this bugreport *pre-dates* NLnet's automated system. the record as part of the audit carried out under bug #1171 is consistent. there is a MANUAL record in NLnet's RFP system for Jacob of 2023-06-20: 2023-06-20 EUR 2300 t-2 Jacob Lifshay ***** Todo list: * TODO: update wiki to account for RC being high half of dividend for bigint-word division. * TODO: the EXTRA-bit needs a RM Mode qualifier (binutils as well), and it turns out the default for RS is not quite as in spec. see comment #53 * TODO: add extra-wide bigint/word divmod idea to wiki, described here: https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-October/005374.html https://libre-soc.org/openpower/sv/biginteger/ Giant Discussion on mailing list: https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004761.html Video lkcl made on draft version: https://youtu.be/8hrIG7-E77o My comments on that video (basically looks good except that division uses sv.madded and sv.subfe not sv.msubed and sv.subfe): https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:38:17 source code of knuth algorithms d and m https://git.libre-soc.org/?p=libreriscv.git;a=tree;f=openpower/sv/biginteger;hb=HEAD unit tests https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/bigint/bigint_cases.py;hb=HEAD#l149

(In reply to Jacob Lifshay from comment #0) > Couldn't find a bug for this so made one. ta. forgot. > My comments on that video (basically looks good except that division uses > sv.madded and sv.subfe not sv.msubed and sv.subfe): > https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19: > 38:17 msubed and madded are *not* the same as what was initially proposed (that was maddx and msubx). msubed and madded are the *new* idea (you came up with) but the divmnu.c source has *not* been updated. i easily worked out the mul case, updated mulmnu.c to match: div i can't handle.

(In reply to Luke Kenneth Casson Leighton from comment #1) > (In reply to Jacob Lifshay from comment #0) > > My comments on that video (basically looks good except that division uses > > sv.madded and sv.subfe not sv.msubed and sv.subfe): > > https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19: > > 38:17 > > msubed and madded are *not* the same as what was initially > proposed (that was maddx and msubx). yup. > > msubed and madded are the *new* idea (you came up with) but the > divmnu.c source has *not* been updated. yeah, as stated on irc, sorry i haven't yet fixed it. > > i easily worked out the mul case, updated mulmnu.c to match: > div i can't handle. yup.

(In reply to Jacob Lifshay from comment #2) > (In reply to Luke Kenneth Casson Leighton from comment #1) > > msubed and madded are *not* the same as what was initially > > proposed (that was maddx and msubx). > > yup. > > > > > msubed and madded are the *new* idea (you came up with) but the > > divmnu.c source has *not* been updated. > > yeah, as stated on irc, sorry i haven't yet fixed it. except that i only came up with madded, msubed is something that you apparently interpolated out of msubx and madded -- msubed isn't needed for the div algorithm since the required negation is already done by subfe.

(In reply to Jacob Lifshay from comment #3) > > yeah, as stated on irc, sorry i haven't yet fixed it. no problem. haven't been upstairs yet to acknowledge, sorry. > except that i only came up with madded, msubed is something that you > apparently interpolated out of msubx and madded it seemed logical at the time! but if not needed that's great, one less instruction and it entirely frees up the right half of EXT04 space. maddld maddhd MADDED maddhdu -- -- -- -- > -- msubed isn't needed for > the div algorithm since the required negation is already done by subfe. nggggh ok makes sense. mul... mul... arg yes of course, RC is used as the 64-bit carry on the *product*, nothing to do with the partial result, yes, that goes into subfe. so of course no mul-subtract. rright, some hatchet-jobs and writeup to do...

https://stackoverflow.com/questions/2661541/picking-good-first-estimates-for-goldschmidt-division goldschmidt division https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T22:34:12 https://en.m.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm

see also: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm another thing worth researching is: can goldschmidt's division algorithm be modified to work for polynomials over gf(2) (carry-less division/remainder)?

(In reply to Jacob Lifshay from comment #2) > yeah, as stated on irc, sorry i haven't yet fixed it. I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing a bug in mulmnu.c. I also formatting them and switched the tests to use an array of structs rather than a plain array -- avoiding the need to do a bunch of magic math to get the correct array indexes. I also added an indication to the test cases in divmnu64.c so that it knows which cases are the expected errors allowing it to handle them better.

I implemented a WIP version of goldschmidt division, for the 128x64->64-bit division instructions. I didn't yet figure out the correct values for the additional internal precision needed, but I found a paper describing how to calculate them ... they showed that the parameters used for AMD's K8 were too loose, meaning the K8 has more internal precision than needed, wasting gates. https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=cc2c51cf2cd7b22f7d18871fbfb73e0b7707d91a The paper I found: Even, G., Seidel, P. M., & Ferguson, W. E. (2003). A Parametric Error Analysis of Goldschmidt's Division Algorithm. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.1238&rep=rep1&type=pdf

(In reply to Jacob Lifshay from comment #7) > I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing > a bug in mulmnu.c. nice. > I also formatting them and switched the tests to use an array of structs > rather than a plain array -- avoiding the need to do a bunch of magic math > to get the correct array indexes. :) > I also added an indication to the test cases in divmnu64.c so that it knows > which cases are the expected errors allowing it to handle them better. yes it was a little odd format.

245 io_width: int 246 """bit-width of the input divisor and the result. 247 the input numerator is `2 * io_width`-bits wide. 248 """ remember to either split these by newline, or: * shorten the parameter (abbrev wds: wid not width) * use a semicolon * use single-quotes (""" is not a hard requirement) * remove the unnecessary type information * use less words ("the" is the first casualty) the priority is to get these below 80 chars and to get them compact. extra words *do not* increase clarity (just as ironically not *enough* causes ambiguity) example: io_wid; "input and result bit-width. numerator is 2xio_wid" it's a tricky balance.

oink moment, why are we not doing Big-Integer Goldschmidt? :) https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html

https://www.researchgate.net/publication/44384708_Improved_algorithms_and_hardware_designs_for_division_by_convergence apparently larger multipliers are required

(In reply to Luke Kenneth Casson Leighton from comment #11) > oink moment, why are we not doing Big-Integer Goldschmidt? :) > > https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html because goldschmidt's algorithm is much slower at small bigint sizes (small enough that the fastest multiplication algorithm is O(N^2)): for a 2*N x N -> N word division, Algorithm D does approximately 2*N*N word operations -- Goldschmidt Division does around log2(N)*8*N*N word operations. When bigints are bigger, then there are faster software algorithms than either Algorithm D or Goldschmidt Division, such as divide and conquer algorithms (good for medium-length bigints) and Block-Wise Barrett Division (good for large bigints). There's also the important case of a bigint divided by a single word, where the algorithm is basically: uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1] uint64_t carry = 0; for(i = n_size; i > 0; i--) { uint128_t v = ((uint128_t)carry << 64) | n[i - 1]; carry = v % d; n[i - 1] = v / d; } // n is the quotient, carry is the remainder see https://gmplib.org/manual/Division-Algorithms

Replying to: https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html > a new 128/64 scalar divide instruction would be > > dividend[0:(XLEN*2)-1] <- (RA) || (RS) > > where RS=RA+1 just like lq and stq. no...it should instead be: divrem2du rt, ra, rb, rc v = (ra << 64) | rb; d = rc; // copy first in case rt is rc rt = UDIV(v, rc); ra = UREM(v, rc); because that's needed for bigint by word division: uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1] uint64_t carry = 0; for(i = n_size; i > 0; i--) { uint128_t v = ((uint128_t)carry << 64) | n[i - 1]; carry = v % d; n[i - 1] = v / d; } // n is the quotient, carry is the remainder li carry, 0 set_remap_reverse sv.divrem2du n.v, carry.s, n.v, d.s // n is the quotient, carry is the remainder we'd probably also want a divrem2d for signed div/rem since that's also needed for faster 128-bit arithmetic (which is likely way more common than bigints)

(In reply to Jacob Lifshay from comment #14) > no...it should instead be: > divrem2du rt, ra, rb, rc > v = (ra << 64) | rb; > d = rc; // copy first in case rt is rc > rt = UDIV(v, rc); > ra = UREM(v, rc); whoops, i meant: rt = UDIV(v, d); ra = UREM(v, d);

(In reply to Jacob Lifshay from comment #14) > Replying to: > https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html > > a new 128/64 scalar divide instruction would be > > > > dividend[0:(XLEN*2)-1] <- (RA) || (RS) > > > > where RS=RA+1 just like lq and stq. > > no...it should instead be: > divrem2du rt, ra, rb, rc that's a no on any more VA-Form 4-operand math operations, and a hard-no because there's no precedent to follow. madd* etc. set the precedent already so just barely justifies one additional instruction. following lq/stq is sort-of-fine [i may have got the dividend the wrong way round, (RS) || (RA) rather than (RA) || (RS). > we'd probably also want a divrem2d for signed div/rem since that's also > needed for faster 128-bit arithmetic (which is likely way more common than > bigints) again: hard no. there does not exist a divrem scalar instruction therefore justifying the addition of four+ instructions when the space in EXT04 is so precious is not something i am comfortable - at all - putting in front of the OPF ISA WG. when there is opcode pressure like this we simply cannot just blithely and arbitrarily propose instructions. especially when the other options - big-integer goldschmidt - have not been explored. i made a start here: https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divgnu64.c;hb=HEAD

(In reply to Luke Kenneth Casson Leighton from comment #16) > (In reply to Jacob Lifshay from comment #14) > > Replying to: > > https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html > > > a new 128/64 scalar divide instruction would be > > > > > > dividend[0:(XLEN*2)-1] <- (RA) || (RS) > > > > > > where RS=RA+1 just like lq and stq. > > > > no...it should instead be: > > divrem2du rt, ra, rb, rc > > that's a no on any more VA-Form 4-operand math operations, and a hard-no > because there's no precedent to follow. there's sorta precedent on other architectures: the x86 div and idiv instructions that do 128 by 64-bit div/rem. also sparc has 64 by 32-bit divide instructions. > madd* etc. set the precedent > already so just barely justifies one additional instruction. > > following lq/stq is sort-of-fine [i may have got the dividend the wrong > way round, (RS) || (RA) rather than (RA) || (RS). no, because that bigint by word division needs the high half of the numerator to be scalar and the low half to be vector. having an implicit high half means they have to be both scalar or both vector--unusable to do division with a single sv.divrem2du. > > we'd probably also want a divrem2d for signed div/rem since that's also > > needed for faster 128-bit arithmetic (which is likely way more common than > > bigints) > > again: hard no. > > there does not exist a divrem scalar instruction therefore justifying > the addition of four+ two: divrem2du and divrem2d. > instructions when the space in EXT04 is so precious > is not something i am comfortable - at all - putting in front of the > OPF ISA WG. > > when there is opcode pressure like this we simply cannot just blithely > and arbitrarily propose instructions. > > especially when the other options - big-integer goldschmidt - have not > been explored. i already explained that, for smaller bigints (around 8 words), big-integer goldschmidt needs like 30x as many word operations as algorithm d. algorithm d is necessary. > i made a start here: > > https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/ > divgnu64.c;hb=HEAD

(In reply to Jacob Lifshay from comment #14) > Replying to: > because that's needed for bigint by word division: > uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1] > uint64_t carry = 0; > for(i = n_size; i > 0; i--) > { > uint128_t v = ((uint128_t)carry << 64) | n[i - 1]; > carry = v % d; > n[i - 1] = v / d; > } ... oink. argh argh damnit. missed this. that's going to put huge pressure on those 4 remaining VA-Form slots in EXT04. how can that be justified to OPF ISA WG. ok let me think it through. * v is 128 bit constructed from 2 64-bit. * output1 is modulo of divisor * output2 is modulo of divisor by careful selection the same RC trick applies as to bigmul (madded) by carefull-er selection it can also do scalar 128/64 div and/or mod which also makes it useful for Algorithm D. if RC contains carry and RC=0 as well as RT=0 is allowed then selection of whether it *only* performs div or only mod is possible. by having an astounding amount of options i think we have a justification for this precious opcode space. let's see... adapt to swap RA and RC to make it similar to madded, add zero conditional checking... divrem2du rt, ra, rb, rc v = (rc << 64) | rb; d = ra; // copy first in case rt is ra if rt != 0 rt = UDIV(v, d); if RS != 0 RS = UREM(v, d); same types of modes: * RS=RT+VL mode when EXTRA bit 8 is 0 (default for scalar) * RS=RC mode when EXTRA bit 8 is 1 damnit signalling when not to write the remainder is difficult because RS is implicit and can be a source. how about: if rt != ra rt = UDIV(v, d); if RT != 0 RS = UREM(v, d); messy. if there were not 4 slots remaining in the EXT04 VA-Form i'd say use 1 bit to decide if mod is written, and use RT!=0 for div result, but that takes the *entire* space if divrem2ds is added as well. tough call.

(In reply to Jacob Lifshay from comment #17) > > following lq/stq is sort-of-fine [i may have got the dividend the wrong > > way round, (RS) || (RA) rather than (RA) || (RS). > > no, because that bigint by word division needs the high half of the > numerator to be scalar and the low half to be vector. having an implicit > high half means they have to be both scalar or both vector--unusable to do > division with a single sv.divrem2du. yep got it now, was still thinking in terms of the (early) mulx, i hadn't twigged about the feedback loop option just like in madded. > two: divrem2du and divrem2d. times two for one bit to select if mod is written. have to think that through, can it be achieved with RA or RT zero to stop div or rem being stored. > i already explained that, for smaller bigints (around 8 words), big-integer > goldschmidt needs like 30x as many word operations as algorithm d. algorithm > d is necessary. yeah i'm catching up. went to bugzilla and saw 2 posts, my brain was in linux-microwatt mode. i'll need to do a mental walkthrough and also it would be useful to split out in the divmnu64.c you can see i created some functions bigmul bigsub bigadd in divgnu64.c

ok i updated the analysis page, no conditional "if RT != 0" because i realised, if going to do that, might as well use X-Form RT,RA,RB with implied RS=RA+VL and do dividend = (RA) || (RS) and have divqd modqd for 128/64 that reduces pressure on EXT04. now, question is, does signed-div make any sense? it didn't in big-mul because a corrective factor can be done after.

(In reply to Luke Kenneth Casson Leighton from comment #20) > now, question is, does signed-div make any sense? it didn't > in big-mul because a corrective factor can be done after. line 34 https://github.com/hcs0/Hackers-Delight/blob/master/mulmns.c.txt

(In reply to Luke Kenneth Casson Leighton from comment #20) > now, question is, does signed-div make any sense? I think it does because I'm pretty sure it can be used to make a faster 128-bit signed integer div/rem algorithm. I couldn't find any decent examples since I think everyone either implements it in c in terms of u128 div/rem implemented in c (aka slow inefficient because c doesn't have 128 by 64-bit div), or implements it in terms of an assembly u128 div algorithm with a comment that the algorithm should be optimized more...

(In reply to Luke Kenneth Casson Leighton from comment #20) > ok i updated the analysis page, > `divrem2du RT,RA,RB,RC` > > divisor = (RC) || (RB) > dividend = EXTZ128(RA) > RT = UDIV(dividend, divisor) > RS = UREM(dividend, divisor) > > Again, in an SVP64 context, using EXTRA mode bit 8 allows for > selecting whether `RS=RC` or > `RS=RT+VL`. looks good!

how about inverting un and vn if -ve then correcting if signs not equal? inversion of bigint can be done with RB=scalar-zero, subtract RA as vector. in case where n=1 then inversion of scalar v is easy. leaves overflow etc to deal with but not a big deal. am all for avoiding adding unnecessary ops here.

(In reply to Jacob Lifshay from comment #23) > (In reply to Luke Kenneth Casson Leighton from comment #20) > > divisor = (RC) || (RB) > > dividend = EXTZ128(RA) > > RT = UDIV(dividend, divisor) > > RS = UREM(dividend, divisor) > looks good! detecting the overflow conditions need adapting, in fixedarith divdeu it is: dividend[0:(XLEN*2)-1] <- (RA) || [0]*XLEN divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB) result <- dividend / divisor if (RA) <u (RB) then RT <- result[XLEN:(XLEN*2)-1] overflow <- 0 is it as simple as this: dividend = (RC) || (RB) divisor = EXTZ128(RA) if (((RB) = 0) & ((RC) <u (RA)) | ((RB) != 0) & ((RC)+1 <u (RA))) then RT = UDIV(dividend, divisor) RS = UREM(dividend, divisor) overflow = 0 or is it necessary to do this (a 192-bit compare) if ([0]*XLEN || dividend) <u (divisor || [0]*XLEN) it's looking for the scenario where the result cannot fit into 64 bit, right?

(In reply to Luke Kenneth Casson Leighton from comment #25) > (In reply to Jacob Lifshay from comment #23) > > (In reply to Luke Kenneth Casson Leighton from comment #20) > > > divisor = (RC) || (RB) > > > dividend = EXTZ128(RA) > > > RT = UDIV(dividend, divisor) > > > RS = UREM(dividend, divisor) > > looks good! actually, that's *totally broken*...I just now noticed. you want dividend to be (RC) || (RB), not divisor. they're accidentally swapped. > detecting the overflow conditions need adapting, > in fixedarith divdeu it is: > it's looking for the scenario where the result > cannot fit into 64 bit, right? yeah. I'd just write it as: `divrem2du RT,RA,RB,RC` dividend = (RC) || (RB) divisor = (RA) if divisor == 0 or (RC) >= divisor: overflow = 1 RT = 0xFFFF_FFFF_FFFF_FFFF # saturate .. makes some stuff nicer RS = 0 # pick a value that won't be bigger than RC else: RT = UDIV(dividend, EXTZ128(divisor)) RS = UREM(dividend, EXTZ128(divisor))

(In reply to Luke Kenneth Casson Leighton from comment #25) > is it as simple as this: > > dividend = (RC) || (RB) > divisor = EXTZ128(RA) > if (((RB) = 0) & ((RC) <u (RA)) | > ((RB) != 0) & ((RC)+1 <u (RA))) then the reason you only need to compare RC with RA and don't need to compare RB is because when RC == RA - 1, then all values of RB will still cause the division to round down to fitting in 64-bits. when RC == RA, if RB == 0 that's still big enough to overflow, if RB > 0 then that's even bigger and will definitely overflow.

yep corrected pseudocode, spotted late at night compared to divdeu, updated. RA <u RB makes sense now, RA || fffff..ff / RB still fits (not 0x1_0000_0000) no on non-orthogonal behaviour on overflow, UNDEFINED and has to be that way. otherwise a separate RFC has to be made to get *all other* div ops to be so defined (fffffff) and it's too late for that. bit of a pain.

128 // Compute estimate qhat of q[j] from top 2 digits 129 uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1]; 130 qhat = dig2 / vn[n - 1]; 131 rhat = dig2 % vn[n - 1]; ehm.... ehmehmehm... qhat and rhat are uint64_t (and un/vn 32 bit*) therefore when vn/un are uint64_t[] qhat and rhat must be uint128_t which is NOT happening, we are NOT going to return QTY 4 64 bit registers from divmod2du. thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64 divide will occur. any clues?

(In reply to Luke Kenneth Casson Leighton from comment #29) > thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64 > divide will occur. > > any clues? working it out (and this is for n=0): 98 k = 0; // the case of a 99 for (j = m - 1; j >= 0; j--) 100 { // single-digit 101 uint64_t dig2 = (k * b + u[j]); 102 q[j] = dig2 / v[0]; // divisor here. 103 k = dig2 % v[0]; // modulo bak into next loop 104 } * on first entry, k=0 and v[0] must be >0 otherwise div-overflow - therefore RA (0) < RB (v[0]) and we're good. - and RS = modulo v[0] which is *ALWAYS*, by definition, between 0 and v[0]-1 * on second entry therefore, k will ALWAYS be between 0 and v[0]-1 - therefore RA will ALWAYS be below RB (v[0]) so this is good, for the scalar-vector case. but, for that qhat estimate, it goes pearshaped... with the code as-written. if instead however divmnu64.c calls *itself* (because the division estimate is to be on the MSB of v) then it should end up using k=0 lines 98-104 and all will be well

ha. simple. this: 128 // Compute estimate qhat of q[j] from top 2 digits 129 uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1]; 130 qhat = dig2 / vn[n - 1]; 131 rhat = dig2 % vn[n - 1]; becomes a 2-long unrolling of the loop @ line 98, like this: uint64_t dig1 = (uint64_t)un[j + n]; qhat = dig1 / vn[n - 1]; rhat = dig1 % vn[n - 1]; uint64_t dig2 = (rhat << 32) | un[j + n - 1]; qhat = qhat<<32 | dig2 / vn[n - 1]; rhat = rhat<<32 | dig2 % vn[n - 1]; (translate to 128/64 rather than 64/32 of course).

(In reply to Luke Kenneth Casson Leighton from comment #29) > 128 // Compute estimate qhat of q[j] from top 2 digits > 129 uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1]; > 130 qhat = dig2 / vn[n - 1]; > 131 rhat = dig2 % vn[n - 1]; > > ehm.... ehmehmehm... qhat and rhat are uint64_t (and un/vn 32 bit*) > therefore when vn/un > are uint64_t[] qhat and rhat must be uint128_t which is NOT happening, > we are NOT going to return QTY 4 64 bit registers from divmod2du. > > thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64 > divide will occur. > > any clues? yes...knuth's original algorithm is written in mix assembly where the operation used is a 2-word by 1-word div/rem where both quotient and remainder are 1 word. the only case where that can overflow is where un[j + n] == vn[n - 1] in which case it sets qhat to the max value that fits in a single word (in the 64-bit case, 2**64-1). the initial normalization step creates the invariant that un[j + n] <= vn[n - 1] and that invariant is maintained by the loop. the c code just has qhat and rhat be 2-word values to prevent UB when the overflow occurs, and also for convenience to avoid needing to constantly cast them to 2-word values. they can totally be 1-word values at the cost of putting the initial qhat computation inside of an if that checks for overflow. original algorithm step: D3. [Calculate q̂.] If uⱼ = v₁, set q̂ ← b - 1; otherwise set q̂ ← ⌊(uⱼb + uⱼ₊₁)/v₁⌋. Now test if v₂q̂ > (uⱼb + uⱼ₊₁ - q̂v₁)b + uⱼ₊₂; if so, decrease q̂ by 1 and repeat this test. (The latter test determines at high speed most of the cases in which the trial value q̂ is one too large, and it eliminates all cases where q̂ is two too large; see exercises 19, 20, 21.) in ascii: D3. [Calculate qhat.] If u[j] = v[1], set qhat <- b - 1; otherwise set qhat <- floor((u[j]*b + u[j+1])/v[1]). Now test if v[2]*qhat > (u[j]*b + u[j+1] - qhat*v[1])*b + u[j+2]; if so, decrease qhat by 1 and repeat this test. (The latter test determines at high speed most of the cases in which the trial value qhat is one too large, and it eliminates all cases where qhat is two too large; see exercises 19, 20, 21.)

(In reply to Jacob Lifshay from comment #32) > yes...knuth's original algorithm is written in mix assembly where the > operation used is a 2-word by 1-word div/rem where both quotient and > remainder are 1 word. the only case where that can overflow is where un[j + > n] == vn[n - 1] in which case it sets qhat to the max value that fits in a > single word (in the 64-bit case, 2**64-1). hence the idea of setting to all 1s. i think... given this is a new type of instruction (dual divmod) we can get away with that. elwidth overrides make it useful for other data widths.

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=b6044fdb84bea61887999ffa5e17543fa91b8f89 good, whew. i have to explain all this :) caveat here: divmid2du communicating "overflow", that's... expensive. CR0 and/or OE=1. OE=1 is not going to fit into VA-Form EXT04. each is an extra register. making the operation always "divmid2du." would do the job, but, dang, that's 3-in, 3-out.

(In reply to Luke Kenneth Casson Leighton from comment #34) > https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff; > h=b6044fdb84bea61887999ffa5e17543fa91b8f89 > > good, whew. i have to explain all this :) > > caveat here: divmid2du communicating "overflow", that's... expensive. > CR0 and/or OE=1. OE=1 is not going to fit into VA-Form EXT04. > each is an extra register. we'd want writing to CR0, since it saves an instruction...otherwise you'd need the move OV to CR0 instruction before the conditional branch. > > making the operation always "divmid2du." would do the job, but, dang, > that's 3-in, 3-out. well...is 3-in 3-out more expensive than needing an extra compare instruction in an important use of divmodqdu? i guess we could, instead of writing CR0 or OV, just have it compare like so: cmpl n_hi, d divmodqdu qhat, d, n_lo, n_hi the cmpl takes extra resources...in a non-superscalar system takes an extra clock cycle. we can't put it after the divmodqdu (which would save a clock cycle on non-superscalar systems) since n_hi is overwritten, though otoh we likely need to have another copy of n_hi in a register anyway and cmpl could just use that.

(In reply to Jacob Lifshay from comment #35) > well...is 3-in 3-out more expensive than needing an extra compare > instruction in an important use of divmodqdu? hell yes. that's 6 FFs inside an entire Dependency Matrix Row. > i guess we could, instead of writing CR0 or OV, just have it compare like so: > cmpl n_hi, d > divmodqdu qhat, d, n_lo, n_hi much better / sane

See also: LLVM adding big-int division (fixed-length ints only) to its compiler builtins library, allowing LLVM to correctly compile all int types (even 16384 bit). https://reviews.llvm.org/D120327

Mentioned our work on Knuth's Algorithm D over on LLVM Phabricator where they're currently implementing big-int division: https://reviews.llvm.org/D120327#3493233 I have been helping review that merge request -- they originally had Algorithm D but switched to a slower simpler algorithm, I convinced them that the library API should support replacing the algorithm with Algorithm D again, which requires passing in sufficiently sized buffers and allowing the inputs to be modified.

just realised that another variant of madded, call it say maddd, could be used for high accuracy dotproduct (bug #555). it would be: prod[0:127] = (RA) * (RB) sum[0:127] = (RC || [0]*64) + prod RT <- sum[64:127] RS <- sum[0:63] # RS is either RC or RT+VL but the question is: is it worth doing in the integer space? in FP, using 2 regs to store the intermediate result (FP128), yes. but integer? honestly i have no idea

(In reply to Luke Kenneth Casson Leighton from comment #39) > just realised that another variant of madded, call it say maddd, could be > used for high accuracy dotproduct (bug #555). > > it would be: > > prod[0:127] = (RA) * (RB) > sum[0:127] = (RC || [0]*64) + prod > RT <- sum[64:127] > RS <- sum[0:63] # RS is either RC or RT+VL that isn't what you want...you want the existing maddld instruction, but extended to have rt and rc be 128-bit register pairs. > > but the question is: is it worth doing in the integer space? imho no. the existing integer maddld already returns exact results for dot product, mod 2^64. the only way we might want a different instruction is for 64x64->128-bit dot product, which would be a 4-in 2-out instruction, which imho is too many regs, otherwise we'd also be ok with bigint*word+bigint muladd which is a 4-in 2-out op which you rejected previously. > in FP, using 2 regs to store the intermediate result (FP128), this has the exact same problem as bigint*word+bigint muladd which is a 4-in 2-out op...admittedly it has a better motivation than integer dot product though. another alternative for fp dot product with higher precision is just doing a vector fmul followed by a tree reduction fadd, which retains most of the precision due to values being rounded O(log N) times rather than O(N) times from a serial reduction.

(In reply to Jacob Lifshay from comment #40) > that isn't what you want...you want the existing maddld instruction, but > extended to have rt and rc be 128-bit register pairs. damn. which is 4-in, 2-out. 6 64-bit register hazards is far too much. > another alternative for fp dot product with higher precision is just doing a > vector fmul followed by a tree reduction fadd, which retains most of the > precision due to values being rounded O(log N) times rather than O(N) times > from a serial reduction. like it.

apparently Rust is getting bigint helper functions too: tracking issue: https://github.com/rust-lang/rust/issues/85532 public API: > // On unsigned integers: > > /// `self + rhs + carry` (full adder) > fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool); > > /// `self - rhs - carry` (full "subtractor") > fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool); > > /// `self * rhs + carry` (multiply-accumulate) > fn carrying_mul(self, rhs: Self, carry: Self) -> (Self, Self); > > /// `self * rhs` (wide multiplication, same as `self.carrying_mul(rhs, 0)`) > fn widening_mul(self, rhs: Self) -> (Self, Self); > > > // On signed integers: > > /// `self + rhs + carry` (full adder) > fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool); > > /// `self - rhs - carry` (full "subtractor") > fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);

I added assembly/simulation support for all bigint ops: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=864d526726aa2ed5497ca6ebe4445a021e16ef9a I added unit tests for all bigint operations: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=63530e061b28e868f038c3a5515db50c3fb2b9c8 I renamed madded -> maddedu for consistency with maddhdu: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c67ae449d2715354fede681f1f8266df3627d211 https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=6ecafc9cfa1cf13f25fa851d012b20989d770108 I also renamed divrem2du -> divmod2du for consistency with mod* instructions: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ed9c2e70d09321e94f114914a0c5b9acb4a6fc47 https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=2fbac071499ad33d5bfc91d07d768df84b5ec84d

commit f7468f57cdabc5685feaf503f4a5f9b669a35de7 (HEAD -> master) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Thu Sep 29 14:57:26 2022 +0100 add carry-roll-over-vector-mul-with-add (!) unit test test_caller_svp64_bigint.py https://bugs.libre-soc.org/show_bug.cgi?id=937 https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f7468f57cdabc5685feaf503f4a5f9b669a35de7

(In reply to Luke Kenneth Casson Leighton from comment #44) > https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff; > h=f7468f57cdabc5685feaf503f4a5f9b669a35de7 reading through the code, there are 2 issues: * imho the tests should be like test_caller_svp64_fptrans.py and fptrans_cases.py where all the cases are in fptrans_cases.py and are just run by test_caller_svp64_fptrans.py this allows soc.git to easily use those test cases * why are you using addeo?! adde is sufficient and more efficient.

(In reply to Jacob Lifshay from comment #45) > reading through the code, there are 2 issues: > * imho the tests should be like test_caller_svp64_fptrans.py and > fptrans_cases.py where all the cases are in fptrans_cases.py and are just > run by test_caller_svp64_fptrans.py > this allows soc.git to easily use those test cases i know. cut/paste. > * why are you using addeo?! adde is sufficient and more efficient. didn't work. checking p75 3.3.9 book I addic, addic., subfic, addc, subfc, adde, subfe, addme, subfme, addze, and subfze always set CA checked caller.py looks good... tried sv.adde... works perfectly. good catch.

(In reply to Luke Kenneth Casson Leighton from comment #46) > (In reply to Jacob Lifshay from comment #45) > > > reading through the code, there are 2 issues: > > * imho the tests should be like test_caller_svp64_fptrans.py and > > fptrans_cases.py where all the cases are in fptrans_cases.py and are just > > run by test_caller_svp64_fptrans.py > > this allows soc.git to easily use those test cases I moved all tests to bigint_cases.py and changed test_caller_svp64_bigint.py to basically be a copy of test_caller_svp64_fptrans.py I also cleaned up the last few spots where you had sv.addeo in openpower-isa.git and libreriscv.git

(In reply to Jacob Lifshay from comment #47) > I moved all tests to bigint_cases.py and changed test_caller_svp64_bigint.py > to basically be a copy of test_caller_svp64_fptrans.py > I also cleaned up the last few spots where you had sv.addeo in > openpower-isa.git and libreriscv.git ahh thank you.

https://www.felixcloutier.com/x86/idiv did some investigation, divmod2du is basically identical to idivq ELSE IF OperandSize = 64 (* Doublequadword/quadword operation *) temp := RDX:RAX / SRC; (* Signed division *) IF (temp > 7FFFFFFFFFFFFFFFH) or (temp < 8000000000000000H) (* If a positive result is greater than 7FFFFFFFFFFFFFFFH or a negative result is less than 8000000000000000H *) THEN #DE; (* Divide error *) ELSE RAX := temp; RDX := RDE:RAX SignedModulus SRC; FI; FI; which is 3-in, 2-out. * RDX:RAX is the 128 bit divisor (2 64bit regs) * SRC is the dividend (1 64bit reg) * RAX:RDX is the twin div/mod result dividend[0:(XLEN*2)-1] <- (RC) || (RA) divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB) result <- dividend / divisor modulo <- dividend % divisor RT <- result[XLEN:(XLEN*2)-1] RS <- modulo[XLEN:(XLEN*2)-1]

(In reply to Luke Kenneth Casson Leighton from comment #49) > https://www.felixcloutier.com/x86/idiv > > did some investigation, divmod2du is basically identical to idivq actually it's equivalent to x86's divq, idivq is signed division, divq is unsigned, divmod2du is unsigned. I mentioned divq/idivq back in: https://bugs.libre-soc.org/show_bug.cgi?id=817#c17

(In reply to Jacob Lifshay from comment #50) > actually it's equivalent to x86's divq, idivq is signed division, divq is > unsigned, divmod2du is unsigned. > > I mentioned divq/idivq back in: > https://bugs.libre-soc.org/show_bug.cgi?id=817#c17 brilliant, well spotted, missed that. editing now

added a todo list to the top comment (replicated here for convenience of reading via email): * TODO: update wiki to account for RC being high half of dividend for bigint-word division.

(In reply to Jacob Lifshay from comment #0) > Todo list: > * TODO: update wiki to account for RC being high half of dividend for > bigint-word division. sigh there is a bit more, the EXTRA-bit needs a RM Mode qualifier (binutils as well), and it turns out the default for RS is not quite as in spec.

last demo/test/primitive needed, using sv.divmod: 73 unsigned long bigdiv(unsigned v, unsigned q[], unsigned const u[], int m) 74 { 75 unsigned long long k = 0; // the case of a 76 for (int j = m - 1; j >= 0; j--) 77 { // single-digit 78 unsigned long d2 = (k << 32) | u[j]; 79 q[j] = d2 / v; // divisor here. 80 k = d2 % v; 81 } 82 return k; 83 } this should literally be just the one instruction. the others (sv.shl, sv shr, sv.madded) all have simple demos. fascinatingly for ec25519 which only needs 256 bit arithmetic not even loops are needed. 4x64-bit regs should be enough.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=13dd59db06e7c5ed50d25fd513ff540b72151eb6 sv.divmod2du works, /mrr (reverse-gear) has to be used, to get from high-end down to low-end, exactly as in Knuth Algorithm D.

one thing i encountered while implementing toom-cook multiplication: it would be very handy to have a half-signed version of maddedu, for multiplication of an unsigned bigint by a signed multiplier. this arises because toom-cook multiplication sometimes uses signed multiplication internally, and multiplying one bigint by the msb word of a signed bigint requires unsigned-bigint-signed-word multiplication. proposed: maddedsu RT, RA, RB, RC lhs[0:127] <- ZEXT((RA)) rhs[0:127] <- SEXT((RB)) addend[0:127] <- SEXT((RC)) product[0:127] <- lhs * rhs sum[0:127] <- product + addend RT <- sum[64:127] RS <- sum[0:63] unsigned bigint * signed word: # unsigned bigint input in r32..47 # signed word input in r3 # VL=16 sv.addi 48, 0, 0 sv.maddedsu *32, *32, 3, 48 # product signed bigint in r32..48 (note additional word)

signed-signed multiply probably can't be done with only one sv.maddeds-style instruction because only the msb word in a signed bigint is signed, all lower words are always unsigned. that would likely require different behavior for the msb word compared to all lower words.

(In reply to Jacob Lifshay from comment #56) > proposed: > maddedsu RT, RA, RB, RC should probably actually be named maddedus because RA is unsigned and RB is signed

(In reply to Jacob Lifshay from comment #58) > should probably actually be named maddedus because RA is unsigned and RB is > signed hmmm hmmm these are among the precious EXT04 VA-Form, but should be justifiable: there is mullw/mulhw etc. already. can you do the pseudocode?

(In reply to Luke Kenneth Casson Leighton from comment #59) > (In reply to Jacob Lifshay from comment #58) > > > should probably actually be named maddedus because RA is unsigned and RB is > > signed > > hmmm hmmm these are among the precious EXT04 VA-Form, but should > be justifiable: there is mullw/mulhw etc. already. can you do > the pseudocode? yup, done! I picked XO 57 -- 111001 https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d7967c3c21fbf1d37f200b9284232aec15a7fadd I also XLEN-ified maddedu and fixed case_sv_bigint_shift_left_then_back while I was at it. the bigint/sv-bigint tests should pass now.

(In reply to Jacob Lifshay from comment #60) > yup, done! fantastic, ah and aunit test, awesome. > I picked XO 57 -- 111001 great, i'll check it in the tables later.

(In reply to Jacob Lifshay from comment #57) > signed-signed multiply probably can't be done with only one sv.maddeds-style > instruction because only the msb word in a signed bigint is signed, all > lower words are always unsigned. that would likely require different > behavior for the msb word compared to all lower words. i have an idea for avoiding needing different behavior: https://bugs.libre-soc.org/show_bug.cgi?id=960#c7

i'm closing this one because ls003 has now been submitted. https://bugs.libre-soc.org/show_bug.cgi?id=1010