Couldn't find a bug for this, so made one.
https://libre-soc.org/openpower/sv/biginteger/

Giant discussion on the mailing list:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004761.html

Video lkcl made on the draft version: https://youtu.be/8hrIG7-E77o

My comments on that video (basically looks good, except that division uses
sv.madded and sv.subfe, not sv.msubed and sv.subfe):
https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:38:17

(In reply to Jacob Lifshay from comment #0)
> Couldn't find a bug for this so made one.

ta. forgot.

> My comments on that video (basically looks good except that division uses
> sv.madded and sv.subfe not sv.msubed and sv.subfe):
> https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:38:17

msubed and madded are *not* the same as what was initially proposed
(that was maddx and msubx). msubed and madded are the *new* idea (you came
up with), but the divmnu.c source has *not* been updated.

i easily worked out the mul case and updated mulmnu.c to match:
div i can't handle.

(In reply to Luke Kenneth Casson Leighton from comment #1)
> msubed and madded are *not* the same as what was initially
> proposed (that was maddx and msubx).

yup.

> msubed and madded are the *new* idea (you came up with) but the
> divmnu.c source has *not* been updated.

yeah, as stated on irc, sorry i haven't yet fixed it.

> i easily worked out the mul case, updated mulmnu.c to match:
> div i can't handle.

yup.

(In reply to Jacob Lifshay from comment #2)
> msubed and madded are *not* the same as what was initially
> proposed (that was maddx and msubx).

yup.

> msubed and madded are the *new* idea (you came up with) but the
> divmnu.c source has *not* been updated.

yeah, as stated on irc, sorry i haven't yet fixed it.

except that i only came up with madded; msubed is something that you
apparently interpolated out of msubx and madded. msubed isn't needed for
the div algorithm, since the required negation is already done by subfe.

(In reply to Jacob Lifshay from comment #3)
> yeah, as stated on irc, sorry i haven't yet fixed it.

no problem. haven't been upstairs yet to acknowledge, sorry.

> except that i only came up with madded, msubed is something that you
> apparently interpolated out of msubx and madded

it seemed logical at the time! but if not needed that's great: one less
instruction, and it entirely frees up the right half of EXT04 space.

    maddld maddhd MADDED maddhdu
    --     --     --     --

> msubed isn't needed for
> the div algorithm since the required negation is already done by subfe.

nggggh ok makes sense. mul... mul... arg yes of course, RC is used as the
64-bit carry on the *product*, nothing to do with the partial result:
yes, that goes into subfe. so of course no mul-subtract.

rright, some hatchet-jobs and writeup to do...

goldschmidt division:
https://stackoverflow.com/questions/2661541/picking-good-first-estimates-for-goldschmidt-division
https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T22:34:12
https://en.m.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm

see also:
https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm

another thing worth researching: can goldschmidt's division algorithm be
modified to work for polynomials over gf(2) (carry-less
division/remainder)?
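For reference, "carry-less" here means long division of GF(2) polynomials, where addition is XOR and there are no carries. A minimal sketch of the remainder operation, by plain long division (not the Goldschmidt variant -- that research question stays open); `clrem` is a hypothetical helper name, and `__builtin_clzll` assumes GCC/Clang:

```c
#include <assert.h>
#include <stdint.h>

// Carry-less (GF(2) polynomial) remainder: bits of n and d are
// polynomial coefficients, addition is XOR. Plain long division,
// sketched only to pin down the operation being asked about.
static uint64_t clrem(uint64_t n, uint64_t d)
{
    assert(d != 0);
    int dbits = 64 - __builtin_clzll(d); // degree of d, plus one
    for (int shift = 64 - dbits; shift >= 0; shift--) {
        // if the coefficient at degree (dbits-1+shift) is set,
        // cancel it by XORing in the shifted divisor
        if ((n >> (dbits - 1 + shift)) & 1)
            n ^= d << shift;
    }
    return n;
}
```

e.g. `clrem(0xb, 0x3)` reduces x^3+x+1 modulo x+1, giving 1.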

(In reply to Jacob Lifshay from comment #2)
> yeah, as stated on irc, sorry i haven't yet fixed it.

I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing
a bug in mulmnu.c. I also reformatted them and switched the tests to use an
array of structs rather than a plain array, avoiding the need to do a bunch
of magic math to get the correct array indexes. I also added an indication
to the test cases in divmnu64.c of which cases are the expected errors,
allowing it to handle them better.

I implemented a WIP version of Goldschmidt division for the 128x64 ->
64-bit division instructions. I haven't yet figured out the correct values
for the additional internal precision needed, but I found a paper
describing how to calculate them... they showed that the parameters used
for AMD's K8 were too loose, meaning the K8 has more internal precision
than needed, wasting gates.

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=cc2c51cf2cd7b22f7d18871fbfb73e0b7707d91a

The paper I found:
Even, G., Seidel, P. M., & Ferguson, W. E. (2003). A Parametric Error
Analysis of Goldschmidt's Division Algorithm.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.1238&rep=rep1&type=pdf
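For readers unfamiliar with the iteration: Goldschmidt division multiplies numerator and denominator by a common factor f = 2 - d each step, driving d toward 1 so the numerator converges to the quotient. A minimal double-precision sketch, illustrative only -- the hardware version works in fixed point, and choosing its internal precision is exactly what the paper above analyses; `goldschmidt_div` is a hypothetical name:

```c
#include <assert.h>
#include <math.h>

// Goldschmidt division sketch in double precision: normalize d into
// [0.5, 1), then repeatedly multiply both n and d by f = 2 - d.
// d converges quadratically to 1, so n converges to the quotient.
static double goldschmidt_div(double n, double d)
{
    assert(d > 0.0);
    int e;
    d = frexp(d, &e);   // d now in [0.5, 1)
    n = ldexp(n, -e);   // scale n by the same power of two
    for (int i = 0; i < 6; i++) { // 6 iterations: error below 1 ulp
        double f = 2.0 - d;
        n *= f;
        d *= f;
    }
    return n;
}
```

The quadratic convergence is visible in the error of d: 0.5, 0.25, 0.0625, 0.004, 1.5e-5, 2.3e-10, ... so very few iterations are needed.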

(In reply to Jacob Lifshay from comment #7)
> I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing
> a bug in mulmnu.c.

nice.

> I also reformatted them and switched the tests to use an array of structs
> rather than a plain array -- avoiding the need to do a bunch of magic math
> to get the correct array indexes.

:)

> I also added an indication to the test cases in divmnu64.c so that it knows
> which cases are the expected errors allowing it to handle them better.

yes, it was a little odd format.

    245     io_width: int
    246     """bit-width of the input divisor and the result.
    247     the input numerator is `2 * io_width`-bits wide.
    248     """

remember to either split these by newline, or:

* shorten the parameter name (abbreviate words: wid not width)
* use a semicolon
* use single-quotes (""" is not a hard requirement)
* remove the unnecessary type information
* use fewer words ("the" is the first casualty)

the priority is to get these below 80 chars and to get them compact.
extra words *do not* increase clarity (just as, ironically, not *enough*
causes ambiguity). example:

    io_wid; "input and result bit-width. numerator is 2xio_wid"

it's a tricky balance.

oink moment: why are we not doing Big-Integer Goldschmidt? :)
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html

https://www.researchgate.net/publication/44384708_Improved_algorithms_and_hardware_designs_for_division_by_convergence

apparently larger multipliers are required

(In reply to Luke Kenneth Casson Leighton from comment #11)
> oink moment, why are we not doing Big-Integer Goldschmidt? :)
>
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html

because goldschmidt's algorithm is much slower at small bigint sizes (small
enough that the fastest multiplication algorithm is O(N^2)): for a
2*N x N -> N word division, Algorithm D does approximately 2*N*N word
operations, while Goldschmidt Division does around log2(N)*8*N*N word
operations.

When bigints are bigger, there are software algorithms faster than either
Algorithm D or Goldschmidt Division, such as divide-and-conquer algorithms
(good for medium-length bigints) and Block-Wise Barrett Division (good for
large bigints).

There's also the important case of a bigint divided by a single word, where
the algorithm is basically:

    uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
    uint64_t carry = 0;
    for (i = n_size; i > 0; i--)
    {
        uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
        carry = v % d;
        n[i - 1] = v / d;
    }
    // n is the quotient, carry is the remainder

see https://gmplib.org/manual/Division-Algorithms
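The single-word loop above can be made self-contained roughly as follows, assuming GCC/Clang's `unsigned __int128`; `bigint_divmod_word` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t; // GCC/Clang extension

// Divide a little-endian bigint n (most-significant word in
// n[n_size - 1]) by a single word d, in place; return the remainder.
// Same loop as in the comment, with types filled in.
static uint64_t bigint_divmod_word(uint64_t *n, int n_size, uint64_t d)
{
    assert(d != 0);
    uint64_t carry = 0;
    for (int i = n_size; i > 0; i--) {
        uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
        carry = (uint64_t)(v % d);      // remainder feeds the next word
        n[i - 1] = (uint64_t)(v / d);   // quotient word written in place
    }
    return carry; // n now holds the quotient, carry is the remainder
}
```

e.g. dividing {5, 1} (i.e. 2^64 + 5) by 3 leaves the quotient in n and returns remainder 0.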

Replying to:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
> a new 128/64 scalar divide instruction would be
>
>     dividend[0:(XLEN*2)-1] <- (RA) || (RS)
>
> where RS=RA+1 just like lq and stq.

no... it should instead be:

    divrem2du rt, ra, rb, rc

    v = (ra << 64) | rb;
    d = rc; // copy first in case rt is rc
    rt = UDIV(v, rc);
    ra = UREM(v, rc);

because that's needed for bigint-by-word division:

    uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
    uint64_t carry = 0;
    for (i = n_size; i > 0; i--)
    {
        uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
        carry = v % d;
        n[i - 1] = v / d;
    }
    // n is the quotient, carry is the remainder

which becomes:

    li carry, 0
    set_remap_reverse
    sv.divrem2du n.v, carry.s, n.v, d.s
    // n is the quotient, carry is the remainder

we'd probably also want a divrem2d for signed div/rem, since that's also
needed for faster 128-bit arithmetic (which is likely way more common than
bigints)

(In reply to Jacob Lifshay from comment #14)
> no...it should instead be:
>     divrem2du rt, ra, rb, rc
>
>     v = (ra << 64) | rb;
>     d = rc; // copy first in case rt is rc
>     rt = UDIV(v, rc);
>     ra = UREM(v, rc);

whoops, i meant:

    rt = UDIV(v, d);
    ra = UREM(v, d);

(In reply to Jacob Lifshay from comment #14)
> Replying to:
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
> > a new 128/64 scalar divide instruction would be
> >
> >     dividend[0:(XLEN*2)-1] <- (RA) || (RS)
> >
> > where RS=RA+1 just like lq and stq.
>
> no...it should instead be:
>     divrem2du rt, ra, rb, rc

that's a no on any more VA-Form 4-operand math operations, and a hard-no
because there's no precedent to follow. madd* etc. set the precedent
already, so that just barely justifies one additional instruction.

following lq/stq is sort-of-fine [i may have got the dividend the wrong
way round: (RS) || (RA) rather than (RA) || (RS)].

> we'd probably also want a divrem2d for signed div/rem since that's also
> needed for faster 128-bit arithmetic (which is likely way more common than
> bigints)

again: hard no. there does not exist a divrem scalar instruction, therefore
justifying the addition of four-plus instructions when the space in EXT04
is so precious is not something i am comfortable - at all - putting in
front of the OPF ISA WG.

when there is opcode pressure like this we simply cannot just blithely and
arbitrarily propose instructions, especially when the other options -
big-integer goldschmidt - have not been explored.

i made a start here:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divgnu64.c;hb=HEAD

(In reply to Luke Kenneth Casson Leighton from comment #16)
> that's a no on any more VA-Form 4-operand math operations, and a hard-no
> because there's no precedent to follow.

there's sorta precedent on other architectures: the x86 div and idiv
instructions do 128 by 64-bit div/rem, and sparc has 64 by 32-bit divide
instructions.

> following lq/stq is sort-of-fine [i may have got the dividend the wrong
> way round, (RS) || (RA) rather than (RA) || (RS).

no, because that bigint-by-word division needs the high half of the
numerator to be scalar and the low half to be vector. having an implicit
high half means they have to be both scalar or both vector -- unusable to
do division with a single sv.divrem2du.

> there does not exist a divrem scalar instruction therefore justifying
> the addition of four+

two: divrem2du and divrem2d.

> instructions when the space in EXT04 is so precious
> is not something i am comfortable - at all - putting in front of the
> OPF ISA WG.
>
> especially when the other options - big-integer goldschmidt - have not
> been explored.

i already explained that, for smaller bigints (around 8 words), big-integer
goldschmidt needs like 30x as many word operations as algorithm d.
algorithm d is necessary.

> i made a start here:
> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divgnu64.c;hb=HEAD

(In reply to Jacob Lifshay from comment #14)
> because that's needed for bigint by word division:
>     uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
>     uint64_t carry = 0;
>     for (i = n_size; i > 0; i--)
>     {
>         uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
>         carry = v % d;
>         n[i - 1] = v / d;
>     }
>     // n is the quotient, carry is the remainder

oink. argh argh damnit. missed this. that's going to put huge pressure on
those 4 remaining VA-Form slots in EXT04. how can that be justified to the
OPF ISA WG. ok let me think it through.

* v is 128-bit, constructed from 2 64-bit registers
* output1 is the division result
* output2 is modulo of divisor

by careful selection, the same RC trick applies as to bigmul (madded).
by carefull-er selection it can also do scalar 128/64 div and/or mod, which
also makes it useful for Algorithm D. if RC contains the carry, and RC=0 as
well as RT=0 is allowed, then selection of whether it *only* performs div
or only mod is possible. by having an astounding amount of options i think
we have a justification for this precious opcode space.

let's see... adapt to swap RA and RC to make it similar to madded, add
zero conditional checking:

    divrem2du rt, ra, rb, rc

    v = (rc << 64) | rb;
    d = ra; // copy first in case rt is ra
    if rt != 0 then rt = UDIV(v, d);
    if RS != 0 then RS = UREM(v, d);

same types of modes:

* RS=RT+VL mode when EXTRA bit 8 is 0 (default for scalar)
* RS=RC mode when EXTRA bit 8 is 1

damnit, signalling when not to write the remainder is difficult because RS
is implicit and can be a source. how about:

    if rt != ra then rt = UDIV(v, d);
    if RT != 0 then RS = UREM(v, d);

messy. if there were not 4 slots remaining in the EXT04 VA-Form i'd say use
1 bit to decide if mod is written, and use RT!=0 for div result, but that
takes the *entire* space if divrem2ds is added as well. tough call.

(In reply to Jacob Lifshay from comment #17)
> > following lq/stq is sort-of-fine [i may have got the dividend the wrong
> > way round, (RS) || (RA) rather than (RA) || (RS).
>
> no, because that bigint by word division needs the high half of the
> numerator to be scalar and the low half to be vector. having an implicit
> high half means they have to be both scalar or both vector--unusable to do
> division with a single sv.divrem2du.

yep, got it now. was still thinking in terms of the (early) mulx: i hadn't
twigged about the feedback-loop option just like in madded.

> two: divrem2du and divrem2d.

times two for one bit to select if mod is written. have to think that
through: can it be achieved with RA or RT zero to stop div or rem being
stored?

> i already explained that, for smaller bigints (around 8 words), big-integer
> goldschmidt needs like 30x as many word operations as algorithm d.
> algorithm d is necessary.

yeah, i'm catching up. went to bugzilla and saw 2 posts, my brain was in
linux-microwatt mode. i'll need to do a mental walkthrough, and also it
would be useful to split out the functions in divmnu64.c: you can see i
created some functions bigmul bigsub bigadd in divgnu64.c

ok, i updated the analysis page. no conditional "if RT != 0", because i
realised that if we're going to do that, we might as well use X-Form
RT,RA,RB with implied RS=RA+VL, do dividend = (RA) || (RS), and have divqd
modqd for 128/64. that reduces pressure on EXT04.

now, the question is: does signed-div make any sense? it didn't in big-mul
because a corrective factor can be done after.

(In reply to Luke Kenneth Casson Leighton from comment #20)
> now, question is, does signed-div make any sense? it didn't
> in big-mul because a corrective factor can be done after.

line 34:
https://github.com/hcs0/Hackers-Delight/blob/master/mulmns.c.txt
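For division, the "corrective factor afterwards" idea looks something like this sketch: divide magnitudes unsigned, then patch the signs (truncating semantics, matching C). `signed_divmod` is a hypothetical name, and the INT64_MIN / -1 overflow case is deliberately not handled:

```c
#include <assert.h>
#include <stdint.h>

// Signed div/rem built from unsigned div/rem plus sign correction:
// quotient is negated when operand signs differ; remainder takes the
// sign of the dividend (truncating division, as in C).
// Note: d == 0 and the INT64_MIN / -1 overflow are not handled here.
static void signed_divmod(int64_t n, int64_t d, int64_t *q, int64_t *r)
{
    assert(d != 0);
    uint64_t un = n < 0 ? -(uint64_t)n : (uint64_t)n;
    uint64_t ud = d < 0 ? -(uint64_t)d : (uint64_t)d;
    uint64_t uq = un / ud;
    uint64_t ur = un % ud;
    *q = ((n < 0) != (d < 0)) ? -(int64_t)uq : (int64_t)uq;
    *r = (n < 0) ? -(int64_t)ur : (int64_t)ur;
}
```

e.g. -7 / 2 gives quotient -3, remainder -1, matching C's `/` and `%`.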

(In reply to Luke Kenneth Casson Leighton from comment #20)
> now, question is, does signed-div make any sense?

I think it does, because I'm pretty sure it can be used to make a faster
128-bit signed integer div/rem algorithm. I couldn't find any decent
examples, since I think everyone either implements it in C in terms of u128
div/rem implemented in C (aka slow and inefficient, because C doesn't have
128 by 64-bit div), or implements it in terms of an assembly u128 div
algorithm with a comment that the algorithm should be optimized more...

(In reply to Luke Kenneth Casson Leighton from comment #20)
> ok i updated the analysis page,
>
>     `divrem2du RT,RA,RB,RC`
>
>     divisor = (RC) || (RB)
>     dividend = EXTZ128(RA)
>     RT = UDIV(dividend, divisor)
>     RS = UREM(dividend, divisor)
>
> Again, in an SVP64 context, using EXTRA mode bit 8 allows for
> selecting whether `RS=RC` or `RS=RT+VL`.

looks good!

how about inverting un and vn if -ve, then correcting if the signs are not
equal? inversion of a bigint can be done with RB=scalar-zero, subtract RA
as vector. in the case where n=1, inversion of scalar v is easy. leaves
overflow etc. to deal with, but not a big deal. am all for avoiding adding
unnecessary ops here.

(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> >     divisor = (RC) || (RB)
> >     dividend = EXTZ128(RA)
> >     RT = UDIV(dividend, divisor)
> >     RS = UREM(dividend, divisor)
> looks good!

detecting the overflow conditions needs adapting. in fixedarith, divdeu is:

    dividend[0:(XLEN*2)-1] <- (RA) || [0]*XLEN
    divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
    result <- dividend / divisor
    if (RA) <u (RB) then
        RT <- result[XLEN:(XLEN*2)-1]
        overflow <- 0

is it as simple as this:

    dividend = (RC) || (RB)
    divisor = EXTZ128(RA)
    if (((RB) = 0) & ((RC) <u (RA)) |
        ((RB) != 0) & ((RC)+1 <u (RA))) then
        RT = UDIV(dividend, divisor)
        RS = UREM(dividend, divisor)
        overflow = 0

or is it necessary to do this (a 192-bit compare):

    if ([0]*XLEN || dividend) <u (divisor || [0]*XLEN)

it's looking for the scenario where the result cannot fit into 64 bits,
right?

(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Jacob Lifshay from comment #23)
> > (In reply to Luke Kenneth Casson Leighton from comment #20)
> > >     divisor = (RC) || (RB)
> > >     dividend = EXTZ128(RA)
> > >     RT = UDIV(dividend, divisor)
> > >     RS = UREM(dividend, divisor)
> > looks good!

actually, that's *totally broken*... I just now noticed. you want the
dividend to be (RC) || (RB), not the divisor. they're accidentally swapped.

> detecting the overflow conditions need adapting,
> in fixedarith divdeu it is:
> it's looking for the scenario where the result
> cannot fit into 64 bit, right?

yeah. I'd just write it as:

    `divrem2du RT,RA,RB,RC`

    dividend = (RC) || (RB)
    divisor = (RA)
    if divisor == 0 or (RC) >= divisor:
        overflow = 1
        RT = 0xFFFF_FFFF_FFFF_FFFF # saturate .. makes some stuff nicer
        RS = 0 # pick a value that won't be bigger than RC
    else:
        RT = UDIV(dividend, EXTZ128(divisor))
        RS = UREM(dividend, EXTZ128(divisor))

(In reply to Luke Kenneth Casson Leighton from comment #25)
> is it as simple as this:
>
>     dividend = (RC) || (RB)
>     divisor = EXTZ128(RA)
>     if (((RB) = 0) & ((RC) <u (RA)) |
>         ((RB) != 0) & ((RC)+1 <u (RA))) then

the reason you only need to compare RC with RA, and don't need to compare
RB at all, is: when RC == RA - 1, all values of RB will still cause the
division to round down to fitting in 64 bits. when RC == RA, RB == 0 is
still big enough to overflow, and RB > 0 is even bigger and will definitely
overflow.
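That claim is small enough to verify by brute force at a reduced width. A sketch that exhaustively checks, at 16-by-8-bit scale, that the quotient fits in one word exactly when the high half is less than the divisor, regardless of the low half (`check_overflow_rule` is a hypothetical name):

```c
#include <stdint.h>

// Exhaustively verify the overflow rule at 16/8-bit scale:
// the quotient of ((hi << 8) | lo) / d fits in 8 bits iff hi < d.
// Returns 1 if the rule holds for every (hi, lo, d) with d != 0.
static int check_overflow_rule(void)
{
    for (unsigned d = 1; d < 256; d++)
        for (unsigned hi = 0; hi < 256; hi++)
            for (unsigned lo = 0; lo < 256; lo++) {
                unsigned q = ((hi << 8) | lo) / d;
                int fits = q < 256;       // quotient fits in one word?
                if (fits != (hi < d))     // rule: only hi vs d matters
                    return 0;
            }
    return 1;
}
```

The argument in the comment above is exactly why this holds: with hi <= d - 1 the dividend is at most d*256 - 1, so the quotient is at most 255; with hi >= d the dividend is at least d*256, so the quotient is at least 256.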

yep, corrected the pseudocode -- spotted late at night, compared against
divdeu, updated. RA <u RB makes sense now: RA || ffff..ff / RB still fits
(not 0x1_0000_0000).

no on non-orthogonal behaviour on overflow: UNDEFINED, and it has to be
that way. otherwise a separate RFC has to be made to get *all other* div
ops to be so defined (ffffff), and it's too late for that. bit of a pain.

    128     // Compute estimate qhat of q[j] from top 2 digits
    129     uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
    130     qhat = dig2 / vn[n - 1];
    131     rhat = dig2 % vn[n - 1];

ehm.... ehmehmehm... qhat and rhat are uint64_t (and un/vn 32-bit*),
therefore when un/vn are uint64_t[], qhat and rhat must be uint128_t, which
is NOT happening: we are NOT going to return QTY 4 64-bit registers from
divmod2du.

thus even on scalar usage, if un[j+n] > vn[n-1], an overflow on the 128/64
divide will occur.

any clues?

(In reply to Luke Kenneth Casson Leighton from comment #29)
> thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64
> divide will occur.
>
> any clues?

working it out (and this is for the single-digit-divisor case):

    98      k = 0;                        // the case of a
    99      for (j = m - 1; j >= 0; j--)
    100     {                             // single-digit
    101         uint64_t dig2 = (k * b + u[j]);
    102         q[j] = dig2 / v[0];       // divisor here.
    103         k = dig2 % v[0];          // modulo back into next loop
    104     }

* on first entry, k=0 and v[0] must be >0, otherwise div-overflow
  - therefore RA (0) < RB (v[0]) and we're good
  - and RS = modulo v[0], which is *ALWAYS*, by definition, between
    0 and v[0]-1
* on second entry, therefore, k will ALWAYS be between 0 and v[0]-1
  - therefore RA will ALWAYS be below RB (v[0])

so this is good for the scalar-vector case. but for that qhat estimate it
goes pearshaped... with the code as-written. if instead divmnu64.c calls
*itself* (because the division estimate is to be on the MSB of v), then it
should end up using k=0 at lines 98-104 and all will be well.

ha. simple. this:

    128     // Compute estimate qhat of q[j] from top 2 digits
    129     uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
    130     qhat = dig2 / vn[n - 1];
    131     rhat = dig2 % vn[n - 1];

becomes a 2-long unrolling of the loop at line 98, like this:

    uint64_t dig1 = (uint64_t)un[j + n];
    qhat = dig1 / vn[n - 1];
    rhat = dig1 % vn[n - 1];
    uint64_t dig2 = (rhat << 32) | un[j + n - 1];
    qhat = (qhat << 32) | (dig2 / vn[n - 1]);
    rhat = dig2 % vn[n - 1];

(note the second rhat is just dig2 % vn[n - 1]: the previous remainder has
already been folded into dig2, so shifting it in again would double-count.)

(translate to 128/64 rather than 64/32, of course).
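Translated to 128/64 as the parenthetical suggests, the unrolling might look like this sketch, assuming GCC/Clang's `unsigned __int128`; `qhat_estimate` is a hypothetical helper name. Each step is a 128-by-64 divide whose quotient fits in 64 bits, because the running remainder is always below vn[n-1]:

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t; // GCC/Clang extension

// Two-step estimate of qhat from the top digits of un, dividing by
// the most-significant divisor digit vn[n-1]. The remainder r1 of
// the first step is < vn[n-1], so the second dividend dig2 is
// < vn[n-1] << 64 and its quotient fits in 64 bits.
static void qhat_estimate(const uint64_t *un, const uint64_t *vn,
                          int j, int n,
                          uint128_t *qhat, uint64_t *rhat)
{
    uint64_t dig1 = un[j + n];
    uint64_t q1 = dig1 / vn[n - 1];
    uint64_t r1 = dig1 % vn[n - 1];
    uint128_t dig2 = ((uint128_t)r1 << 64) | un[j + n - 1];
    *qhat = ((uint128_t)q1 << 64) | (uint64_t)(dig2 / vn[n - 1]);
    *rhat = (uint64_t)(dig2 % vn[n - 1]);
}
```

The identity qhat * vn[n-1] + rhat == (un[j+n] << 64) | un[j+n-1] holds by construction, which is an easy way to test it.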

(In reply to Luke Kenneth Casson Leighton from comment #29)
> thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64
> divide will occur.
>
> any clues?

yes... knuth's original algorithm is written in MIX assembly, where the
operation used is a 2-word by 1-word div/rem in which both quotient and
remainder are 1 word. the only case where that can overflow is where
un[j + n] == vn[n - 1], in which case it sets qhat to the max value that
fits in a single word (in the 64-bit case, 2**64-1). the initial
normalization step creates the invariant that un[j + n] <= vn[n - 1], and
that invariant is maintained by the loop.

the c code just has qhat and rhat be 2-word values to prevent UB when the
overflow occurs, and also for convenience, to avoid needing to constantly
cast them to 2-word values. they can totally be 1-word values, at the cost
of putting the initial qhat computation inside an if that checks for
overflow.

original algorithm step:

    D3. [Calculate q̂.] If uⱼ = v₁, set q̂ ← b - 1; otherwise set
    q̂ ← ⌊(uⱼb + uⱼ₊₁)/v₁⌋. Now test if v₂q̂ > (uⱼb + uⱼ₊₁ - q̂v₁)b + uⱼ₊₂;
    if so, decrease q̂ by 1 and repeat this test. (The latter test
    determines at high speed most of the cases in which the trial value q̂
    is one too large, and it eliminates all cases where q̂ is two too
    large; see exercises 19, 20, 21.)

in ascii:

    D3. [Calculate qhat.] If u[j] = v[1], set qhat <- b - 1; otherwise set
    qhat <- floor((u[j]*b + u[j+1])/v[1]). Now test if
    v[2]*qhat > (u[j]*b + u[j+1] - qhat*v[1])*b + u[j+2]; if so, decrease
    qhat by 1 and repeat this test. (The latter test determines at high
    speed most of the cases in which the trial value qhat is one too
    large, and it eliminates all cases where qhat is two too large; see
    exercises 19, 20, 21.)
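Step D3 can be sketched in C with 64-bit digits (b = 2**64). `knuth_d3` is a hypothetical name; digit indexing follows Knuth's notation as quoted (v[1] is the most-significant divisor digit), the saturation branch handles the u[j] = v[1] case, and the loop terminates because (after normalization) qhat is decreased at most twice:

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t; // GCC/Clang extension

// Knuth step D3 with word size b = 2**64: compute the trial quotient
// qhat from the top two dividend digits, saturating to b-1 when
// uj == v1, then apply the correction test against the third digit.
static uint64_t knuth_d3(uint64_t uj, uint64_t uj1, uint64_t uj2,
                         uint64_t v1, uint64_t v2)
{
    uint64_t qhat, rhat;
    if (uj == v1) {
        qhat = UINT64_MAX;     // b - 1
        rhat = uj1 + v1;       // uj*b + uj1 - qhat*v1, may exceed b
        if (rhat < v1)         // rhat >= b: correction test must fail
            return qhat;
    } else {
        uint128_t num = ((uint128_t)uj << 64) | uj1;
        qhat = (uint64_t)(num / v1);
        rhat = (uint64_t)(num % v1);
    }
    // decrease qhat while v2*qhat > rhat*b + uj2 (at most twice
    // when v1 is normalized, i.e. has its top bit set)
    while ((uint128_t)v2 * qhat > ((((uint128_t)rhat) << 64) | uj2)) {
        qhat--;
        rhat += v1;
        if (rhat < v1)         // rhat grew past b: stop, test fails
            break;
    }
    return qhat;
}
```

e.g. with no correction needed, dividing the two digits (0, 100) by v1 = 7 gives qhat = 14; the saturating branch fires when the top dividend digit equals v1.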

(In reply to Jacob Lifshay from comment #32)
> yes...knuth's original algorithm is written in mix assembly where the
> operation used is a 2-word by 1-word div/rem where both quotient and
> remainder are 1 word. the only case where that can overflow is where
> un[j + n] == vn[n - 1] in which case it sets qhat to the max value that
> fits in a single word (in the 64-bit case, 2**64-1).

hence the idea of setting to all 1s. i think... given this is a new type of
instruction (dual divmod), we can get away with that. elwidth overrides
make it useful for other data widths.

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=b6044fdb84bea61887999ffa5e17543fa91b8f89

good, whew. i have to explain all this :)

caveat here: divmod2du communicating "overflow", that's... expensive.
CR0 and/or OE=1. OE=1 is not going to fit into VA-Form EXT04. each is an
extra register. making the operation always "divmod2du." would do the job,
but, dang, that's 3-in, 3-out.

(In reply to Luke Kenneth Casson Leighton from comment #34)
> caveat here: divmod2du communicating "overflow", that's... expensive.
> CR0 and/or OE=1. OE=1 is not going to fit into VA-Form EXT04.
> each is an extra register.

we'd want writing to CR0, since it saves an instruction... otherwise you'd
need the move-OV-to-CR0 instruction before the conditional branch.

> making the operation always "divmod2du." would do the job, but, dang,
> that's 3-in, 3-out.

well... is 3-in 3-out more expensive than needing an extra compare
instruction in an important use of divmodqdu? i guess we could, instead of
writing CR0 or OV, just have it compare like so:

    cmpl n_hi, d
    divmodqdu qhat, d, n_lo, n_hi

the cmpl takes extra resources... in a non-superscalar system it takes an
extra clock cycle. we can't put it after the divmodqdu (which would save a
clock cycle on non-superscalar systems) since n_hi is overwritten, though
otoh we likely need to have another copy of n_hi in a register anyway, and
cmpl could just use that.

(In reply to Jacob Lifshay from comment #35)
> well...is 3-in 3-out more expensive than needing an extra compare
> instruction in an important use of divmodqdu?

hell yes. that's 6 FFs inside an entire Dependency Matrix Row.

> i guess we could, instead of writing CR0 or OV, just have it compare like so:
>     cmpl n_hi, d
>     divmodqdu qhat, d, n_lo, n_hi

much better / sane

See also: LLVM adding big-int division (fixed-length ints only) to its
compiler-builtins library, allowing LLVM to correctly compile all int types
(even 16384-bit).
https://reviews.llvm.org/D120327

Mentioned our work on Knuth's Algorithm D over on LLVM Phabricator, where
they're currently implementing big-int division:
https://reviews.llvm.org/D120327#3493233

I have been helping review that merge request -- they originally had
Algorithm D but switched to a slower, simpler algorithm. I convinced them
that the library API should support replacing the algorithm with Algorithm
D again, which requires passing in sufficiently-sized buffers and allowing
the inputs to be modified.

just realised that another variant of madded, call it say maddd, could be
used for high-accuracy dot product (bug #555). it would be:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = (RC || [0]*64) + prod
    RT <- sum[64:127]
    RS <- sum[0:63] # RS is either RC or RT+VL

but the question is: is it worth doing in the integer space? in FP, using
2 regs to store the intermediate result (FP128), yes. but integer?
honestly i have no idea.

(In reply to Luke Kenneth Casson Leighton from comment #39)
> just realised that another variant of madded, call it say maddd, could be
> used for high accuracy dotproduct (bug #555).
>
> it would be:
>
>     prod[0:127] = (RA) * (RB)
>     sum[0:127] = (RC || [0]*64) + prod
>     RT <- sum[64:127]
>     RS <- sum[0:63] # RS is either RC or RT+VL

that isn't what you want... you want the existing maddld instruction, but
extended to have rt and rc be 128-bit register pairs.

> but the question is: is it worth doing in the integer space?

imho no. the existing integer maddld already returns exact results for dot
product, mod 2^64. the only way we might want a different instruction is
for 64x64->128-bit dot product, which would be a 4-in 2-out instruction,
which imho is too many regs; otherwise we'd also be ok with
bigint*word+bigint muladd, which is a 4-in 2-out op that you rejected
previously.

> in FP, using 2 regs to store the intermediate result (FP128),

this has the exact same problem as bigint*word+bigint muladd, which is a
4-in 2-out op... admittedly it has a better motivation than integer dot
product though.

another alternative for fp dot product with higher precision is just doing
a vector fmul followed by a tree-reduction fadd, which retains most of the
precision due to values being rounded O(log N) times rather than O(N)
times from a serial reduction.
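The tree-reduction idea, in scalar form, is just pairwise summation: each input passes through O(log N) additions instead of O(N), so worst-case rounding error grows with log N rather than N. A minimal sketch (`tree_sum` is a hypothetical name):

```c
#include <stddef.h>

// Pairwise (tree) reduction sum: split in half, sum each half
// recursively, add the two partial sums. Each element participates
// in O(log N) additions, bounding accumulated rounding error.
static double tree_sum(const double *x, size_t n)
{
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return tree_sum(x, half) + tree_sum(x + half, n - half);
}
```

A vectorised version would do the fmuls first, then fold VL in half repeatedly with fadds, which is the O(log N) reduction described above.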

(In reply to Jacob Lifshay from comment #40)
> that isn't what you want...you want the existing maddld instruction, but
> extended to have rt and rc be 128-bit register pairs.

damn. which is 4-in, 2-out. 6 64-bit register hazards is far too much.

> another alternative for fp dot product with higher precision is just doing a
> vector fmul followed by a tree reduction fadd, which retains most of the
> precision due to values being rounded O(log N) times rather than O(N) times
> from a serial reduction.

like it.

apparently Rust is getting bigint helper functions too.

tracking issue: https://github.com/rust-lang/rust/issues/85532

public API:

> // On unsigned integers:
>
> /// `self + rhs + carry` (full adder)
> fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);
>
> /// `self - rhs - carry` (full "subtractor")
> fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);
>
> /// `self * rhs + carry` (multiply-accumulate)
> fn carrying_mul(self, rhs: Self, carry: Self) -> (Self, Self);
>
> /// `self * rhs` (wide multiplication, same as `self.carrying_mul(rhs, 0)`)
> fn widening_mul(self, rhs: Self) -> (Self, Self);
>
> // On signed integers:
>
> /// `self + rhs + carry` (full adder)
> fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);
>
> /// `self - rhs - carry` (full "subtractor")
> fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);
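Of these, `carrying_mul` is the one that maps most directly onto the madded discussion earlier in this thread (low word out, high word as the new carry). A C sketch of the unsigned 64-bit case, assuming GCC/Clang's `unsigned __int128`; `carrying_mul_u64` is a hypothetical name:

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t; // GCC/Clang extension

// self * rhs + carry, returning the low word and writing the high
// word to *carry_out. Cannot overflow 128 bits, because
// (2^64-1)^2 + (2^64-1) = 2^128 - 2^64 < 2^128.
static uint64_t carrying_mul_u64(uint64_t self, uint64_t rhs,
                                 uint64_t carry, uint64_t *carry_out)
{
    uint128_t wide = (uint128_t)self * rhs + carry;
    *carry_out = (uint64_t)(wide >> 64);
    return (uint64_t)wide;
}
```

Chained across the words of a bigint, this is exactly the vector-times-scalar multiply pattern the madded proposal is designed to accelerate.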