Bug 817 - Big Integer Math (sv.adde, sv.subfe, sv.madded, 128 by 64-bit -> 64-bit div/rem, maybe more...)
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/bi...
Depends on: 937
Blocks: 589 771 772
Reported: 2022-04-21 20:01 BST by Jacob Lifshay
Modified: 2022-09-25 21:32 BST
3 users

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2022-04-21 20:01:55 BST
Couldn't find a bug for this so made one.

https://libre-soc.org/openpower/sv/biginteger/

Giant Discussion on mailing list:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004761.html

Video lkcl made on draft version:
https://youtu.be/8hrIG7-E77o

My comments on that video (basically looks good except that division uses sv.madded and sv.subfe not sv.msubed and sv.subfe):
https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:38:17
Comment 1 Luke Kenneth Casson Leighton 2022-04-21 21:01:05 BST
(In reply to Jacob Lifshay from comment #0)
> Couldn't find a bug for this so made one.

ta. forgot.
 
> My comments on that video (basically looks good except that division uses
> sv.madded and sv.subfe not sv.msubed and sv.subfe):
> https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:
> 38:17

msubed and madded are *not* the same as what was initially
proposed (that was maddx and msubx).

msubed and madded are the *new* idea (you came up with) but the
divmnu.c source has *not* been updated.

i easily worked out the mul case, updated mulmnu.c to match:
div i can't handle.
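For reference, a C model of the proposed madded semantics as described on the biginteger page (helper name and calling convention are mine; uses the gcc/clang __int128 extension): RC supplies the 64-bit carry-in and receives the high half of the 128-bit product-plus-carry, so chaining this across a vector gives bigint-by-word multiply.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t; /* gcc/clang extension */

/* sketch of the proposed madded: returns low 64 bits of a*b + *carry,
 * writes the high 64 bits back into *carry. a*b + c can never overflow
 * 128 bits: (2^64-1)^2 + (2^64-1) = 2^128 - 2^64. */
static uint64_t madded(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint128_t sum = (uint128_t)a * b + *carry;
    *carry = (uint64_t)(sum >> 64);
    return (uint64_t)sum;
}
```

Looping this over a[] with scalar b is exactly the inner loop of the updated mulmnu.c, widened to 64-bit digits.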
Comment 2 Jacob Lifshay 2022-04-21 21:04:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
> > My comments on that video (basically looks good except that division uses
> > sv.madded and sv.subfe not sv.msubed and sv.subfe):
> > https://libre-soc.org/irclog/%23libre-soc.2022-04-21.log.html#t2022-04-21T19:
> > 38:17
> 
> msubed and madded are *not* the same as what was initially
> proposed (that was maddx and msubx).

yup.

> 
> msubed and madded are the *new* idea (you came up with) but the
> divmnu.c source has *not* been updated.

yeah, as stated on irc, sorry i haven't yet fixed it.

> 
> i easily worked out the mul case, updated mulmnu.c to match:
> div i can't handle.

yup.
Comment 3 Jacob Lifshay 2022-04-21 21:09:00 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > msubed and madded are *not* the same as what was initially
> > proposed (that was maddx and msubx).
> 
> yup.
> 
> > 
> > msubed and madded are the *new* idea (you came up with) but the
> > divmnu.c source has *not* been updated.
> 
> yeah, as stated on irc, sorry i haven't yet fixed it.

except that i only came up with madded, msubed is something that you apparently interpolated out of msubx and madded -- msubed isn't needed for the div algorithm since the required negation is already done by subfe.
Comment 4 Luke Kenneth Casson Leighton 2022-04-21 21:16:31 BST
(In reply to Jacob Lifshay from comment #3)

> > yeah, as stated on irc, sorry i haven't yet fixed it.

no problem. haven't been upstairs yet to acknowledge, sorry.

> except that i only came up with madded, msubed is something that you
> apparently interpolated out of msubx and madded 

it seemed logical at the time! but if not needed that's great, one less
instruction and it entirely frees up the right half of EXT04 space.

    maddld maddhd MADDED maddhdu -- -- -- --

> -- msubed isn't needed for
> the div algorithm since the required negation is already done by subfe.

nggggh ok makes sense.

mul... mul... arg yes of course, RC is used as the 64-bit carry on the
*product*, nothing to do with the partial result, yes, that goes into
subfe.

so of course no mul-subtract.

rright, some hatchet-jobs and writeup to do...
Comment 6 Jacob Lifshay 2022-04-21 23:27:30 BST
see also:
https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm

another thing worth researching is: can goldschmidt's division algorithm be modified to work for polynomials over gf(2) (carry-less division/remainder)?
Comment 7 Jacob Lifshay 2022-04-22 04:17:43 BST
(In reply to Jacob Lifshay from comment #2)
> yeah, as stated on irc, sorry i haven't yet fixed it.

I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing a bug in mulmnu.c.

I also formatted them and switched the tests to use an array of structs rather than a plain array -- avoiding the need to do a bunch of magic math to get the correct array indexes.

I also added an indication to the test cases in divmnu64.c so that it knows which cases are expected errors, allowing it to handle them better.
Comment 8 Jacob Lifshay 2022-04-22 09:06:39 BST
I implemented a WIP version of goldschmidt division, for the 128x64->64-bit division instructions. I didn't yet figure out the correct values for the additional internal precision needed, but I found a paper describing how to calculate them ... they showed that the parameters used for AMD's K8 were too loose, meaning the K8 has more internal precision than needed, wasting gates.

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=cc2c51cf2cd7b22f7d18871fbfb73e0b7707d91a

The paper I found:
Even, G., Seidel, P. M., & Ferguson, W. E. (2003).
A Parametric Error Analysis of Goldschmidt's Division Algorithm.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.1238&rep=rep1&type=pdf
Comment 9 Luke Kenneth Casson Leighton 2022-04-22 09:07:05 BST
(In reply to Jacob Lifshay from comment #7)
 
> I updated divmnu64.c to include the MADDED_SUBFE variant, as well as fixing
> a bug in mulmnu.c.

nice.

> I also formatting them and switched the tests to use an array of structs
> rather than a plain array -- avoiding the need to do a bunch of magic math
> to get the correct array indexes.

:)
 
> I also added an indication to the test cases in divmnu64.c so that it knows
> which cases are the expected errors allowing it to handle them better.

yes it was a little odd format.
Comment 10 Luke Kenneth Casson Leighton 2022-04-24 04:14:50 BST
 245     io_width: int
 246     """bit-width of the input divisor and the result.
 247     the input numerator is `2 * io_width`-bits wide.
 248     """

remember to either split these by newline, or:

* shorten the parameter (abbrev wds: wid not width)
* use a semicolon
* use single-quotes (""" is not a hard requirement)
* remove the unnecessary type information
* use less words ("the" is the first casualty)

the priority is to get these below 80 chars and to get them compact.
extra words *do not* increase clarity (just as ironically not *enough* causes
ambiguity)

 example:

    io_wid; "input and result bit-width. numerator is 2xio_wid"

it's a tricky balance.
Comment 11 Luke Kenneth Casson Leighton 2022-04-24 12:25:02 BST
oink moment, why are we not doing Big-Integer Goldschmidt? :)

https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
Comment 12 Luke Kenneth Casson Leighton 2022-04-24 12:31:47 BST
https://www.researchgate.net/publication/44384708_Improved_algorithms_and_hardware_designs_for_division_by_convergence

apparently larger multipliers are required
Comment 13 Jacob Lifshay 2022-04-24 17:59:14 BST
(In reply to Luke Kenneth Casson Leighton from comment #11)
> oink moment, why are we not doing Big-Integer Goldschmidt? :)
> 
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html

because goldschmidt's algorithm is much slower at small bigint sizes (small enough that the fastest multiplication algorithm is O(N^2)):
for a 2*N x N -> N word division, Algorithm D does approximately 2*N*N word operations -- Goldschmidt Division does around log2(N)*8*N*N word operations.

When bigints are bigger, then there are faster software algorithms than either Algorithm D or Goldschmidt Division, such as divide and conquer algorithms (good for medium-length bigints) and Block-Wise Barrett Division (good for large bigints).

There's also the important case of a bigint divided by a single word, where the algorithm is basically:
uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
uint64_t carry = 0;
for(int i = n_size; i > 0; i--)
{
    uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
    carry = v % d;
    n[i - 1] = v / d;
}
// n is the quotient, carry is the remainder

see https://gmplib.org/manual/Division-Algorithms
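A self-contained, compilable version of the loop above (assuming the gcc/clang __int128 extension; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t; /* gcc/clang extension */

/* divide little-endian bigint n[] (most-significant word in
 * n[n_size-1]) by a single word d, in place; returns the remainder */
static uint64_t bigint_divmod_word(uint64_t *n, int n_size, uint64_t d)
{
    uint64_t carry = 0;
    for (int i = n_size; i > 0; i--) {
        uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
        n[i - 1] = (uint64_t)(v / d);
        carry = (uint64_t)(v % d);
    }
    return carry; /* n[] now holds the quotient */
}
```

Each iteration is one 128-by-64 divide, which is the operation the proposed instruction performs.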
Comment 14 Jacob Lifshay 2022-04-24 18:22:34 BST
Replying to:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
> a new 128/64 scalar divide instruction would be
>
>    dividend[0:(XLEN*2)-1] <- (RA) || (RS)
>
> where RS=RA+1 just like lq and stq.

no...it should instead be:
divrem2du rt, ra, rb, rc
v = (ra << 64) | rb;
d = rc; // copy first in case rt is rc
rt = UDIV(v, rc);
ra = UREM(v, rc);

because that's needed for bigint by word division:
uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
uint64_t carry = 0;
for(int i = n_size; i > 0; i--)
{
    uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
    carry = v % d;
    n[i - 1] = v / d;
}
// n is the quotient, carry is the remainder

li carry, 0
set_remap_reverse
sv.divrem2du n.v, carry.s, n.v, d.s
// n is the quotient, carry is the remainder

we'd probably also want a divrem2d for signed div/rem since that's also needed for faster 128-bit arithmetic (which is likely way more common than bigints)
Comment 15 Jacob Lifshay 2022-04-24 18:25:58 BST
(In reply to Jacob Lifshay from comment #14)
> no...it should instead be:
> divrem2du rt, ra, rb, rc
> v = (ra << 64) | rb;
> d = rc; // copy first in case rt is rc
> rt = UDIV(v, rc);
> ra = UREM(v, rc);
whoops, i meant:
rt = UDIV(v, d);
ra = UREM(v, d);
Comment 16 Luke Kenneth Casson Leighton 2022-04-24 18:35:52 BST
(In reply to Jacob Lifshay from comment #14)
> Replying to:
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
> > a new 128/64 scalar divide instruction would be
> >
> >    dividend[0:(XLEN*2)-1] <- (RA) || (RS)
> >
> > where RS=RA+1 just like lq and stq.
> 
> no...it should instead be:
> divrem2du rt, ra, rb, rc

that's a no on any more VA-Form 4-operand math operations, and a hard-no
because there's no precedent to follow.  madd* etc. set the precedent
already so just barely justifies one additional instruction.

following lq/stq is sort-of-fine [i may have got the dividend the wrong
way round, (RS) || (RA) rather than (RA) || (RS)].

> we'd probably also want a divrem2d for signed div/rem since that's also
> needed for faster 128-bit arithmetic (which is likely way more common than
> bigints)

again: hard no.

there does not exist a divrem scalar instruction therefore justifying
the addition of four+ instructions when the space in EXT04 is so precious
is not something i am comfortable - at all - putting in front of the
OPF ISA WG.

when there is opcode pressure like this we simply cannot just blithely
and arbitrarily propose instructions.

especially when the other options - big-integer goldschmidt - have not
been explored.  i made a start here:

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divgnu64.c;hb=HEAD
Comment 17 Jacob Lifshay 2022-04-24 19:14:37 BST
(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Jacob Lifshay from comment #14)
> > Replying to:
> > https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004772.html
> > > a new 128/64 scalar divide instruction would be
> > >
> > >    dividend[0:(XLEN*2)-1] <- (RA) || (RS)
> > >
> > > where RS=RA+1 just like lq and stq.
> > 
> > no...it should instead be:
> > divrem2du rt, ra, rb, rc
> 
> that's a no on any more VA-Form 4-operand math operations, and a hard-no
> because there's no precedent to follow.

there's sorta precedent on other architectures: the x86 div and idiv instructions that do 128 by 64-bit div/rem. also sparc has 64 by 32-bit divide instructions.

>  madd* etc. set the precedent
> already so just barely justifies one additional instruction.
> 
> following lq/stq is sort-of-fine [i may have got the dividend the wrong
> way round, (RS) || (RA) rather than (RA) || (RS).

no, because that bigint by word division needs the high half of the numerator to be scalar and the low half to be vector. having an implicit high half means they have to be both scalar or both vector--unusable to do division with a single sv.divrem2du.

> > we'd probably also want a divrem2d for signed div/rem since that's also
> > needed for faster 128-bit arithmetic (which is likely way more common than
> > bigints)
> 
> again: hard no.
> 
> there does not exist a divrem scalar instruction therefore justifying
> the addition of four+

two: divrem2du and divrem2d.

> instructions when the space in EXT04 is so precious
> is not something i am comfortable - at all - putting in front of the
> OPF ISA WG.
> 
> when there is opcode pressure like this we simply cannot just blithely
> and arbitrarily propose instructions.
> 
> especially when the other options - big-integer goldschmidt - have not
> been explored.

i already explained that, for smaller bigints (around 8 words), big-integer goldschmidt needs like 30x as many word operations as algorithm d. algorithm d is necessary.

>  i made a start here:
> 
> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/
> divgnu64.c;hb=HEAD
Comment 18 Luke Kenneth Casson Leighton 2022-04-24 19:32:31 BST
(In reply to Jacob Lifshay from comment #14)
> Replying to:

> because that's needed for bigint by word division:
> uint64_t n[n_size], d; // n has most-significant word in n[n_size - 1]
> uint64_t carry = 0;
> for(i = n_size; i > 0; i--)
> {
>     uint128_t v = ((uint128_t)carry << 64) | n[i - 1];
>     carry = v % d;
>     n[i - 1] = v / d;
> }

... oink.

argh argh damnit.  missed this.  that's going to put huge pressure on those
4 remaining VA-Form slots in EXT04.  how can that be justified to OPF ISA WG.

ok let me think it through.

* v is 128 bit constructed from 2 64-bit.
* output1 is quotient of division by divisor
* output2 is modulo of divisor

by careful selection the same RC trick applies as to bigmul (madded)

by carefull-er selection it can also do scalar 128/64 div and/or mod
which also makes it useful for Algorithm D.

if RC contains carry and RC=0 as well as RT=0 is allowed then
selection of whether it *only* performs div or only mod is
possible.

by having an astounding amount of options i think we have a justification
for this precious opcode space.

let's see... adapt to swap RA and RC to make it similar to madded,
add zero conditional checking...

    divrem2du rt, ra, rb, rc
    v = (rc << 64) | rb;
    d = ra; // copy first in case rt is ra
    if rt != 0  
        rt = UDIV(v, d);
    if RS != 0
        RS = UREM(v, d);

same types of modes:

* RS=RT+VL mode when EXTRA bit 8 is 0 (default for scalar)
* RS=RC    mode when EXTRA bit 8 is 1

damnit signalling when not to write the remainder is difficult because
RS is implicit and can be a source.  how about:

    if rt != ra 
        rt = UDIV(v, d);
    if RT != 0
        RS = UREM(v, d);

messy.  if there were not 4 slots remaining in the EXT04 VA-Form
i'd say use 1 bit to decide if mod is written, and use RT!=0 for div
result, but that takes the *entire* space if divrem2ds is added as well.

tough call.
Comment 19 Luke Kenneth Casson Leighton 2022-04-24 19:39:53 BST
(In reply to Jacob Lifshay from comment #17)

> > following lq/stq is sort-of-fine [i may have got the dividend the wrong
> > way round, (RS) || (RA) rather than (RA) || (RS).
> 
> no, because that bigint by word division needs the high half of the
> numerator to be scalar and the low half to be vector. having an implicit
> high half means they have to be both scalar or both vector--unusable to do
> division with a single sv.divrem2du.

yep got it now, was still thinking in terms of the (early) mulx,
i hadn't twigged about the feedback loop option just like in madded.

> two: divrem2du and divrem2d.

times two for one bit to select if mod is written. have to think that
through, can it be achieved with RA or RT zero to stop div or rem
being stored.

> i already explained that, for smaller bigints (around 8 words), big-integer
> goldschmidt needs like 30x as many word operations as algorithm d. algorithm
> d is necessary.

yeah i'm catching up. went to bugzilla and saw 2 posts, my brain was
in linux-microwatt mode.

i'll need to do a mental walkthrough and also it would be useful to
split out in the divmnu64.c you can see i created some functions
bigmul bigsub bigadd in divgnu64.c
Comment 20 Luke Kenneth Casson Leighton 2022-04-24 22:39:44 BST
ok i updated the analysis page, no conditional "if RT != 0"
because i realised, if going to do that, might as well
use X-Form RT,RA,RB with implied RS=RA+VL and
do dividend = (RA) || (RS) and have divqd modqd for 128/64

that reduces pressure on EXT04.

now, question is, does signed-div make any sense? it didn't
in big-mul because a corrective factor can be done after.
Comment 21 Luke Kenneth Casson Leighton 2022-04-24 23:42:52 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
 
> now, question is, does signed-div make any sense? it didn't
> in big-mul because a corrective factor can be done after.
 
line 34

https://github.com/hcs0/Hackers-Delight/blob/master/mulmns.c.txt
Comment 22 Jacob Lifshay 2022-04-25 00:14:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> now, question is, does signed-div make any sense?

I think it does because I'm pretty sure it can be used to make a faster 128-bit signed integer div/rem algorithm.

I couldn't find any decent examples since I think everyone either implements it in c in terms of u128 div/rem implemented in c (aka slow and inefficient because c doesn't have 128 by 64-bit div), or implements it in terms of an assembly u128 div algorithm with a comment that the algorithm should be optimized more...
Comment 23 Jacob Lifshay 2022-04-25 00:27:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> ok i updated the analysis page, 

> `divrem2du RT,RA,RB,RC`
>
>      divisor = (RC) || (RB)
>      dividend = EXTZ128(RA)
>      RT = UDIV(dividend, divisor)
>      RS = UREM(dividend, divisor)
>
> Again, in an SVP64 context, using EXTRA mode bit 8 allows for
> selecting whether `RS=RC` or
> `RS=RT+VL`.

looks good!
Comment 24 Luke Kenneth Casson Leighton 2022-04-25 00:33:41 BST
how about inverting un and vn if -ve then correcting if signs not equal?

inversion of bigint can be done with RB=scalar-zero, subtract RA as vector.

in case where n=1 then inversion of scalar v is easy.

leaves overflow etc to deal with but not a big deal.

am all for avoiding adding unnecessary ops here.
Comment 25 Luke Kenneth Casson Leighton 2022-04-25 05:58:50 BST
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> >      divisor = (RC) || (RB)
> >      dividend = EXTZ128(RA)
> >      RT = UDIV(dividend, divisor)
> >      RS = UREM(dividend, divisor)
> looks good!

detecting the overflow conditions needs adapting,
in fixedarith divdeu it is:

    dividend[0:(XLEN*2)-1] <- (RA) || [0]*XLEN
    divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
    result <- dividend / divisor
    if (RA) <u (RB) then
        RT <- result[XLEN:(XLEN*2)-1]
        overflow <- 0

is it as simple as this:

    dividend = (RC) || (RB)
    divisor = EXTZ128(RA)
    if (((RB) = 0)  & ((RC)   <u (RA)) |
        ((RB) != 0) & ((RC)+1 <u (RA))) then
       RT = UDIV(dividend, divisor)
       RS = UREM(dividend, divisor)
       overflow = 0

or is it necessary to do this (a 192-bit compare)

    if ([0]*XLEN || dividend) <u (divisor || [0]*XLEN)

it's looking for the scenario where the result
cannot fit into 64 bit, right?
Comment 26 Jacob Lifshay 2022-04-25 10:20:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Jacob Lifshay from comment #23)
> > (In reply to Luke Kenneth Casson Leighton from comment #20)
> > >      divisor = (RC) || (RB)
> > >      dividend = EXTZ128(RA)
> > >      RT = UDIV(dividend, divisor)
> > >      RS = UREM(dividend, divisor)
> > looks good!

actually, that's *totally broken*...I just now noticed. you want dividend to be (RC) || (RB), not divisor. they're accidentally swapped.

> detecting the overflow conditions needs adapting,
> in fixedarith divdeu it is:

> it's looking for the scenario where the result
> cannot fit into 64 bit, right?

yeah.

I'd just write it as:
`divrem2du RT,RA,RB,RC`

dividend = (RC) || (RB)
divisor = (RA)
if divisor == 0 or (RC) >= divisor:
    overflow = 1
    RT = 0xFFFF_FFFF_FFFF_FFFF # saturate .. makes some stuff nicer
    RS = 0 # pick a value that won't be bigger than RC
else:
    RT = UDIV(dividend, EXTZ128(divisor))
    RS = UREM(dividend, EXTZ128(divisor))
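An executable C model of those semantics (helper name is mine; returns the overflow flag):

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t; /* gcc/clang extension */

/* model of the divrem2du sketch: dividend = (RC)||(RB), divisor = (RA);
 * on overflow RT saturates to all-ones and RS (remainder) is zeroed */
static int divrem2du_model(uint64_t *rt, uint64_t *rs,
                           uint64_t ra, uint64_t rb, uint64_t rc)
{
    if (ra == 0 || rc >= ra) {
        *rt = 0xFFFFFFFFFFFFFFFFULL; /* saturate */
        *rs = 0;
        return 1; /* overflow: quotient would not fit in 64 bits */
    }
    uint128_t dividend = ((uint128_t)rc << 64) | rb;
    *rt = (uint64_t)(dividend / ra);
    *rs = (uint64_t)(dividend % ra);
    return 0;
}
```

The rc >= ra check alone suffices: if RC < RA the quotient is strictly below 2^64 regardless of RB (see the next comment).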
Comment 27 Jacob Lifshay 2022-04-25 10:30:17 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> is it as simple as this:
> 
>     dividend = (RC) || (RB)
>     divisor = EXTZ128(RA)
>     if (((RB) = 0)  & ((RC)   <u (RA)) |
>         ((RB) != 0) & ((RC)+1 <u (RA))) then

the reason you only need to compare RC with RA and don't need to compare RB is because when RC == RA - 1, then all values of RB will still cause the division to round down to fitting in 64-bits. when RC == RA, if RB == 0 that's still big enough to overflow, if RB > 0 then that's even bigger and will definitely overflow.
Comment 28 Luke Kenneth Casson Leighton 2022-04-25 12:18:02 BST
yep corrected pseudocode, spotted late at night compared to divdeu, updated.
RA <u RB makes sense now, RA || fffff..ff / RB still fits (not 0x1_0000_0000)
no on non-orthogonal behaviour on overflow, UNDEFINED and has to be that way.
otherwise a separate RFC has to be made to get *all other* div ops to be
so defined (fffffff) and it's too late for that. bit of a pain.
Comment 29 Luke Kenneth Casson Leighton 2022-04-25 18:12:47 BST
 128         // Compute estimate qhat of q[j] from top 2 digits
 129         uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
 130         qhat = dig2 / vn[n - 1];
 131         rhat = dig2 % vn[n - 1];

ehm.... ehmehmehm... qhat and rhat are uint64_t (and un/vn 32 bit*)
therefore when vn/un
are uint64_t[] qhat and rhat must be uint128_t which is NOT happening,
we are NOT going to return QTY 4 64 bit registers from divmod2du.

thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64
divide will occur.

any clues?
Comment 30 Luke Kenneth Casson Leighton 2022-04-25 19:03:37 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)

 
> thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64
> divide will occur.
> 
> any clues?

working it out (and this is for n=0):

  98         k = 0; // the case of a
  99         for (j = m - 1; j >= 0; j--)
 100         {                                 // single-digit
 101             uint64_t dig2 = (k * b + u[j]);
 102             q[j] = dig2 / v[0]; // divisor here.
 103             k = dig2 % v[0]; // modulo back into next loop
 104         }

* on first entry, k=0 and v[0] must be >0 otherwise div-overflow
  - therefore RA (0) < RB (v[0]) and we're good.
  - and RS = modulo v[0] which is *ALWAYS*, by definition,
    between 0 and v[0]-1
* on second entry therefore, k will ALWAYS be between
  0 and v[0]-1
  - therefore RA will ALWAYS be below RB (v[0])

so this is good, for the scalar-vector case.

but, for that qhat estimate, it goes pearshaped... with the
code as-written.

if instead however divmnu64.c calls *itself* (because the
division estimate is to be on the MSB of v) then it should
end up using k=0 lines 98-104 and all will be well
Comment 31 Luke Kenneth Casson Leighton 2022-04-25 20:14:58 BST
ha. simple. this:

 128         // Compute estimate qhat of q[j] from top 2 digits
 129         uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
 130         qhat = dig2 / vn[n - 1];
 131         rhat = dig2 % vn[n - 1];

becomes a 2-long unrolling of the loop @ line 98, like this:

         uint64_t dig1 = (uint64_t)un[j + n];
         qhat = dig1 / vn[n - 1];
         rhat = dig1 % vn[n - 1];
         uint64_t dig2 = (rhat << 32) | un[j + n - 1];
         qhat = qhat<<32 | dig2 / vn[n - 1];
         rhat = dig2 % vn[n - 1];

(translate to 128/64 rather than 64/32 of course).
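A quick self-contained check of that 2-step narrowing divide at 64/32 width (helper name is mine; note the final remainder is simply dig2 % v, with no shifted-in rhat term):

```c
#include <assert.h>
#include <stdint.h>

/* 2-step narrowing divide: (hi:lo) / v with 32-bit digits, using only
 * 64/32 divides -- the analogue of building the qhat estimate from two
 * 128/64 divides. q1 fits 32 bits when hi < v*2^32; r1 < v guarantees
 * dig2/v < 2^32. */
static void div2by1(uint32_t hi, uint32_t lo, uint32_t v,
                    uint64_t *qhat, uint64_t *rhat)
{
    uint64_t q1 = (uint64_t)hi / v;      /* top digit quotient */
    uint64_t r1 = (uint64_t)hi % v;
    uint64_t dig2 = (r1 << 32) | lo;     /* remainder shifted back in */
    *qhat = (q1 << 32) | (dig2 / v);
    *rhat = dig2 % v;                    /* final remainder only */
}
```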
Comment 32 Jacob Lifshay 2022-04-25 21:07:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)
>  128         // Compute estimate qhat of q[j] from top 2 digits
>  129         uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
>  130         qhat = dig2 / vn[n - 1];
>  131         rhat = dig2 % vn[n - 1];
> 
> ehm.... ehmehmehm... qhat and rhat are uint64_t (and un/vn 32 bit*)
> therefore when vn/un
> are uint64_t[] qhat and rhat must be uint128_t which is NOT happening,
> we are NOT going to return QTY 4 64 bit registers from divmod2du.
> 
> thus even on scalar usage if un[j+n] > vn[n-1] an overflow on the 128/64
> divide will occur.
> 
> any clues?

yes...knuth's original algorithm is written in MIX assembly where the operation used is a 2-word by 1-word div/rem where both quotient and remainder are 1 word. the only case where that can overflow is where un[j + n] == vn[n - 1] in which case it sets qhat to the max value that fits in a single word (in the 64-bit case, 2**64-1).

the initial normalization step creates the invariant that un[j + n] <= vn[n - 1] and that invariant is maintained by the loop.

the c code just has qhat and rhat be 2-word values to prevent UB when the overflow occurs, and also for convenience to avoid needing to constantly cast them to 2-word values. they can totally be 1-word values at the cost of putting the initial qhat computation inside of an if that checks for overflow.

original algorithm step:
D3. [Calculate q̂.] If uⱼ = v₁, set q̂ ← b - 1; otherwise set q̂ ← ⌊(uⱼb + uⱼ₊₁)/v₁⌋. Now test if v₂q̂ > (uⱼb + uⱼ₊₁ - q̂v₁)b + uⱼ₊₂; if so, decrease q̂ by 1 and repeat this test. (The latter test determines at high speed most of the cases in which the trial value q̂ is one too large, and it eliminates all cases where q̂ is two too large; see exercises 19, 20, 21.)

in ascii:
D3. [Calculate qhat.] If u[j] = v[1], set qhat <- b - 1; otherwise set qhat <- floor((u[j]*b + u[j+1])/v[1]). Now test if v[2]*qhat > (u[j]*b + u[j+1] - qhat*v[1])*b + u[j+2]; if so, decrease qhat by 1 and repeat this test. (The latter test determines at high speed most of the cases in which the trial value qhat is one too large, and it eliminates all cases where qhat is two too large; see exercises 19, 20, 21.)
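Step D3 as C, with 32-bit digits for brevity (helper name is mine; this follows the divmnu structure, where the uj == v1 overflow case is absorbed by the qhat >= b clamp in the refinement loop rather than a separate branch):

```c
#include <assert.h>
#include <stdint.h>

/* Knuth step D3: trial quotient digit from the top two dividend digits,
 * refined with the v2 test; assumes normalized input (v1 >= b/2), after
 * which qhat is at most one too large (fixed later by the add-back
 * step). short-circuit || avoids the multiply when qhat >= b. */
static uint64_t knuth_d3_qhat(uint32_t uj, uint32_t uj1, uint32_t uj2,
                              uint32_t v1, uint32_t v2)
{
    const uint64_t b = 1ULL << 32;              /* digit base */
    uint64_t dig2 = ((uint64_t)uj << 32) | uj1;
    uint64_t qhat = dig2 / v1;
    uint64_t rhat = dig2 % v1;
    while (qhat >= b || qhat * v2 > b * rhat + uj2) {
        qhat--;
        rhat += v1;
        if (rhat >= b)
            break;   /* rhat no longer a digit: the test must now pass */
    }
    return qhat;
}
```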
Comment 33 Luke Kenneth Casson Leighton 2022-04-26 00:50:11 BST
(In reply to Jacob Lifshay from comment #32)

> yes...knuth's original algorithm is written in mix assembly where the
> operation used is a 2-word by 1-word div/rem where both quotient and
> remainder are 1 word. the only case where that can overflow is where un[j +
> n] == vn[n - 1] in which case it sets qhat to the max value that fits in a
> single word (in the 64-bit case, 2**64-1).

hence the idea of setting to all 1s.

i think... given this is a new type of instruction (dual divmod)
we can get away with that.  elwidth overrides make it useful for
other data widths.
Comment 34 Luke Kenneth Casson Leighton 2022-04-26 12:03:50 BST
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=b6044fdb84bea61887999ffa5e17543fa91b8f89

good, whew.  i have to explain all this :)

caveat here: divmod2du communicating "overflow", that's... expensive.
CR0 and/or OE=1.  OE=1 is not going to fit into VA-Form EXT04.
each is an extra register.

making the operation always "divmod2du." would do the job, but, dang,
that's 3-in, 3-out.
Comment 35 Jacob Lifshay 2022-04-26 18:14:15 BST
(In reply to Luke Kenneth Casson Leighton from comment #34)
> https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;
> h=b6044fdb84bea61887999ffa5e17543fa91b8f89
> 
> good, whew.  i have to explain all this :)
> 
> caveat here: divmod2du communicating "overflow", that's... expensive.
> CR0 and/or OE=1.  OE=1 is not going to fit into VA-Form EXT04.
> each is an extra register.

we'd want writing to CR0, since it saves an instruction...otherwise you'd need the move OV to CR0 instruction before the conditional branch.
> 
> making the operation always "divmod2du." would do the job, but, dang,
> that's 3-in, 3-out.

well...is 3-in 3-out more expensive than needing an extra compare instruction in an important use of divmodqdu?

i guess we could, instead of writing CR0 or OV, just have it compare like so:
cmpl n_hi, d
divmodqdu qhat, d, n_lo, n_hi

the cmpl takes extra resources...in a non-superscalar system takes an extra clock cycle. we can't put it after the divmodqdu (which would save a clock cycle on non-superscalar systems) since n_hi is overwritten, though otoh we likely need to have another copy of n_hi in a register anyway and cmpl could just use that.
Comment 36 Luke Kenneth Casson Leighton 2022-04-26 18:20:57 BST
(In reply to Jacob Lifshay from comment #35)
 
> well...is 3-in 3-out more expensive than needing an extra compare
> instruction in an important use of divmodqdu?

hell yes. that's 6 FFs inside an entire Dependency Matrix Row.
 
> i guess we could, instead of writing CR0 or OV, just have it compare like so:
> cmpl n_hi, d
> divmodqdu qhat, d, n_lo, n_hi

much better / sane
Comment 37 Jacob Lifshay 2022-05-03 08:12:20 BST
See also:
LLVM adding big-int division (fixed-length ints only) to its compiler builtins library, allowing LLVM to correctly compile all int types (even 16384 bit).
https://reviews.llvm.org/D120327
Comment 38 Jacob Lifshay 2022-05-05 09:31:43 BST
Mentioned our work on Knuth's Algorithm D over on LLVM Phabricator where they're currently implementing big-int division:
https://reviews.llvm.org/D120327#3493233

I have been helping review that merge request -- they originally had Algorithm D but switched to a slower, simpler algorithm; I convinced them that the library API should support replacing the algorithm with Algorithm D again, which requires passing in sufficiently sized buffers and allowing the inputs to be modified.
Comment 39 Luke Kenneth Casson Leighton 2022-05-13 20:24:27 BST
just realised that another variant of madded, call it say maddd, could be
used for high accuracy dotproduct (bug #555).

it would be:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = (RC || [0]*64) + prod
    RT <- sum[64:127]
    RS <- sum[0:63] # RS is either RC or RT+VL

but the question is: is it worth doing in the integer space?
in FP, using 2 regs to store the intermediate result (FP128),
yes.  but integer? honestly i have no idea
Comment 40 Jacob Lifshay 2022-05-13 21:02:20 BST
(In reply to Luke Kenneth Casson Leighton from comment #39)
> just realised that another variant of madded, call it say maddd, could be
> used for high accuracy dotproduct (bug #555).
> 
> it would be:
> 
>     prod[0:127] = (RA) * (RB)
>     sum[0:127] = (RC || [0]*64) + prod
>     RT <- sum[64:127]
>     RS <- sum[0:63] # RS is either RC or RT+VL

that isn't what you want...you want the existing maddld instruction, but extended to have rt and rc be 128-bit register pairs.
> 
> but the question is: is it worth doing in the integer space?

imho no. the existing integer maddld already returns exact results for dot product, mod 2^64.

the only way we might want a different instruction is for 64x64->128-bit dot product, which would be a 4-in 2-out instruction, which imho is too many regs, otherwise we'd also be ok with bigint*word+bigint muladd which is a 4-in 2-out op which you rejected previously.

> in FP, using 2 regs to store the intermediate result (FP128),

this has the exact same problem as bigint*word+bigint muladd which is a 4-in 2-out op...admittedly it has a better motivation than integer dot product though.

another alternative for fp dot product with higher precision is just doing a vector fmul followed by a tree reduction fadd, which retains most of the precision due to values being rounded O(log N) times rather than O(N) times from a serial reduction.
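The tree-reduction idea, sketched in C with a recursive pairwise sum (using double here purely for illustration; the same shape applies to any FP width):

```c
#include <assert.h>
#include <stddef.h>

/* pairwise (tree) summation: each input passes through O(log N)
 * roundings, vs O(N) for a serial left-to-right reduction */
static double tree_sum(const double *x, size_t n)
{
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return tree_sum(x, half) + tree_sum(x + half, n - half);
}
```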
Comment 41 Luke Kenneth Casson Leighton 2022-05-13 21:53:36 BST
(In reply to Jacob Lifshay from comment #40)
 
> that isn't what you want...you want the existing maddld instruction, but
> extended to have rt and rc be 128-bit register pairs.

damn.  which is 4-in, 2-out.  6 64-bit register hazards is far too much.

> another alternative for fp dot product with higher precision is just doing a
> vector fmul followed by a tree reduction fadd, which retains most of the
> precision due to values being rounded O(log N) times rather than O(N) times
> from a serial reduction.

like it.
Comment 42 Jacob Lifshay 2022-09-19 03:15:07 BST
apparently Rust is getting bigint helper functions too:
tracking issue:
https://github.com/rust-lang/rust/issues/85532

public API:
> // On unsigned integers:
> 
> /// `self + rhs + carry` (full adder)
> fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);
> 
> /// `self - rhs - carry` (full "subtractor")
> fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);
> 
> /// `self * rhs + carry` (multiply-accumulate)
> fn carrying_mul(self, rhs: Self, carry: Self) -> (Self, Self);
> 
> /// `self * rhs` (wide multiplication, same as `self.carrying_mul(rhs, 0)`)
> fn widening_mul(self, rhs: Self) -> (Self, Self);
> 
> 
> // On signed integers:
> 
> /// `self + rhs + carry` (full adder)
> fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);
> 
> /// `self - rhs - carry` (full "subtractor")
> fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);
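For comparison, C models of the first two unsigned helpers (names mine). carrying_add is the adde pattern; borrowing_sub matches subfe, whose CA flag is the inverted borrow:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* full adder: a + b + carry-in, carry-out via wraparound checks
 * (at most one of the two additions can wrap) */
static uint64_t carrying_add_u64(uint64_t a, uint64_t b, bool cin,
                                 bool *cout)
{
    uint64_t s = a + b;
    bool c1 = s < a;          /* a + b wrapped */
    uint64_t s2 = s + cin;
    bool c2 = s2 < s;         /* + carry wrapped */
    *cout = c1 || c2;
    return s2;
}

/* full "subtractor": a - b - borrow-in, borrow-out likewise */
static uint64_t borrowing_sub_u64(uint64_t a, uint64_t b, bool bin,
                                  bool *bout)
{
    uint64_t d = a - b;
    bool b1 = a < b;          /* a - b borrowed */
    uint64_t d2 = d - bin;
    bool b2 = d < bin;        /* - borrow borrowed */
    *bout = b1 || b2;
    return d2;
}
```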