A recipe is needed that uses prefix-sum REMAP to do polynomial interpolation of non-linear equations, Taylor series, etc.

https://github.com/jiahao/parallel-prefix

<markos_> I've done a similar method in the past
<programmerjake> idk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation:
<programmerjake> reduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) ==
<programmerjake> c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...
<programmerjake> though it tends to have lower precision than using fma because of rounding before adding
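A minimal scalar Python sketch of programmerjake's recipe (the function name poly_eval_prefix is made up for illustration; itertools.accumulate stands in for the parallel prefix-multiply, and a plain sum-reduce stands in for the tree reduction):

```python
from functools import reduce
from itertools import accumulate
import operator

def poly_eval_prefix(c, x):
    # [1, x, x, x, ...] --prefix-multiply--> [1, x, x^2, x^3, ...]
    powers = list(accumulate([1] + [x] * (len(c) - 1), operator.mul))
    # element-wise multiply by the coefficients, then add-reduce:
    # c[0] + c[1]*x + c[2]*x^2 + ...
    return reduce(operator.add, (ci * p for ci, p in zip(c, powers)))
```

Note this rounds after the multiply and again at each add, which is the precision loss relative to fma mentioned above.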

(In reply to Luke Kenneth Casson Leighton from comment #2)
> <markos_> I've done a similar method in the past
> <programmerjake> idk, but parallel prefix sum and parallel tree reduction
> can be used for polynomial evaluation:
> <programmerjake> reduce(element_wise_mul(c, prefix_sum([1] + splat(x),
> op=mul)), op=add) ==
> <programmerjake> c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...
> <programmerjake> though it tends to have lower precision than using fma
> because of rounding before adding

ok writing this out myself so that my brain absorbs it:

* start with:
  [1, x, x, x, x...]
* perform prefix-sum multiply (parallel is fine):
  [1, x, x^2, x^3, ...]
* take the vector of coefficients c:
  [c0, c1, c2, c3, c4...]
* straight element-wise multiply:
  [c0, c1*x, c2*x^2, ...]
* perform the reduction:
  c0 + c1*x + c2*x^2 ...

ok, that is dead easy. so now what about parallel reduction using fmac? let us first try mapreduce on fmac:

    sv.madd/mr res, *c, *xv, res

that looks perfectly fine, and can get accuracy from fmac. now for parallel reduction, res would have to be a vector, but i don't believe it works, because *c (or *xv) would end up getting multiplied in more than once -- unless done as Vertical-First.
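A scalar Python model of the sv.madd/mr mapreduce above, assuming xv already holds the powers [1, x, x^2, ...] from the prefix-multiply (the function name is illustrative; plain Python does not fuse the multiply-add, the comment marks where a hardware fmac would avoid the intermediate rounding):

```python
def poly_eval_mapreduce_fmac(c, xv):
    # scalar model of: sv.madd/mr res, *c, *xv, res
    # res accumulates c[i]*xv[i] one element at a time; on hardware each
    # step is a single fused multiply-add (no rounding between mul and add)
    res = 0.0
    for ci, xi in zip(c, xv):
        res = ci * xi + res
    return res
```

This is sequential by construction, which is exactly why the parallel-reduction variant with a vector res runs into the double-multiplication problem noted above.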

(In reply to Luke Kenneth Casson Leighton from comment #3)
> * perform the reduction:
>   c0 + c1*x + c2*x^2 ...

parallel reduction is fine here too -- reduce using fadd, though as noted it gets less accuracy than using fma (because fma doesn't round between the mul and the add, and because higher powers of x tend to give monomials (coefficients included) that are smaller, so adding a smaller approximate value to a larger exact value (the coefficient) gives a value with higher relative precision).
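For concreteness, a Python sketch of the parallel tree reduction using fadd (pairwise sums per level, log2(n) levels instead of n-1 sequential adds; function name is illustrative):

```python
def tree_reduce_add(v):
    # parallel tree reduction with fadd: each level sums adjacent pairs,
    # halving the vector length until one element remains
    v = list(v)
    while len(v) > 1:
        pairs = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
        if len(v) % 2:           # odd leftover carries to the next level
            pairs.append(v[-1])
        v = pairs
    return v[0]
```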

(In reply to Jacob Lifshay from comment #4)
> parallel reduction is fine here too -- reduce using fadd, though as noted
> it gets less accuracy than using fma

sounds to me like /mrr (reverse-gear) would be sensible to use, so that at the end the tinier values accumulate with better accuracy. exactly the kind of thing to put in an app note.
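A scalar Python model of the /mrr idea, assuming the higher powers give the smaller monomials as Jacob describes: iterate from the highest index down so the small terms are summed together before they meet the large low-order terms (a reversed loop is only a stand-in for the reverse-gear mapreduce semantics):

```python
def poly_eval_reverse(c, x):
    # /mrr-style reverse-gear accumulation: start from the highest power,
    # so the tiniest monomials accumulate first with better relative
    # accuracy, then the larger low-order terms are added last
    res = 0.0
    for i in reversed(range(len(c))):
        res += c[i] * x ** i
    return res
```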