https://groups.google.com/g/comp.arch/c/5h6jzyESg5s create example SVP64 assembly for: int m2(int * const restrict a, int n) { int m, nm; int i; m = INT_MIN; nm = -1; for (i=0; i<n; i++) { if (a[i] > m) { m = a[i]; nm = i; } } return nm; }
* a data-dependent fail-first sv.cmp tests against a[i] <= m. * SVP64 Branch new CTR Mode would loop back over all elements that failed, decrementing CTR in the process. * a data-dependent ffirst sv.cmp a[i] > m creates a predicate mask where VL is truncated to exclude the first failed element.
int m2(int * const restrict a, int n) { int m, nm; int i; m = INT_MIN; nm = -1; i=0; while (i<n) { while (i<n && a[i]<=m) i++; if (a[i] > m) { m = a[i]; nm = i; } i++; } return nm; } bug https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/FiMO3falAgAJ while (i<n) { while (i<n && a[i]<=m) i++; if (i < n) { m = a[i]; nm = i; } i++; } alternative? while (i<n) { // skip up to first max while (i<n && a[i]<=m) i++; // continue as long as picking new m while (i<n && a[i]>=m) { m = a[i]; nm = i; i++; } }
changing milestone to match parent task for budget allocation
(In reply to Luke Kenneth Casson Leighton from comment #0) this one is remarkably simple: use the characteristics of mapreduce mode to overwrite nm from a parallel comparison > int m2(int * const restrict a, int n) > { > int m, nm; > int i; > > m = INT_MIN; > nm = -1; first prepare offsets setvl vl=16 sv.svstep *offs, srcstep # 0 1 2 ... vl-1 and nm-potential li nm, -1 li nmbase, 0 next use CTR mode > for (i=0; i<n; i++) > { setvl r1,0,16,0,1,1 # VL=r1=MIN(MAXVL,CTR) perform the load of the vector data, post-increment update: sv.ldux/pi *a, 8(a_ptr) > if (a[i] > m) > { get max comparison, overlap-loop into m where reg a=m+1 sv.max. *m, *a, *m next, these two both use mapreduce mode with predication > m = a[i]; sv.ori/mr/m=gt m,*a,0 > nm = i; add-overwriting base with vector-offset into nm, yes *scalar* nm, the "last winner" is the largest index sv.add/mr/m=gt nm, nmbase, *offs > } nmbase must be incremented by vl add nmbase, nmbase, r1 # r1 contains copy of vl branch and subtract VL from CTR if not end of loop sv.bc/all 16, *0, loop > } > return nm; mr r3, nmbase blr the trick here is the multiple overwrites of nm and m, even though they are scalar, the usual "termination of looping because scalar result" is switched *off* in mapreduce mode. thus due to the predication the last successful conditional overwrite "wins" the second trick was to use a base nm which increments by VL, from a vector of sequential constants. there will be better/other methods, using data-dependent failfirst will stop at the first compare-fail but will need "continuation" (2 nested loops) whereas the mapreduce one is really simple but relies on WaW hazards to be eliminated
we now have "sv.maxs." which should do nicely to get the maximum *and* notify (Rc=1 vector) whether the max occurred (combined cmp function). if predicated with a prior scalar-vector cmp, although there would be two vector compares it would reduce the amount of interdependencies (except on a pre-sorted list) see bug #915 about combined cmp function
also from strncpy, load-post-increment 48 # load VL bytes (update r10 addr) 49 "sv.lbzu/pi *16, 1(10)",
there is another technique which will work here: parallel-reduction schedule in vertical-first mode. the reason that vertical-first is needed is because in the middle of the reduction the index has to be updated: it is 2 bits of work not 1.
continuing https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/SNXgLyP8AgAJ