Bug 676 - FORTRAN MAXLOC SVP64 example
Summary: FORTRAN MAXLOC SVP64 example
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Documentation (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/co...
Depends on:
Blocks: 952 953
  Show dependency treegraph
 
Reported: 2021-08-27 14:36 BST by Luke Kenneth Casson Leighton
Modified: 2023-04-26 19:51 BST (History)
1 user (show)

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 953
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2021-08-27 14:36:53 BST
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s

create example SVP64 assembly for:

int m2(int * const restrict a, int n) 
{ 
   int m, nm; 
   int i; 

   m = INT_MIN; 
   nm = -1; 
   for (i=0; i<n; i++) 
   { 
       if (a[i] > m) 
       { 
           m = a[i]; 
           nm = i; 
       } 
    } 
    return nm; 
}
Comment 1 Luke Kenneth Casson Leighton 2021-08-27 14:58:05 BST
* a data-dependent fail-first sv.cmp tests against
  a[i] <= m.

* SVP64 Branch new CTR Mode would loop back over all elements that
failed, decrementing CTR in the process.

* a data-dependent ffirst sv.cmp a[i] > m creates a predicate
  mask where VL is truncated to exclude the first failed
  element.
Comment 2 Luke Kenneth Casson Leighton 2021-08-27 22:42:10 BST
int m2(int * const restrict a, int n) 
{ 
    int m, nm; 
    int i; 

    m = INT_MIN; 
    nm = -1; 
    i=0; 
    while (i<n) { 
        while (i<n && a[i]<=m) 
            i++; 
        if (a[i] > m) { 
            m = a[i]; 
            nm = i; 
        } 
        i++; 
    } 
    return nm; 
}

bug 
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/FiMO3falAgAJ

while (i<n) { 
    while (i<n && a[i]<=m) 
       i++; 
    if (i < n) { 
       m = a[i]; 
      nm = i; 
    } 
    i++;
}

alternative?

while (i<n) { 
    // skip up to first max
    while (i<n && a[i]<=m)  i++;
    // continue as long as picking new m
    while (i<n && a[i]>=m) { 
       m = a[i]; 
       nm = i;
       i++;
    }
}
Comment 3 Jacob Lifshay 2021-09-23 02:46:01 BST
changing milestone to match parent task for budget allocation
Comment 4 Luke Kenneth Casson Leighton 2022-10-01 01:31:41 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)

this one is remarkably simple: use the characteristics
of mapreduce mode to overwrite nm from a parallel
comparison

> int m2(int * const restrict a, int n) 
> { 
>    int m, nm; 
>    int i; 
> 
>    m = INT_MIN; 
>    nm = -1; 

first prepare offsets

    setvl vl=16
    sv.svstep *offs, srcstep # 0 1 2 ... vl-1

and nm-potential

    li nm, -1
    li nmbase, 0

next use CTR mode

>    for (i=0; i<n; i++) 
>    { 

    setvl r1,0,16,0,1,1 # VL=r1=MIN(MAXVL,CTR)

perform the load of the vector data, post-increment update:

     sv.ldux/pi *a, 8(a_ptr)

>        if (a[i] > m) 
>        { 

get max comparison, overlap-loop into m where reg a=m+1

    sv.max. *m, *a, *m

next, these two both use mapreduce mode with predication

>            m = a[i]; 

    sv.ori/mr/m=gt m,*a,0

>            nm = i;

add-overwriting base with vector-offset into nm, yes *scalar* nm,
the "last winner" is the largest index

    sv.add/mr/m=gt nm, nmbase, *offs

>        } 

nmbase must be incremented by vl

    add nmbase, nmbase, r1 # r1 contains copy of vl

branch and subtract VL from CTR if not end of loop

    sv.bc/all 16, *0, loop

>     } 
>     return nm; 

    mr r3, nmbase
    blr

the trick here is the multiple overwrites of nm and m, even though
they are scalar, the usual "termination of looping because scalar result"
is switched *off* in mapreduce mode.
thus due to the predication the last successful conditional overwrite
"wins"

the second trick was to use a base nm which increments by VL, from a
vector of sequential constants.

there will be better/other methods, using data-dependent failfirst
will stop at the first compare-fail but will need "continuation"
(2 nested loops) whereas the mapreduce one is really simple but
relies on WaW hazards to be eliminated
Comment 5 Luke Kenneth Casson Leighton 2022-10-23 16:58:54 BST
we now have "sv.maxs." which should do nicely to get the maximum *and* notify
(Rc=1 vector) whether the max occurred (combined cmp function).  if predicated
with a prior scalar-vector cmp, although there would be two vector compares it
would reduce the amount of interdependencies (except on a pre-sorted list)
see bug #915 about combined cmp function
Comment 6 Luke Kenneth Casson Leighton 2023-04-26 16:28:26 BST
also from strncpy, load-post-increment

  48                 # load VL bytes (update r10 addr)
  49                 "sv.lbzu/pi *16, 1(10)",
Comment 7 Luke Kenneth Casson Leighton 2023-04-26 17:02:21 BST
there is another technique which will work here: parallel-reduction
schedule in vertical-first mode.

the reason that vertical-first is needed is because in the middle
of the reduction the index has to be updated: it is 2 bits of
work not 1.
Comment 8 Luke Kenneth Casson Leighton 2023-04-26 19:06:24 BST
continuing
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/SNXgLyP8AgAJ