Bug 676 - FORTRAN MAXLOC SVP64 example
Summary: FORTRAN MAXLOC SVP64 example
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Documentation (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/co...
Depends on: 1183 1236 1239 1246 1248 1034 1249
Blocks: 952 953
  Show dependency treegraph
 
Reported: 2021-08-27 14:36 BST by Luke Kenneth Casson Leighton
Modified: 2024-03-01 00:11 GMT (History)
2 users (show)

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 953
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl={amount=1500, submitted=2024-02-12, paid=2024-02-29} jacob=500


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2021-08-27 14:36:53 BST
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s

create example SVP64 assembly for:

int m2(int * const restrict a, int n) 
{ 
   int m, nm; 
   int i; 

   m = INT_MIN; 
   nm = -1; 
   for (i=0; i<n; i++) 
   { 
       if (a[i] > m) 
       { 
           m = a[i]; 
           nm = i; 
       } 
    } 
    return nm; 
}
Comment 1 Luke Kenneth Casson Leighton 2021-08-27 14:58:05 BST
* a data-dependent fail-first sv.cmp tests against
  a[i] <= m.

* SVP64 Branch new CTR Mode would loop back over all elements that
failed, decrementing CTR in the process.

* a data-dependent ffirst sv.cmp a[i] > m creates a predicate
  mask where VL is truncated to exclude the first failed
  element.
Comment 2 Luke Kenneth Casson Leighton 2021-08-27 22:42:10 BST
int m2(int * const restrict a, int n) 
{ 
    int m, nm; 
    int i; 

    m = INT_MIN; 
    nm = -1; 
    i=0; 
    while (i<n) { 
        while (i<n && a[i]<=m) 
            i++; 
        if (a[i] > m) { 
            m = a[i]; 
            nm = i; 
        } 
        i++; 
    } 
    return nm; 
}

bug 
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/FiMO3falAgAJ

while (i<n) { 
    while (i<n && a[i]<=m) 
       i++; 
    if (i < n) { 
       m = a[i]; 
      nm = i; 
    } 
    i++;
}

alternative?

while (i<n) { 
    // skip up to first max
    while (i<n && a[i]<=m)  i++;
    // continue as long as picking new m
    while (i<n && a[i]>m) { 
       m = a[i]; 
       nm = i;
       i++;
    }
}
Comment 3 Jacob Lifshay 2021-09-23 02:46:01 BST
changing milestone to match parent task for budget allocation
Comment 4 Luke Kenneth Casson Leighton 2022-10-01 01:31:41 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)

this one is remarkably simple: use the characteristics
of mapreduce mode to overwrite nm from a parallel
comparison

> int m2(int * const restrict a, int n) 
> { 
>    int m, nm; 
>    int i; 
> 
>    m = INT_MIN; 
>    nm = -1; 

first prepare offsets

    setvl vl=16
    sv.svstep *offs, srcstep # 0 1 2 ... vl-1

and nm-potential

    li nm, -1
    li nmbase, 0

next use CTR mode

>    for (i=0; i<n; i++) 
>    { 

    setvl r1,0,16,0,1,1 # VL=r1=MIN(MAXVL,CTR)

perform the load of the vector data, post-increment update:

     sv.ldux/pi *a, 8(a_ptr)

>        if (a[i] > m) 
>        { 

get max comparison, overlap-loop into m where reg a=m+1

    sv.max. *m, *a, *m

next, these two both use mapreduce mode with predication

>            m = a[i]; 

    sv.ori/mr/m=gt m,*a,0

>            nm = i;

add-overwriting base with vector-offset into nm, yes *scalar* nm,
the "last winner" is the largest index

    sv.add/mr/m=gt nm, nmbase, *offs

>        } 

nmbase must be incremented by vl

    add nmbase, nmbase, r1 # r1 contains copy of vl

branch and subtract VL from CTR if not end of loop

    sv.bc/all 16, *0, loop

>     } 
>     return nm; 

    mr r3, nmbase
    blr

the trick here is the multiple overwrites of nm and m, even though
they are scalar, the usual "termination of looping because scalar result"
is switched *off* in mapreduce mode.
thus due to the predication the last successful conditional overwrite
"wins"

the second trick was to use a base nm which increments by VL, from a
vector of sequential constants.

there will be better/other methods, using data-dependent failfirst
will stop at the first compare-fail but will need "continuation"
(2 nested loops) whereas the mapreduce one is really simple but
relies on WaW hazards to be eliminated
Comment 5 Luke Kenneth Casson Leighton 2022-10-23 16:58:54 BST
we now have "sv.maxs." which should do nicely to get the maximum *and* notify
(Rc=1 vector) whether the max occurred (combined cmp function).
see bug #1057 and https://libre-soc.org/openpower/sv/rfc/ls013/

  if predicated
with a prior scalar-vector cmp, although there would be two vector compares it
would reduce the amount of interdependencies (except on a pre-sorted list)
see bug #915 about combined cmp function
Comment 6 Luke Kenneth Casson Leighton 2023-04-26 16:28:26 BST
also from strncpy, load-post-increment

  48                 # load VL bytes (update r10 addr)
  49                 "sv.lbzup *16, 1(10)",
Comment 7 Luke Kenneth Casson Leighton 2023-04-26 17:02:21 BST
there is another technique which will work here: parallel-reduction
schedule in vertical-first mode.

the reason that vertical-first is needed is because in the middle
of the reduction the index has to be updated: it is 2 bits of
work not 1.
Comment 8 Luke Kenneth Casson Leighton 2023-04-26 19:06:24 BST
continuing
https://groups.google.com/g/comp.arch/c/5h6jzyESg5s/m/SNXgLyP8AgAJ
Comment 9 Andrey Miroshnikov 2023-10-31 10:34:19 GMT
You can find the compiler options for SFFS PowerISA, in the Makefile:
https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world;h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace
Comment 10 Jacob Lifshay 2023-11-01 01:59:25 GMT
(In reply to Andrey Miroshnikov from comment #9)
> You can find the compiler options for SFFS PowerISA, in the Makefile:
> https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world;
> h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace

looks like this is exactly what shriya needs, she asked me in a private email (conversation moved here) how to compile a C version of MAXLOC for SFFS.

so, basically, copy the hello_world directory, modify the C source to have the MAXLOC algorithm, and then run the Makefile, the Makefile already has all the -mno-vsx and similar options needed.
Comment 11 Luke Kenneth Casson Leighton 2023-11-01 03:00:25 GMT
(In reply to Jacob Lifshay from comment #10)
> (In reply to Andrey Miroshnikov from comment #9)
> > You can find the compiler options for SFFS PowerISA, in the Makefile:
> > https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world;
> > h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace

total overkill andrey. shriya will not know a fraction of these.

> looks like this is exactly what shriya needs, she asked me in a private
> email (conversation moved here) how to compile a C version of MAXLOC for
> SFFS.
> 
> so, basically, copy the hello_world directory, 

too complex.

> modify the C source to have
> the MAXLOC algorithm, and then run the Makefile, the Makefile already has
> all the -mno-vsx and similar options needed.

waaay too complex.

like the knuth examples. standalone command. no Makefile.

   gcc  -Os -mno-vsx -mno-altivec maxloc.c

jacob can you make sure i got that right. bare minimum options.
no warnings no "pick this standard", nothing. absolute bare minimum
options.
Comment 12 Jacob Lifshay 2023-11-01 03:16:41 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> like the knuth examples.

honestly those need Makefiles (and comments saying which one we use), since they have been mis-interpreted before when people were trying to figure out which #define is used (since there's like 10 options). We're using the last one, but they read an earlier one and started basing instruction proposals off it...
https://bugs.libre-soc.org/show_bug.cgi?id=1192#c1

Also because I had to spend several minutes trying to figure out how to build them with the right #defines since I forgot. 

> standalone command. no Makefile.

I think it should have a Makefile even if it's very simple, since that tells others how to build it and they don't have to type or copy/paste in the full command every time. the following should be sufficient (make sure indentation is tabs, not spaces, otherwise it won't work):

.PHONY: all clean

all: maxloc

maxloc: maxloc.c
    gcc -Os -mno-vsx -mno-altivec maxloc.c -o maxloc

clean:
    rm -rf maxloc
    
> 
>    gcc  -Os -mno-vsx -mno-altivec maxloc.c
> 
> jacob can you make sure i got that right.

that looks right, assuming you want a binary named a.out.
Comment 13 Luke Kenneth Casson Leighton 2023-11-01 11:44:13 GMT
(In reply to Jacob Lifshay from comment #12)
> (In reply to Luke Kenneth Casson Leighton from comment #11)
> > like the knuth examples.
> 
> honestly those need Makefiles (and comments saying which one we use), since
> they have been mis-interpreted before when people were trying to figure out
> which #define is used (since there's like 10 options). We're using the last
> one, but they read an earlier one and started basing instruction proposals
> off it...
> https://bugs.libre-soc.org/show_bug.cgi?id=1192#c1
> 
> Also because I had to spend several minutes trying to figure out how to
> build them with the right #defines since I forgot. 
> 
> > standalone command. no Makefile.
> 
> I think it should have a Makefile even if it's very simple

the command is going in the tutorial, but yes good idea.

> .PHONY: all clean
> 
> all: maxloc

too complex. too many lines and targets.  this is intended as a REAL
drastic simple tutorial.

shriya only use these two lines in a file named "Makefile".
as jacob says make sure the indent is a tab not spaces.

maxloc: maxloc.c
     gcc -Os -mno-vsx -mno-altivec maxloc.c -o maxloc


> >    gcc  -Os -mno-vsx -mno-altivec maxloc.c
> > 
> > jacob can you make sure i got that right.
> 
> that looks right, assuming you want a binary named a.out.

yes. short. simple.
Comment 14 shriya.sharma 2023-11-01 11:54:12 GMT
Alright.
Thanks
Comment 15 shriya.sharma 2023-11-01 16:35:30 GMT
(In reply to Jacob Lifshay from comment #10)
> (In reply to Andrey Miroshnikov from comment #9)
> > You can find the compiler options for SFFS PowerISA, in the Makefile:
> > https://git.libre-soc.org/?p=microwatt.git;a=tree;f=hello_world;
> > h=ad4f2ff25fb1ec8c3c0233c8d341a7e5ca6cdb94;hb=refs/heads/verilator_trace
> 
> looks like this is exactly what shriya needs, she asked me in a private
> email (conversation moved here) how to compile a C version of MAXLOC for
> SFFS.
> 
> so, basically, copy the hello_world directory, modify the C source to have
> the MAXLOC algorithm, and then run the Makefile, the Makefile already has
> all the -mno-vsx and similar options needed.
Okay, Thanks Jacob.
Comment 16 Luke Kenneth Casson Leighton 2023-11-01 17:28:18 GMT
(In reply to shriya.sharma from comment #15)

> Okay, Thanks Jacob.

(shriya, reminder: please trim all excess context. we can all
see and follow and read the entire thread: repetitions of exactly
what is in the comment above is just noise.  i have edited
comment #15 to demonstrate what you should do, *and* have trimmed
excess context above as well. can you please read this and absorb
it, but pay special attention to the "trim" section
https://people.kernel.org/tglx/notes-about-netiquette
).

now, please re-read comment #11, i appreciate jacob is helping but
he (sorry referring to you 3rd peron here jacob) tends to make things
much more complex - and robust - which would be extremely good *if*
that were necessary (which it often will be in many cases throughout
the project)

but not in this case. this is a tutorial/demo and the rules for
a tutorial/demo are to keep it STRICTLY to the bare minimum in order
to not have a high noise-to-signal ratio.

* we want to demonstrate the algorithm in SVP64 assembler vs c and python.
* the algorithm itself is only 12 lines long.
* the Makefile that jacob wrote is *longer than the algorithm*!
* the one that Andrey referred to is even longer and more complex.

we are not demonstrating "how to write perfect Makefiles"!
Comment 17 shriya.sharma 2023-11-08 16:21:22 GMT
link to the standalone c program for FORTRAN MAXLOC:

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=852dde97c78508d0ebf6f30a9ceed7b1bc8dc9f1
Comment 18 Andrey Miroshnikov 2023-11-08 16:41:47 GMT
(In reply to shriya.sharma from comment #17)
> link to the standalone c program for FORTRAN MAXLOC:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=514a48499f05e1cbf444d35f1ab7f7fd863c5e9f

Forgot to remove unused include.

We were getting a random error where the output was:
(ls2)am@tpam430:~/src/openpower-isa/maxloc$ ./maxloc 
Index of the maximum value in an array is: 8

Instead of:
(ls2)am@tpam430:~/src/openpower-isa/maxloc$ ./maxloc 
Index of the maximum value in an array is: 6
(Which is the correct result)

I only managed to get the incorrect result of 8 once; subsequent clean and compilation fixed the issue.
Comment 19 Luke Kenneth Casson Leighton 2023-11-08 20:28:35 GMT
i got rid of a *lot* of unnecessary crap. to repeat, and please listen:
this is intended as a *short* demonstration.  const: crap. limits.h: crap.
long "perfect" Makefile: i *already made it clear* to keep it to the bare
minimum. i really, *really* do not like being ignored, it is very disrespectful.
Comment 21 Jacob Lifshay 2023-11-16 22:21:12 GMT
i noticed that changing the initial value of m to 0 instead of INT_MIN is incorrect, since the algorithm uses signed integers. m starting at 0 (or actually any int at all) is incorrect since the algorithm ignores all array elements <= m, thereby giving the wrong answer for input [-5, -7, -6] or even [-1, 0, -2] -- both of those return -1 even though the correct answer is 0 and 1 respectively.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=maxloc/maxloc.c;h=6becf3ef26fb8f71e50dbd0bcb774c6d49d307c2;hb=f9d2e36493632a179e08f5eb3569b1f9f05d05d1
Comment 22 Jacob Lifshay 2023-11-16 22:25:45 GMT
(In reply to Jacob Lifshay from comment #21)
> i noticed that changing the initial value of m to 0 instead of INT_MIN is
> incorrect, since the algorithm uses signed integers. m starting at 0 (or
> actually any int at all)

a correct implementation starts with:
if(n == 0)
    return -1;
m = a[0];
nm = 0;

if you assume the input isn't empty, you can remove the if(n == 0) return -1;
Comment 23 Luke Kenneth Casson Leighton 2023-11-16 22:44:03 GMT
(In reply to Jacob Lifshay from comment #21)
> i noticed that changing the initial value of m to 0 instead of INT_MIN is
> incorrect, since the algorithm uses signed integers.

not in the least bit bothered. this is a demo not a hardened library implementation.

removing INT_MIN which should not have been added reduces the complexity

(In reply to Jacob Lifshay from comment #22)

> a correct implementation starts with:

again not relevant in the least bit: it changes the scope which
is not up for negotiation.

i repeat again this a *demo*. please listen. there is no point in
you providing "enhancements" as all they do is increase the number
of lines which reduces readability FOR A DEMO.

please LISTEN when i state very clearly what the scope is.

if you are tempted to make suggestions that increase the number
of lines *please refrain from writing anything at all* as all it
does is aggravate.

if on the other hand you find bugs or wish to help move this forward
then you are most welcome.
Comment 25 Luke Kenneth Casson Leighton 2023-11-17 14:27:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)

in-register version

first prepare offsets

    setvl vl=8
    sv.svstep *offs, srcstep # 0 1 2 ... vl-1

and nm-potential

    li nm, -1
    li nmbase, 0
    li idx, 0    # start at first

> while (i<n) { 
>     // skip up to first max
>     while (i<n && a[i]<=m)  i++;

get max comparison, overlap-loop into m where reg a=m+1,
and terminate at the first

    sv.max./ff=gt/vli *m, *a, idx

now the search will terminate at the first failure,
truncating VL to point at it.

next we must discard the numbers already analysed:

    sv.addi/sm=lt *m, *m, 0

next, these two both use mapreduce mode with predication.

>     // continue as long as picking new m
>     while (i<n && a[i]>m) { 

use the same max-trick, terminating this time at the greatest first number

    sv.max./ff=lt/vli *m, *a, idx

>        m = a[i]; 



>        nm = i;

add-overwriting base with vector-offset into nm, yes *scalar* nm,
the "last winner" is the largest index. reverse-gear would make
this easier (pick the first success)

    sv.add/mr/m=gt nm, nmbase, *offs

>        i++;
>     }
> }
Comment 26 Luke Kenneth Casson Leighton 2023-11-17 14:39:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #25)

> use the same max-trick, terminating this time at the greatest first number
> 
>     sv.max./ff=lt *m, *a, idx

arse, i just realised, this needs to be:

     sv.max./ff=lt idx, *a, idx

and, jacob, just like what you spotted earlier, with the other algorithm,
this will terminate at the first write into RT. bug #1183
https://bugs.libre-soc.org/show_bug.cgi?id=1183#c3
Comment 27 Luke Kenneth Casson Leighton 2023-12-09 09:10:35 GMT
first unit test added which demonstrates new "single" loop capability
on dd-ffirst

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b077590
Comment 28 Luke Kenneth Casson Leighton 2023-12-16 10:14:14 GMT
first cut/paste of maxloc test from ddffirst one

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=465b96f13b6
Comment 30 Jacob Lifshay 2024-02-02 21:07:09 GMT
maxloc has been merged to master, but whoever merged it forgot to check that it didn't break CI:

https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226784

=========================== short test summary info ============================
[case_divmod_knuth_algorithm_d_512x256_to_256x256] (d='0x1_0000_0000', n='0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_0000_0000', q='0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff', r='0x0') SUBFAIL src/openpower/decoder/isa/test_aaa_caller_svp64_powmod.py::TestPowMod15::test - TypeError: '<=' not supported between instances of 'NoneType' and 'int'
<another 9 instances of basically the same thing>
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 10 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 10 failures !!!!!!!!!!!!
= 10 failed, 87 passed, 1 skipped, 2 xfailed, 170 warnings in 445.69s (0:07:25)
Comment 31 Jacob Lifshay 2024-02-02 21:08:22 GMT
                # low 2 LSBs (CR field selector) remain same, CR num extended
>               assert regnum <= 7, "sigh, TODO, 128 CR fields"
E               TypeError: '<=' not supported between instances of 'NoneType' and 'int'

src/openpower/decoder/isa/caller.py:1666: TypeError
Comment 32 Luke Kenneth Casson Leighton 2024-02-02 22:23:12 GMT
(In reply to Jacob Lifshay from comment #30)
> maxloc has been merged to master, but whoever merged it forgot to check that
> it didn't break CI:

yes i couldn't.  complicated. i needed work from another branch.
rebase screwed up. a *lot*.
Comment 33 Jacob Lifshay 2024-02-02 22:33:50 GMT
(In reply to Luke Kenneth Casson Leighton from comment #32)
> (In reply to Jacob Lifshay from comment #30)
> > maxloc has been merged to master, but whoever merged it forgot to check that
> > it didn't break CI:
> 
> yes i couldn't.

yes you can, i'm not sure if you realized it but CI tests all branches and tags, so just push the changes to a branch and check it once it runs.

e.g. CI on the maxloc_676 branch with essentially the same errors:
https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226783
Comment 34 Jacob Lifshay 2024-02-02 22:41:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #32)
>  i needed work from another branch.

rebase one branch on the other, no need to mess with master.

> rebase screwed up. a *lot*.

I basically always run git log before pushing, it helps catch stuff like that.
Comment 35 Luke Kenneth Casson Leighton 2024-02-02 23:45:02 GMT
(In reply to Jacob Lifshay from comment #34)

> rebase one branch on the other, no need to mess with master.

not when there are inter-related developments and conflicts in all three.

> > rebase screwed up. a *lot*.
> 
> I basically always run git log before pushing, it helps catch stuff like
> that.

long story. half a dozen conflicts spread across both branches.
Comment 36 Luke Kenneth Casson Leighton 2024-02-02 23:46:32 GMT
(In reply to Jacob Lifshay from comment #30)

> https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/5226784

    An error occurred while fetching the job.

i'll have a look tomorrow
Comment 37 Luke Kenneth Casson Leighton 2024-02-03 08:30:31 GMT
(In reply to Luke Kenneth Casson Leighton from comment #36)

> i'll have a look tomorrow

sorted a lot of them - running on my laptop drained 80% of the
inverter/battery as it was *over two hours* so i will not be
running the powmod tests. jacob when adding unit tests you
really do need to be mindful that i am in severely resource-limited
circumstances? (no, remote-server-login is not "the solution")
Comment 38 Luke Kenneth Casson Leighton 2024-02-05 21:25:21 GMT
hooray! used new crternlogi to reduce two cr field instructions down
to one.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=0ec8094
Comment 39 Luke Kenneth Casson Leighton 2024-02-05 21:41:41 GMT
bit of a cleanup, but basically this is all good now.
11 instructions. not bad.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=1f5b60672
Comment 40 Luke Kenneth Casson Leighton 2024-02-06 13:53:12 GMT
starting to update page, added assembler

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=986e7d0
Comment 41 Luke Kenneth Casson Leighton 2024-02-12 10:32:17 GMT
done writeup of maxloc. pretty efficient inner loop, 11 instructions.
outer loop would be another... 4? 

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=0bd5a7b00ec
Comment 42 Luke Kenneth Casson Leighton 2024-02-13 12:10:40 GMT
found a way to reduce the number of instructions. down to 10.

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=7f45d3