Write basic SV implementations, run on the simulator. reference source: https://ffmpeg.org/doxygen/2.8/mpegaudiodsp__template_8c_source.html
x86 reference numbers for SSE: apply_window 3.75x imdct36 3.15x
Lauri let me know, what the minimum operations you need in the simulator are. if it is just for example FP add, FP mul, FP sub, FP LOAD and FP STORE i can likely get that done quite quickly. (Also bear in mind, the basic SV implementation, due to the way that Vector ISAs work, is almost 100% likely to be the optimised version: the focus is more on the "basic" version informing us at a very early stage what *instructions* are needed, more than "how optimised the actual assembler is")
You can look at the asm as soon as we get the repo going, would that work? Yes, the optimized version includes new instructions as the bug says. If it turns out no new instructions are useful, the tests and measurements to verify that cover the optimized bug.
(clarification: of the C baseline, I haven't yet started the SV version)
(In reply to Luke Kenneth Casson Leighton from comment #2) > are. if it is just for example FP add, FP mul, FP sub, FP LOAD and FP STORE > i can likely get that done quite quickly the preliminary ops are now in the simulator, and should be functional as long as exceptions, CR1, and use of FPSCR are avoided for now. if CR1 or other status bits are needed do let me know.
apply_window working well http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003164.html
discussion of imdct36 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003178.html
I have not received several mails in that thread. Possibly others missing too. Not in spam.
On the questions such as "for temporaries t0 t1 etc were you planning to drop those into stack, with an offset 16, then *reread* them back with an offset of 4?" or the last mail's five loop, I can't answer those until I have time to look at the thing properly.
(In reply to cand from comment #8) > I have not received several mails in that thread. Possibly others missing > too. Not in spam. very weird, they're going out. i checked server logs. (In reply to cand from comment #9) > On the questions such as "for temporaries t0 t1 etc were > you planning to drop those into stack, with an offset 16, then > *reread* them back with an offset of 4?" or the last mail's five loop, I > can't answer those until I have time to look at the thing properly. ok. well i spent several days staring at imdct36, i am confident now that no more needs to be added to get the "basic" version running. (please ignore the FFT work i am currently doing, that is for the optimised version) * sv.add/mrr/m=r30 you can use for adding in *reverse* order, setting r30=0b010101 this will cover the for-loop for (i=17; i >= ... i-=2) the first big procesing loop, for i 0 to 1, VL can be set to 2 the second, VL can be set to *five* but the 2nd of each group of operations, a predicate mask can be set to 0b01111. my feeling is, it may be best to actually morph the c code a little, to make it clearer. if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data, i am happy to do some code-morphs that remove the very last block (out of the forloop 0..3) and increase the loop to 0..4 with conditional code.
> if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data That already exists?
On the mails, everything from Monday onwards was delivered correctly.
(In reply to cand from comment #11) > > if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data > That already exists? for the c code, no. lkcl@fizzy:~/src/libresoc/openpower-isa$ !fi find . -name "Makefile" ./src/openpower/simulator/qemu_test/Makefile ./src/test/basic_pypowersim/Makefile ./src/test/basic_svp64_trans/Makefile ./src/test/basic_pypowersim_fp/Makefile ./Makefile ./media/Makefile media/Makefile does not contain any reference to apply_window_standalone.c or to imdct36_standalone.c lkcl@fizzy:~/src/libresoc/openpower-isa/media$ ls data/audio/mp3/mp3_0_data/ a.out buf3000 buf8000 samples1000 samples6000 apply_window_standalone.c buf4000 buf9000 samples2000 samples7000 buf0 buf5000 main.c samples3000 samples8000 buf1000 buf6000 out_samples0 samples4000 samples9000 buf2000 buf7000 samples0 samples5000 win0 no Makefile. lkcl@fizzy:~/src/libresoc/openpower-isa/media$ ls data/audio/mp3/mp3_1_data/ beforeout1 beforeout5 buf20 in14 main.c out3 win19 beforeout2 buf14 imdct36_standalone.c in3 out18 win12 win8 ah! found main.c in there. excellent ok. i can work with that. what i propose is, to copy imdct36_standalone.c and main.c into openpower-isa/media/audio/mp3/mp3_1/ and commit them then adjust them to work relative to the media/Makefile directory then, make a *second* copy which is "a little bit more vector-like" sound good?
They weren't intended for running the media tests, just as a sanity check the data matches on x86, etc. So I wouldn't bind them into the main makefile tests at least. If you want to store modified versions, maybe with their own makefile target, that's fine.
(In reply to cand from comment #14) > They weren't intended for running the media tests, just as a sanity check > the data matches on x86, etc. yehyeh. except, the idea for incrementing the loop size from 4 to 5 and excluding the bits that aren't needed with a predicate 0b01111, that idea needs checking. i mean the section at the end s0 = tmp[16]; s1 = MULH3(tmp[17], icos36h[4], 2); ... all of that can go inside the loop with for j = 0 j < **5**... but its partner part can't (icos36[8-j) and s2 + s3), that would over-run the array, so has to be predicated, hence 0b01111. > So I wouldn't bind them into the main makefile > tests at least. If you want to store modified versions, maybe with their own > makefile target, that's fine. ok good call.
done https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ea780569b30b81b07e20e4cba53673203df24af2 * pretend-predicate 0b01111 added * loop count increased to 5 * MULH3 on icos36h works out to be exactly as the hard-coded constants 16, 17, 9+4, 8-4 etc. * hard-coded block now removed. interestingly it wasn't as straightforward as i first imagined, but retrospectively it's logical: s1 and s2 are initially created from t1 and t0, the name-changes of the variables in the last (now deleted) block confused me. the predicate mask can be hard-coded to 0b01111 with an immediate, which results in t1, s2 and t3 being set to zero on that last block. i'll add some comments.