Bug 644 - MP3: Basic SV impl
Summary: MP3: Basic SV impl
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: cand
URL:
Depends on: 647
Blocks: 218
  Show dependency treegraph
 
Reported: 2021-05-14 13:19 BST by cand
Modified: 2021-06-28 19:26 BST (History)
2 users (show)

See Also:
NLnet milestone: NLNet.2019.Video
total budget (EUR) for completion of task and all subtasks: 225
budget (EUR) for this task, excluding subtasks' budget: 225
parent task for budget allocation: 218
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description cand 2021-05-14 13:19:11 BST
Write basic SV implementations, run on the simulator.

reference source:
https://ffmpeg.org/doxygen/2.8/mpegaudiodsp__template_8c_source.html
Comment 1 cand 2021-05-14 13:33:38 BST
x86 reference numbers for SSE:
apply_window 3.75x
imdct36 3.15x
Comment 2 Luke Kenneth Casson Leighton 2021-05-14 16:30:31 BST
Lauri let me know, what the minimum operations you need in the simulator
are.  if it is just for example FP add, FP mul, FP sub, FP LOAD and FP STORE
i can likely get that done quite quickly.

(Also bear in mind, the basic SV implementation, due to the way that
Vector ISAs work, is almost 100% likely to be the optimised version: the
focus is more on the "basic" version informing us at a very early stage
what *instructions* are needed, more than "how optimised the actual
assembler is")
Comment 3 cand 2021-05-14 16:36:38 BST
You can look at the asm as soon as we get the repo going, would that work?

Yes, the optimized version includes new instructions as the bug says. If it turns out no new instructions are useful, the tests and measurements to verify that cover the optimized bug.
Comment 4 cand 2021-05-14 16:52:04 BST
(clarification: of the C baseline, I haven't yet started the SV version)
Comment 5 Luke Kenneth Casson Leighton 2021-05-16 16:48:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> are.  if it is just for example FP add, FP mul, FP sub, FP LOAD and FP STORE
> i can likely get that done quite quickly

the preliminary ops are now in the simulator, and should be functional as long as exceptions, CR1, and use of FPSCR are avoided for now.

if CR1 or other status bits are needed do let me know.
Comment 6 Luke Kenneth Casson Leighton 2021-06-17 11:31:30 BST
apply_window working well
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003164.html
Comment 7 Luke Kenneth Casson Leighton 2021-06-19 17:03:28 BST
discussion of imdct36
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003178.html
Comment 8 cand 2021-06-19 17:25:45 BST
I have not received several mails in that thread. Possibly others missing too. Not in spam.
Comment 9 cand 2021-06-19 17:31:50 BST
On the questions such as "for temporaries t0 t1 etc were
you planning to drop those into stack, with an offset 16, then
*reread* them back with an offset of 4?" or the last mail's five loop, I can't answer those until I have time to look at the thing properly.
Comment 10 Luke Kenneth Casson Leighton 2021-06-23 14:23:22 BST
(In reply to cand from comment #8)
> I have not received several mails in that thread. Possibly others missing
> too. Not in spam.

very weird, they're going out. i checked server logs.

(In reply to cand from comment #9)
> On the questions such as "for temporaries t0 t1 etc were
> you planning to drop those into stack, with an offset 16, then
> *reread* them back with an offset of 4?" or the last mail's five loop, I
> can't answer those until I have time to look at the thing properly.

ok.

well i spent several days staring at imdct36, i am confident now that no more needs to be added to get the "basic" version running. (please ignore the FFT work i am currently doing, that is for the optimised version)

* sv.add/mrr/m=r30 you can use for adding in *reverse* order, setting r30=0b010101

this will cover the for-loop for (i=17; i >= ... i-=2)

the first big procesing loop, for i 0 to 1, VL can be set to 2

the second, VL can be set to *five* but the 2nd of each group of operations, a predicate mask can be set to 0b01111.

my feeling is, it may be best to actually morph the c code a little, to make it clearer.

if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data, i am happy to do some code-morphs that remove the very last block (out of the forloop 0..3) and increase the loop to 0..4 with conditional code.
Comment 11 cand 2021-06-23 15:02:30 BST
> if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data
That already exists?
Comment 12 cand 2021-06-23 15:49:52 BST
On the mails, everything from Monday onwards was delivered correctly.
Comment 13 Luke Kenneth Casson Leighton 2021-06-23 16:15:04 BST
(In reply to cand from comment #11)
> > if you can add a small test Makefile plus main.c like for the mp3_0 test, reading and comparing data
> That already exists?

for the c code, no.


lkcl@fizzy:~/src/libresoc/openpower-isa$ !fi
find . -name "Makefile"
./src/openpower/simulator/qemu_test/Makefile
./src/test/basic_pypowersim/Makefile
./src/test/basic_svp64_trans/Makefile
./src/test/basic_pypowersim_fp/Makefile
./Makefile
./media/Makefile

media/Makefile does not contain any reference to apply_window_standalone.c
or to imdct36_standalone.c

lkcl@fizzy:~/src/libresoc/openpower-isa/media$ ls data/audio/mp3/mp3_0_data/
a.out                      buf3000  buf8000       samples1000  samples6000
apply_window_standalone.c  buf4000  buf9000       samples2000  samples7000
buf0                       buf5000  main.c        samples3000  samples8000
buf1000                    buf6000  out_samples0  samples4000  samples9000
buf2000                    buf7000  samples0      samples5000  win0

no Makefile.
lkcl@fizzy:~/src/libresoc/openpower-isa/media$ ls data/audio/mp3/mp3_1_data/
beforeout1   beforeout5  buf20                 in14  main.c  out3   win19
beforeout2   buf14       imdct36_standalone.c  in3   out18   win12  win8

ah!  found main.c in there.  excellent ok.

i can work with that.

what i propose is, to copy imdct36_standalone.c and main.c into
openpower-isa/media/audio/mp3/mp3_1/

and commit them

then adjust them to work relative to the media/Makefile directory

then, make a *second* copy which is "a little bit more vector-like"

sound good?
Comment 14 cand 2021-06-23 16:34:16 BST
They weren't intended for running the media tests, just as a sanity check the data matches on x86, etc. So I wouldn't bind them into the main makefile tests at least. If you want to store modified versions, maybe with their own makefile target, that's fine.
Comment 15 Luke Kenneth Casson Leighton 2021-06-23 16:39:28 BST
(In reply to cand from comment #14)
> They weren't intended for running the media tests, just as a sanity check
> the data matches on x86, etc.

yehyeh.  except, the idea for incrementing the loop size from 4 to 5
and excluding the bits that aren't needed with a predicate 0b01111,
that idea needs checking.

i mean the section at the end
    s0 = tmp[16];
    s1 = MULH3(tmp[17], icos36h[4], 2);
    ...

all of that can go inside the loop with for j = 0 j < **5**...
but its partner part can't (icos36[8-j) and s2 + s3), that would over-run
the array, so has to be predicated, hence 0b01111.

> So I wouldn't bind them into the main makefile
> tests at least. If you want to store modified versions, maybe with their own
> makefile target, that's fine.

ok good call.
Comment 16 Luke Kenneth Casson Leighton 2021-06-28 19:26:29 BST
done
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=ea780569b30b81b07e20e4cba53673203df24af2

* pretend-predicate 0b01111 added
* loop count increased to 5
* MULH3 on icos36h works out to be exactly as
  the hard-coded constants 16, 17, 9+4, 8-4 etc.
* hard-coded block now removed.

interestingly it wasn't as straightforward as i first imagined,
but retrospectively it's logical: s1 and s2 are initially created
from t1 and t0, the name-changes of the variables in the last
(now deleted) block confused me.

the predicate mask can be hard-coded to 0b01111 with an immediate,
which results in t1, s2 and t3 being set to zero on that last block.

i'll add some comments.