Bug 228 - VP9 optimizations
Summary: VP9 optimizations
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Konstantinos Margaritis (markos)
URL:
Depends on:
Blocks: 137
Reported: 2020-03-13 09:58 GMT by cand
Modified: 2022-10-13 09:43 BST
CC List: 2 users

See Also:
NLnet milestone: NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks: 3000
budget (EUR) for this task, excluding subtasks' budget: 3000
parent task for budget allocation: 137
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
markos={amount=3000, submitted=2022-09-26, paid=2022-10-05}


Comment 2 Luke Kenneth Casson Leighton 2022-09-17 16:48:47 BST
commit 8dc870cfe6b219c4ebf653456b58a7147ff4e04a (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sat Sep 17 16:48:05 2022 +0100

    add libgtest-dev to install-hdl-apt-reqs


https://git.libre-soc.org/?p=dev-env-setup.git;a=commitdiff;h=8dc870cfe6b219c4ebf653456b58a7147ff4e04a
Comment 5 Konstantinos Margaritis (markos) 2022-09-26 07:55:14 BST
OK, the SVP64 versions of a few variance functions used in VP9 should be complete as of this commit:

https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=101e3a30f90f567eaa2b7f5f7fd2306a04bfcad4

As a short explanation: I implemented some glue C code that calls the Python simulator (pypowersim), plus wrapper functions that are called from the actual VP9 testsuite. When a wrapper function is called, it gathers the arguments and memory that the function needs using the CPython API and then calls the function. When done, it takes the result object from the simulator, retrieves the memory and/or registers in which the function's results are expected, and, again using the CPython API, returns the results to the caller (in this case the VP9 testsuite).
This allowed a number of frequently called functions of the VP8/VP9 codec to be converted to SVP64 assembly without even having access to the hardware. It is really slow, though, so I had to reduce the number of iterations to far fewer than the actual count.

The tests can be run with "make all" followed by:

$ ./libvpx_variance_test
[==========] Running 104 tests from 8 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from C/VpxSseTest
[ RUN      ] C/VpxSseTest.RefSse/0
[       OK ] C/VpxSseTest.RefSse/0 (0 ms)
[ RUN      ] C/VpxSseTest.MaxSse/0
[       OK ] C/VpxSseTest.MaxSse/0 (0 ms)
[----------] 2 tests from C/VpxSseTest (1 ms total)

[----------] 2 tests from SVP64/VpxSseTest
[ RUN      ] SVP64/VpxSseTest.RefSse/0
[       OK ] SVP64/VpxSseTest.RefSse/0 (55276 ms)
[ RUN      ] SVP64/VpxSseTest.MaxSse/0
[       OK ] SVP64/VpxSseTest.MaxSse/0 (19063 ms)
[----------] 2 tests from SVP64/VpxSseTest (74339 ms total)

[----------] 8 tests from C/VpxMseTest
[ RUN      ] C/VpxMseTest.RefMse/0
[       OK ] C/VpxMseTest.RefMse/0 (0 ms)
[ RUN      ] C/VpxMseTest.RefMse/1
[       OK ] C/VpxMseTest.RefMse/1 (0 ms)
[ RUN      ] C/VpxMseTest.RefMse/2
[       OK ] C/VpxMseTest.RefMse/2 (0 ms)
[ RUN      ] C/VpxMseTest.RefMse/3
[       OK ] C/VpxMseTest.RefMse/3 (0 ms)
[ RUN      ] C/VpxMseTest.MaxMse/0
[       OK ] C/VpxMseTest.MaxMse/0 (0 ms)
[ RUN      ] C/VpxMseTest.MaxMse/1
[       OK ] C/VpxMseTest.MaxMse/1 (0 ms)
[ RUN      ] C/VpxMseTest.MaxMse/2
[       OK ] C/VpxMseTest.MaxMse/2 (0 ms)
[ RUN      ] C/VpxMseTest.MaxMse/3
[       OK ] C/VpxMseTest.MaxMse/3 (0 ms)
[----------] 8 tests from C/VpxMseTest (0 ms total)

[----------] 8 tests from SVP64/VpxMseTest
[ RUN      ] SVP64/VpxMseTest.RefMse/0
[       OK ] SVP64/VpxMseTest.RefMse/0 (611909 ms)
[ RUN      ] SVP64/VpxMseTest.RefMse/1
[       OK ] SVP64/VpxMseTest.RefMse/1 (326659 ms)
[ RUN      ] SVP64/VpxMseTest.RefMse/2
[       OK ] SVP64/VpxMseTest.RefMse/2 (340466 ms)
[ RUN      ] SVP64/VpxMseTest.RefMse/3
[       OK ] SVP64/VpxMseTest.RefMse/3 (193487 ms)
[ RUN      ] SVP64/VpxMseTest.MaxMse/0
[       OK ] SVP64/VpxMseTest.MaxMse/0 (209029 ms)
[ RUN      ] SVP64/VpxMseTest.MaxMse/1
[ RUN      ] SVP64/VpxMseTest.MaxMse/3
[       OK ] SVP64/VpxMseTest.MaxMse/3 (66713 ms)
[----------] 8 tests from SVP64/VpxMseTest (1976552 ms total)

[----------] 40 tests from C/VpxVarianceTest
[ RUN      ] C/VpxVarianceTest.Zero/0
[       OK ] C/VpxVarianceTest.Zero/0 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/1
[       OK ] C/VpxVarianceTest.Zero/1 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/2
[       OK ] C/VpxVarianceTest.Zero/2 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/3
[       OK ] C/VpxVarianceTest.Zero/3 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/4
[       OK ] C/VpxVarianceTest.Zero/4 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/5
[       OK ] C/VpxVarianceTest.Zero/5 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/6
[       OK ] C/VpxVarianceTest.Zero/6 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/7                                                                                                                                                                                                                                                   
[       OK ] C/VpxVarianceTest.Zero/7 (0 ms)
[ RUN      ] C/VpxVarianceTest.Zero/8
[       OK ] C/VpxVarianceTest.Zero/8 (0 ms) 
[ RUN      ] C/VpxVarianceTest.Zero/9                                                                                                                                                                                                                                                   
[       OK ] C/VpxVarianceTest.Zero/9 (0 ms)                      
[ RUN      ] C/VpxVarianceTest.Ref/0        
[       OK ] C/VpxVarianceTest.Ref/0 (0 ms)
[ RUN      ] C/VpxVarianceTest.Ref/1              
[       OK ] C/VpxVarianceTest.Ref/1 (0 ms)
[ RUN      ] C/VpxVarianceTest.Ref/2         
[       OK ] C/VpxVarianceTest.Ref/2 (0 ms)                                                                                                                                                                                                                                             
[ RUN      ] C/VpxVarianceTest.Ref/3                              
[       OK ] C/VpxVarianceTest.Ref/3 (0 ms)
[ RUN      ] C/VpxVarianceTest.Ref/4     
[       OK ] C/VpxVarianceTest.Ref/4 (0 ms)  
[ RUN      ] C/VpxVarianceTest.Ref/5                                                                                                                                                                                                                                                    
[       OK ] C/VpxVarianceTest.Ref/5 (1 ms)                       
[ RUN      ] C/VpxVarianceTest.Ref/6       
[       OK ] C/VpxVarianceTest.Ref/6 (0 ms)
[ RUN      ] C/VpxVarianceTest.Ref/7         
[       OK ] C/VpxVarianceTest.Ref/7 (0 ms)                                                                                                                                                                                                                                             
[ RUN      ] C/VpxVarianceTest.Ref/8                              
[       OK ] C/VpxVarianceTest.Ref/8 (0 ms)
[ RUN      ] C/VpxVarianceTest.Ref/9      
[       OK ] C/VpxVarianceTest.Ref/9 (0 ms)       
[ RUN      ] C/VpxVarianceTest.RefStride/0
[       OK ] C/VpxVarianceTest.RefStride/0 (0 ms)
[ RUN      ] C/VpxVarianceTest.RefStride/1                                                                                                                                                                                                                                              
[       OK ] C/VpxVarianceTest.RefStride/1 (0 ms)                 
[ RUN      ] C/VpxVarianceTest.RefStride/2   
[       OK ] C/VpxVarianceTest.RefStride/2 (0 ms)
[ RUN      ] C/VpxVarianceTest.RefStride/3        
[       OK ] C/VpxVarianceTest.RefStride/3 (0 ms)
[ RUN      ] C/VpxVarianceTest.RefStride/4   
[       OK ] C/VpxVarianceTest.RefStride/4 (0 ms)                                                                                                                                                                                                                                       
[ RUN      ] C/VpxVarianceTest.RefStride/5                        
[       OK ] C/VpxVarianceTest.RefStride/5 (0 ms)
[ RUN      ] C/VpxVarianceTest.RefStride/6
[       OK ] C/VpxVarianceTest.RefStride/6 (0 ms) 
[ RUN      ] C/VpxVarianceTest.RefStride/7
[       OK ] C/VpxVarianceTest.RefStride/7 (0 ms)
[ RUN      ] C/VpxVarianceTest.RefStride/8                                                                                                                                                                                                                                              
[       OK ] C/VpxVarianceTest.RefStride/8 (0 ms)                 
[ RUN      ] C/VpxVarianceTest.RefStride/9  
[       OK ] C/VpxVarianceTest.RefStride/9 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/0       
[       OK ] C/VpxVarianceTest.OneQuarter/0 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/1  
[       OK ] C/VpxVarianceTest.OneQuarter/1 (0 ms)                                                                                                                                                                                                                                      
[ RUN      ] C/VpxVarianceTest.OneQuarter/2                       
[       OK ] C/VpxVarianceTest.OneQuarter/2 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/3
[       OK ] C/VpxVarianceTest.OneQuarter/3 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/4                  
[       OK ] C/VpxVarianceTest.OneQuarter/4 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/5 
[       OK ] C/VpxVarianceTest.OneQuarter/5 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/6 
[       OK ] C/VpxVarianceTest.OneQuarter/6 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/7 
[       OK ] C/VpxVarianceTest.OneQuarter/7 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/8 
[       OK ] C/VpxVarianceTest.OneQuarter/8 (0 ms)
[ RUN      ] C/VpxVarianceTest.OneQuarter/9 
[       OK ] C/VpxVarianceTest.OneQuarter/9 (0 ms)
[----------] 40 tests from C/VpxVarianceTest (1 ms total)
                                     
[----------] 40 tests from SVP64/VpxVarianceTest
[ RUN      ] SVP64/VpxVarianceTest.Zero/0
[       OK ] SVP64/VpxVarianceTest.Zero/0 (3115258 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/1
[       OK ] SVP64/VpxVarianceTest.Zero/1 (1599237 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/2
[       OK ] SVP64/VpxVarianceTest.Zero/2 (1632482 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/3
[       OK ] SVP64/VpxVarianceTest.Zero/3 (866733 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/4
[       OK ] SVP64/VpxVarianceTest.Zero/4 (488774 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/5
[       OK ] SVP64/VpxVarianceTest.Zero/5 (506917 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/6
[       OK ] SVP64/VpxVarianceTest.Zero/6 (315762 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/7
[       OK ] SVP64/VpxVarianceTest.Zero/7 (220856 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/8
[       OK ] SVP64/VpxVarianceTest.Zero/8 (235377 ms)
[ RUN      ] SVP64/VpxVarianceTest.Zero/9
[       OK ] SVP64/VpxVarianceTest.Zero/9 (189669 ms)
[ RUN      ] SVP64/VpxVarianceTest.Ref/0
[       OK ] SVP64/VpxVarianceTest.Ref/0 (2390526 ms)                                                                                                                                                                                                                                   
[ RUN      ] SVP64/VpxVarianceTest.Ref/1
[       OK ] SVP64/VpxVarianceTest.Ref/1 (1254458 ms)                                                                                                                                                               
[ RUN      ] SVP64/VpxVarianceTest.Ref/2
[       OK ] SVP64/VpxVarianceTest.Ref/2 (1280849 ms)                                                                                                                                                                                                                                   
[ RUN      ] SVP64/VpxVarianceTest.Ref/3
[       OK ] SVP64/VpxVarianceTest.Ref/3 (700744 ms)                                                                                                                                                                                                                                    
[ RUN      ] SVP64/VpxVarianceTest.Ref/4
[       OK ] SVP64/VpxVarianceTest.Ref/4 (414212 ms)                                                                                                                                                                                                                                    
[ RUN      ] SVP64/VpxVarianceTest.Ref/5
[       OK ] SVP64/VpxVarianceTest.Ref/5 (428765 ms)                                                                                                                                                                
[ RUN      ] SVP64/VpxVarianceTest.Ref/6
[       OK ] SVP64/VpxVarianceTest.Ref/6 (280934 ms)                                                                                                                                                                
[ RUN      ] SVP64/VpxVarianceTest.Ref/7
[       OK ] SVP64/VpxVarianceTest.Ref/7 (210813 ms)                                                                                                                                                                
[ RUN      ] SVP64/VpxVarianceTest.Ref/8
[       OK ] SVP64/VpxVarianceTest.Ref/8 (219275 ms)                                                                                                                                                                
[ RUN      ] SVP64/VpxVarianceTest.Ref/9
[       OK ] SVP64/VpxVarianceTest.Ref/9 (183868 ms)                                                                                                                                                                
[ RUN      ] SVP64/VpxVarianceTest.RefStride/0
[       OK ] SVP64/VpxVarianceTest.RefStride/0 (2431067 ms)                                                                                                                                                         
[ RUN      ] SVP64/VpxVarianceTest.RefStride/1
[       OK ] SVP64/VpxVarianceTest.RefStride/1 (1291792 ms)                                                                                                                                                         
[ RUN      ] SVP64/VpxVarianceTest.RefStride/2
[       OK ] SVP64/VpxVarianceTest.RefStride/2 (1313705 ms)                                                                                                                                                         
[ RUN      ] SVP64/VpxVarianceTest.RefStride/3
[       OK ] SVP64/VpxVarianceTest.RefStride/3 (741109 ms)                                                                                                                                                          
[ RUN      ] SVP64/VpxVarianceTest.RefStride/4
[       OK ] SVP64/VpxVarianceTest.RefStride/4 (454184 ms)                                                                                                                                                                                                                              
[ RUN      ] SVP64/VpxVarianceTest.RefStride/5
[       OK ] SVP64/VpxVarianceTest.RefStride/5 (467841 ms)                                                                                                                                                                                                                              
[ RUN      ] SVP64/VpxVarianceTest.RefStride/6
[       OK ] SVP64/VpxVarianceTest.RefStride/6 (324436 ms)
[ RUN      ] SVP64/VpxVarianceTest.RefStride/7
[       OK ] SVP64/VpxVarianceTest.RefStride/7 (254069 ms)
[ RUN      ] SVP64/VpxVarianceTest.RefStride/8
[       OK ] SVP64/VpxVarianceTest.RefStride/8 (263426 ms)        
[ RUN      ] SVP64/VpxVarianceTest.RefStride/9
[       OK ] SVP64/VpxVarianceTest.RefStride/9 (229721 ms)        
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/0
[       OK ] SVP64/VpxVarianceTest.OneQuarter/0 (826340 ms)       
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/1
[       OK ] SVP64/VpxVarianceTest.OneQuarter/1 (443795 ms)
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/2
[       OK ] SVP64/VpxVarianceTest.OneQuarter/2 (450765 ms)                                                                                                                                                                                                                             
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/3
[       OK ] SVP64/VpxVarianceTest.OneQuarter/3 (258029 ms)                                                                                                                                                         
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/4
[       OK ] SVP64/VpxVarianceTest.OneQuarter/4 (160789 ms)                                                                                                                                                                                                                             
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/5
[       OK ] SVP64/VpxVarianceTest.OneQuarter/5 (164404 ms)
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/6
[       OK ] SVP64/VpxVarianceTest.OneQuarter/6 (115362 ms)       
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/7
[       OK ] SVP64/VpxVarianceTest.OneQuarter/7 (91212 ms)
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/8
[       OK ] SVP64/VpxVarianceTest.OneQuarter/8 (93363 ms)        
[ RUN      ] SVP64/VpxVarianceTest.OneQuarter/9
[       OK ] SVP64/VpxVarianceTest.OneQuarter/9 (80609 ms)                                                                                                                                                          
[----------] 40 tests from SVP64/VpxVarianceTest (26991529 ms total)
                                               
[----------] 2 tests from C/SumOfSquaresTest 
[ RUN      ] C/SumOfSquaresTest.Const/0                                                                                                                                                                                                                                                 
[       OK ] C/SumOfSquaresTest.Const/0 (0 ms)                    
[ RUN      ] C/SumOfSquaresTest.Ref/0        
[       OK ] C/SumOfSquaresTest.Ref/0 (0 ms)                                                                                                                                                                        
[----------] 2 tests from C/SumOfSquaresTest (1 ms total)         
                                               
[----------] 2 tests from SVP64/SumOfSquaresTest
[ RUN      ] SVP64/SumOfSquaresTest.Const/0
[       OK ] SVP64/SumOfSquaresTest.Const/0 (636899 ms)
[ RUN      ] SVP64/SumOfSquaresTest.Ref/0
[       OK ] SVP64/SumOfSquaresTest.Ref/0 (649444 ms)
[----------] 2 tests from SVP64/SumOfSquaresTest (1286343 ms total)
                                                                                                                                                                                                                                                                                        
[----------] Global test environment tear-down                    
[==========] 104 tests from 8 test suites ran. (30328767 ms total)
[  PASSED  ] 104 tests.
Comment 6 Luke Kenneth Casson Leighton 2022-09-30 09:26:13 BST
from Konstantinos:

OK, so a little overview first. The goal was to port VP8 and VP9 code to SVP64 assembly.
To establish that any new code works properly with VP8 and VP9, the library itself includes a testsuite with many unit tests, which ensure that a function always produces the same result and thus the same bit-exact video on all platforms.
So the best way to ensure that our SVP64 VP8 and VP9 code works is to run the testsuite. But we cannot do that, because there is no hardware capable of SVP64 yet, not even in FPGA form. The only thing available is the Python Power Simulator (pypowersim), which is actually the reference simulator.
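
To make the bit-exact idea concrete, here is a minimal sketch of how such a unit test can compare a candidate implementation against a plain C reference over random inputs. It is not the testsuite's actual code: the names sse4x4_ref, sse4x4_candidate and count_mismatches are hypothetical, and the candidate is just a second C routine standing in for an optimised version.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t (*sse4x4_fn)(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride);

/* Plain C reference: sum of squared differences over a 4x4 block. */
uint32_t sse4x4_ref(const uint8_t *src, int src_stride,
                    const uint8_t *ref, int ref_stride) {
    uint32_t sse = 0;
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            int diff = src[c] - ref[c];
            sse += (uint32_t)(diff * diff);
        }
        src += src_stride;
        ref += ref_stride;
    }
    return sse;
}

/* Hypothetical candidate (stand-in for an optimised version):
 * same arithmetic, but walking the block column by column. */
uint32_t sse4x4_candidate(const uint8_t *src, int src_stride,
                          const uint8_t *ref, int ref_stride) {
    uint32_t sse = 0;
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
            sse += (uint32_t)(diff * diff);
        }
    }
    return sse;
}

/* Runs both functions on random 4x4 blocks and returns the number of
 * trials where the results differ (0 means bit-exact on every trial). */
int count_mismatches(sse4x4_fn ref_fn, sse4x4_fn test_fn,
                     unsigned seed, int trials) {
    uint8_t src[16], ref[16];
    int bad = 0;
    srand(seed);
    for (int t = 0; t < trials; ++t) {
        for (int i = 0; i < 16; ++i) {
            src[i] = (uint8_t)(rand() & 0xff);
            ref[i] = (uint8_t)(rand() & 0xff);
        }
        if (ref_fn(src, 4, ref, 4) != test_fn(src, 4, ref, 4))
            ++bad;
    }
    return bad;
}
```

Any mismatch count other than zero would mean the candidate is not bit-exact with the reference.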

Now, we *could* in theory run the whole VP8 & VP9 testsuite inside this simulator, but since it is at least 2000-5000 times slower, what now takes a few seconds in the testsuite would take about 10 hours! So we had to find an alternative until we can run on actual hardware, an FPGA, or a faster simulator (cavatools?).
What I came up with, and it has proved to work very well, is to run the whole testsuite in native mode and run *only* the SVP64 functions inside pypowersim. For this, I created a wrapper function that provides the glue between the native C code and pypowersim, which runs in Python. I use the Python C API to literally construct the arguments needed by the function in question. Take, for example, this function, which can be seen in variance_ref.c:


uint32_t vpx_get4x4sse_cs_svp64(const uint8_t *src_ptr, int src_stride,
                                const uint8_t *ref_ptr, int ref_stride)

Following the ABI convention, this takes 4 arguments in registers (GPRs) 3, 4, 5 and 6, and returns its result in GPR 3.
So, for this case I wrote a function vpx_get4x4sse_cs_svp64() in variance_svp64_wrappers.c, which does exactly that, in the following steps:

* Sets up the Python C API for use inside C
* Constructs the pypowersim state object, with Python Objects for memory, registers, mmu, svstate, etc.
* Creates the python object arguments to be passed to the simulator as registers
* Calls the function, which, by the way, has been compiled in SVP64/LibreSOC mode by the fork of the binutils assembler that Dmitry has been working on
* This actually starts the simulator and RUNS in LibreSOC/SVP64 mode, just as if we had started the process manually!
* After a while, it completes and returns a result object, from which we read the result in the expected register (GPR 3).
* We return it to the testsuite, where it is checked against the reference value. If it is the same, our function produced the right result; if not, we keep trying until the problem is fixed!

Similarly for other functions, we pass a buffer or have a buffer returned, which means we have to copy data to/from the simulator.
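
The flow above (marshal arguments into GPRs 3-6, copy buffers into simulator memory, run, read GPR 3) can be sketched in plain C. This is only an illustration under loud assumptions: the real wrapper talks to pypowersim through the Python C API, whereas here sim_state, sim_run_sse4x4, the memory size and the two addresses are all hypothetical stand-ins, and the "simulated" routine is ordinary C rather than SVP64 assembly.

```c
#include <stdint.h>
#include <string.h>

#define SIM_MEM_SIZE 4096u
#define SRC_ADDR     0x100u  /* arbitrary addresses chosen for this sketch */
#define REF_ADDR     0x800u
#define REGION_MAX   0x700u  /* bytes reserved per buffer */

/* Hypothetical stand-in for the simulator state: GPRs plus flat memory.
 * (In the real setup this is a Python object, not a C struct.) */
typedef struct {
    uint64_t gpr[32];
    uint8_t  mem[SIM_MEM_SIZE];
} sim_state;

/* Stand-in for "run the routine in the simulator": arguments arrive in
 * GPRs 3..6 (pointers are addresses in s->mem), result leaves in GPR 3. */
static void sim_run_sse4x4(sim_state *s) {
    const uint8_t *src = s->mem + s->gpr[3];
    int src_stride = (int)s->gpr[4];
    const uint8_t *ref = s->mem + s->gpr[5];
    int ref_stride = (int)s->gpr[6];
    uint32_t sse = 0;
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
            sse += (uint32_t)(diff * diff);
        }
    }
    s->gpr[3] = sse;
}

/* Wrapper with the same shape as the native prototype.
 * Returns UINT32_MAX if the inputs do not fit this sketch's memory map. */
uint32_t sse4x4_via_sim(const uint8_t *src_ptr, int src_stride,
                        const uint8_t *ref_ptr, int ref_stride) {
    if (src_stride < 4 || ref_stride < 4)
        return UINT32_MAX;
    /* A 4x4 block spans 3 full strides plus the last 4 bytes. */
    size_t src_len = 3 * (size_t)src_stride + 4;
    size_t ref_len = 3 * (size_t)ref_stride + 4;
    if (src_len > REGION_MAX || ref_len > REGION_MAX)
        return UINT32_MAX;

    sim_state s;
    memset(&s, 0, sizeof s);

    /* Copy the input buffers into simulator memory. */
    memcpy(s.mem + SRC_ADDR, src_ptr, src_len);
    memcpy(s.mem + REF_ADDR, ref_ptr, ref_len);

    /* Place the four arguments in GPRs 3..6. */
    s.gpr[3] = SRC_ADDR;
    s.gpr[4] = (uint64_t)src_stride;
    s.gpr[5] = REF_ADDR;
    s.gpr[6] = (uint64_t)ref_stride;

    sim_run_sse4x4(&s);

    /* The return value is expected in GPR 3. */
    return (uint32_t)s.gpr[3];
}
```

A function that returns a buffer would add the mirror-image step after the run: a memcpy from s.mem back into the caller's buffer.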

The end result is that this method has proved invaluable and has sped up development by at least an order of magnitude. I plan to use the same method for all the other audio/video codecs; I'm currently working on AV1, which should be finished shortly. I've made it reusable so that it can be used in any other similar software that needs to be ported to SVP64.

Now, it would be possible to port some functions directly without this approach, but it would be a much slower process, and we would never know whether the code actually works until we tried to integrate it with the library itself and its testsuite. And we would have to wait for actual hardware for that.
Comment 7 Luke Kenneth Casson Leighton 2022-09-30 09:27:34 BST
I forgot to mention that the actual SVP64 function is vpx_get4x4sse_cs_svp64_real, which is the *actual* SVP64 assembly form of the function and lives in the file vpx_get4x4sse_cs_svp64_real.s.
The same applies to the other functions.
Comment 9 Luke Kenneth Casson Leighton 2022-10-01 02:01:08 BST
    # Load 4 elements from src_ptr and ref_ptr
    sv.lha  *src, 0(src_col)                # Load 4 ints from (src_ptr)
    sv.lha  *ref, 0(ref_col)                # Load 4 ints from (ref_ptr)

these can both be:

    sv.lha/els *src, 2(src_col)  # element-strided multiplies i by immediate

then src_col can be advanced by VL,
where VL has been put into a temp register (e.g. r30) *inside* the loop,
not outside as a static quantity:

L1:
    setvl r30, .....

OK, now you can multiply r30 by 2:

    mulli r29, r30, 2
    add src_col, src_col, r29
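
As a rough C model of that loop shape (a sketch, not the actual assembly): VL is recomputed on every iteration, the elements are read at byte offsets i*2 from the column pointer, and the pointer then moves on by VL*2 bytes. The assumption that setvl yields min(remaining, MAXVL), the MAXVL value of 4, and the names model_setvl and sum_halfwords are all mine, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAXVL 4u  /* hypothetical maximum vector length for this sketch */

/* Assumed model of setvl: VL = min(remaining elements, MAXVL). */
size_t model_setvl(size_t remaining) {
    return remaining < MAXVL ? remaining : MAXVL;
}

/* Sums n signed 16-bit elements starting at base, in chunks of VL. */
int64_t sum_halfwords(const uint8_t *base, size_t n) {
    const uint8_t *col = base;  /* plays the role of src_col */
    int64_t total = 0;
    while (n > 0) {
        size_t vl = model_setvl(n);       /* setvl inside the loop */
        for (size_t i = 0; i < vl; ++i) {
            int16_t v;                    /* lha: sign-extended halfword */
            memcpy(&v, col + i * 2, sizeof v);  /* offset = i * 2 */
            total += v;
        }
        col += vl * 2;                    /* mulli by 2, then add to src_col */
        n -= vl;
    }
    return total;
}
```

Because VL is taken inside the loop, the final partial chunk (for example 2 elements left when MAXVL is 4) advances the pointer by the right amount without a separate tail case.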