Optimizing AV1 code in dav1d with new instructions.
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b58869c4f2efc7ab4a885e3a1de39fda616ddd57 adds a horizontal-or demo, which is easily adapted to do horizontal-add or horizontal-mul (both useful in VPUs).
First working version of the AV1 assembler; it needs more work: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a65084c24742b43e79da714e5cd08f0d24a83eab
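
As a scalar reference for what these horizontal operations compute, here is a minimal C sketch. It is not taken from the linked commit; the function names are hypothetical and chosen for illustration only. The point is that the three reductions share the same loop and differ only in the combining operator and its identity value, which is why the demo adapts so easily.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helpers, not the SVP64 demo itself.
 * Each one folds all n elements of v into a single scalar. */

/* Horizontal OR: identity is 0, combine with bitwise OR. */
static uint64_t horizontal_or(const uint64_t *v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= v[i];
    return acc;
}

/* Horizontal add: identity is 0, combine with addition (wraps mod 2^64). */
static uint64_t horizontal_add(const uint64_t *v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Horizontal mul: identity is 1, combine with multiplication (wraps mod 2^64). */
static uint64_t horizontal_mul(const uint64_t *v, size_t n)
{
    uint64_t acc = 1;
    for (size_t i = 0; i < n; i++)
        acc *= v[i];
    return acc;
}
```

For the vector {3, 5, 6, 2} these return 7, 16 and 180 respectively.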
Using a method similar to the VP9 investigation, we wrote an SVP64 implementation of dav1d's cdef_find_dir function, which is defined in src/cdef_tmpl.c. The SVP64 function demonstrates using all the available registers to minimize loads (unfortunately we cannot do zero-loads at the moment, but we will be able to once elwidth/subvl are fully operational). The function loads an 8x8 array of pixels and processes it in multiple ways: horizontally, vertically and along diagonals (normal and slanted), producing a "cost" array of 8 elements.

The results of the C reference function and the SVP64 implementation are exactly the same:

C ref: 04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b

SVP64 (register dump): reg 24 04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b

As a future improvement we could adopt elwidth=16 packed loads to reduce the number of registers used even further, so that the whole processing is done without a single memory access apart from the initial buffer load. This implementation demonstrates how complicated algorithms can be optimized with SVP64 and how the abundance of registers can almost eliminate memory access.
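
To illustrate the kind of computation described above, here is a simplified C sketch of directional costs over an 8x8 block. It is an illustration under stated assumptions, not dav1d's cdef_find_dir: it covers only four directions (rows, columns and the two 45-degree diagonals), omits the slanted directions, and the function name, the 128 offset and the 840/length weighting are choices made for this sketch. The weight 840 is the least common multiple of 1..8, so lines of different lengths are normalised with integer arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

#define BLK 8

/* Hypothetical helper (not the dav1d function): fills cost[0..3].
 * cost[0]: diagonals with x + y constant
 * cost[1]: horizontal rows
 * cost[2]: diagonals with x - y constant
 * cost[3]: vertical columns
 * A larger cost means the pixels along that direction are more alike. */
static void sketch_dir_costs(const uint8_t *img, ptrdiff_t stride,
                             uint32_t cost[4])
{
    int32_t row[BLK] = {0}, col[BLK] = {0};
    int32_t diag[2 * BLK - 1] = {0}, anti[2 * BLK - 1] = {0};

    /* One pass over the block: every pixel feeds four partial sums. */
    for (int y = 0; y < BLK; y++) {
        for (int x = 0; x < BLK; x++) {
            const int32_t px = (int32_t)img[y * stride + x] - 128;
            row[y] += px;
            col[x] += px;
            diag[y + x] += px;
            anti[x - y + (BLK - 1)] += px;
        }
    }

    cost[0] = cost[1] = cost[2] = cost[3] = 0;

    /* Rows and columns all have length 8, so the weight is 840 / 8 = 105. */
    for (int n = 0; n < BLK; n++) {
        cost[1] += (uint32_t)(row[n] * row[n]) * 105u;
        cost[3] += (uint32_t)(col[n] * col[n]) * 105u;
    }

    /* Diagonal lines have lengths 1..8..1; weight each by 840 / length. */
    for (int n = 0; n < 2 * BLK - 1; n++) {
        const uint32_t len = (uint32_t)(n < BLK ? n + 1 : 2 * BLK - 1 - n);
        const uint32_t w = 840u / len;
        cost[0] += (uint32_t)(diag[n] * diag[n]) * w;
        cost[2] += (uint32_t)(anti[n] * anti[n]) * w;
    }
}
```

All partial sums live in local arrays after the single pass over the block, which mirrors the idea in the text of keeping the working set in registers and touching memory only for the initial load.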
File with actual SVP64 implementation: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/video/av1/src/ppc/cdef_tmpl_svp64_real.s;hb=HEAD
Original C function: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/video/av1/src/cdef_tmpl.c;hb=HEAD
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=380d9fc5bb078c313dc0d7dc4fcfcef63a990ca2 implements max-of-array plus index-of-the-last-max-element. For max-of-array plus index-of-the-first-max-element a small tweak will be needed: make sure the cmp does not activate when the element being compared is equal to the current maximum.
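
The difference between the two variants comes down to a single comparison, as this scalar C sketch shows. It is not the code from the linked commit; the function names are hypothetical, and both functions assume n >= 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, assuming n >= 1.
 * Returns the index of the LAST element equal to the maximum:
 * ">=" lets a later, equal element replace the current best. */
static size_t argmax_last(const uint32_t *v, size_t n, uint32_t *max_out)
{
    size_t idx = 0;
    uint32_t best = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] >= best) {
            best = v[i];
            idx = i;
        }
    }
    *max_out = best;
    return idx;
}

/* Returns the index of the FIRST element equal to the maximum:
 * the strict ">" does not fire on equal elements, so the earliest wins. */
static size_t argmax_first(const uint32_t *v, size_t n, uint32_t *max_out)
{
    size_t idx = 0;
    uint32_t best = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] > best) {
            best = v[i];
            idx = i;
        }
    }
    *max_out = best;
    return idx;
}
```

For the array {3, 9, 2, 9, 1} both report a maximum of 9; argmax_last returns index 3 and argmax_first returns index 1.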
dav1d's ARM64 assembly version of cdef_tmpl, for reference: https://code.videolan.org/videolan/dav1d/-/blob/master/src/arm/64/cdef_tmpl.S