229 – AV1 optimizations

Summary: AV1 optimizations

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Konstantinos Margaritis (markos)

URL:

Depends on:
Blocks:	952 137

Reported:	2020-03-13 09:59 GMT by cand
Modified:	2023-06-13 20:11 BST
CC List:	2 users

See Also:	230
NLnet milestone:	NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks:	4000
budget (EUR) for this task, excluding subtasks' budget:	4000
parent task for budget allocation:	137
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
markos={amount=3200, submitted=2022-10-14, paid=2022-10-20}
lkcl={amount=800, submitted=2022-10-14, paid=2022-10-20}

Description cand 2020-03-13 09:59:06 GMT

Optimizing AV1 code in dav1d with new instructions.

Comment 1 Luke Kenneth Casson Leighton 2022-09-28 18:14:58 BST

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b58869c4f2efc7ab4a885e3a1de39fda616ddd57

added a horizontal-or demo which is easily adapted to do
horizontal-add or horizontal-mul (both useful in VPUs)
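A rough sketch of why the adaptation is easy: a horizontal operation folds every element of a vector into one scalar, and only the binary operator changes between the or/add/mul variants. This is plain Python for illustration, not SVP64 and not the code from the commit above; the helper name `horizontal` is hypothetical.

```python
from functools import reduce
import operator


def horizontal(op, vec):
    """Fold all elements of vec into a single scalar with the binary operator op."""
    return reduce(op, vec)


vec = [3, 5, 6, 8]
h_or = horizontal(operator.or_, vec)   # 3 | 5 | 6 | 8
h_add = horizontal(operator.add, vec)  # 3 + 5 + 6 + 8
h_mul = horizontal(operator.mul, vec)  # 3 * 5 * 6 * 8
print(h_or, h_add, h_mul)
```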

Comment 2 Luke Kenneth Casson Leighton 2022-10-14 11:30:19 BST

first working version of AV1 assembler.
needs more work.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a65084c24742b43e79da714e5cd08f0d24a83eab

Comment 3 Konstantinos Margaritis (markos) 2022-10-14 11:44:09 BST

Using a method similar to the one in the VP9 investigation, we wrote an SVP64 implementation of dav1d's cdef_find_dir function, whose C reference is in src/cdef_tmpl.c.

The SVP64 function demonstrates using all the available registers to minimize loads (unfortunately we cannot do zero-loads at the moment, but we will be able to once elwidth/subvl are fully operational). The function loads an 8x8 array of pixels and processes it along several directions: horizontal, vertical, and diagonal (both normal and slanted), producing a "cost" array of 8 elements. The C reference function and the SVP64 implementation produce exactly the same results:

C ref:
04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
SVP64 (register dump):
reg 24 04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
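The directional processing described above can be sketched in a few lines of Python. This is a simplified model, not the dav1d function or the SVP64 assembler: it assumes 8-bit pixels centred by subtracting 128, the per-direction line mappings are my reading of the reference C and should be treated as an assumption, and the names `LINE_INDEX`, `direction_costs` and `best_direction` are hypothetical. Each pixel is assigned to one line per direction, the pixels on each line are summed, and the squared sums are weighted by 840 divided by the number of pixels on the line so that lines of different length are comparable.

```python
SIZE = 8

# For each of the 8 directions: which line does pixel (x, y) belong to?
# Assumed mapping, following the structure of the reference C function.
LINE_INDEX = [
    lambda x, y: y + x,             # 0: diagonal
    lambda x, y: y + (x >> 1),      # 1: slanted
    lambda x, y: y,                 # 2: one sum per row
    lambda x, y: 3 + y - (x >> 1),  # 3: slanted
    lambda x, y: 7 + y - x,         # 4: opposite diagonal
    lambda x, y: 3 - (y >> 1) + x,  # 5: slanted
    lambda x, y: x,                 # 6: one sum per column
    lambda x, y: (y >> 1) + x,      # 7: slanted
]


def direction_costs(block):
    """Return the 8-element cost array for an 8x8 block of 8-bit pixels."""
    costs = []
    for line_of in LINE_INDEX:
        sums, counts = {}, {}
        for y in range(SIZE):
            for x in range(SIZE):
                px = block[y][x] - 128  # centre 8-bit pixels around zero
                k = line_of(x, y)
                sums[k] = sums.get(k, 0) + px
                counts[k] = counts.get(k, 0) + 1
        # Line lengths are between 1 and 8, and all of them divide 840 exactly.
        costs.append(sum(s * s * 840 // counts[k] for k, s in sums.items()))
    return costs


def best_direction(costs):
    """Index of the first largest cost."""
    return max(range(len(costs)), key=lambda d: costs[d])


# Horizontal stripes: every row is constant, rows differ from each other.
stripes = [[16 * y] * SIZE for y in range(SIZE)]
print(best_direction(direction_costs(stripes)))
```

Every pixel is read once per direction here; the register-heavy SVP64 version aims to keep the block in registers so that these repeated passes do not touch memory.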

As a future improvement we could adopt elwidth=16 packed loads, which would reduce the number of registers used even further and let us do the whole processing without a single memory access apart from the initial buffer load.
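To illustrate the register saving from packing, here is a minimal Python sketch that stores four 16-bit elements in one 64-bit word, lowest lane first. It models only the arithmetic of packing; it is not SVP64 code, the lane ordering is an assumption, and the helper names `pack16` and `unpack16` are hypothetical.

```python
MASK16 = 0xFFFF
LANES = 4  # four 16-bit elements fit in one 64-bit register


def pack16(elems):
    """Pack exactly four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    if len(elems) != LANES:
        raise ValueError("expected exactly four elements")
    word = 0
    for lane, e in enumerate(elems):
        word |= (e & MASK16) << (16 * lane)
    return word


def unpack16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (16 * lane)) & MASK16 for lane in range(LANES)]


# An 8x8 block holds 64 pixels: one register each unpacked, four per register packed.
pixels = list(range(64))
regs = [pack16(pixels[i:i + LANES]) for i in range(0, len(pixels), LANES)]
print(len(pixels), "unpacked registers vs", len(regs), "packed registers")
```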

This implementation demonstrates how complicated algorithms can be optimized with SVP64 and how the abundance of registers can almost eliminate memory access.

Comment 4 Konstantinos Margaritis (markos) 2022-10-14 12:03:06 BST

File with actual SVP64 implementation:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/video/av1/src/ppc/cdef_tmpl_svp64_real.s;hb=HEAD

Original C function:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/video/av1/src/cdef_tmpl.c;hb=HEAD

Comment 5 Luke Kenneth Casson Leighton 2022-10-14 15:02:17 BST

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=380d9fc5bb078c313dc0d7dc4fcfcef63a990ca2

This computes the max of an array plus the index of the last max element.

To get the max of an array plus the index of the first max element, a small tweak
will be needed: the cmp must not activate when the element being compared is
equal to the current max.
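The difference between the two variants comes down to whether the comparison also fires on equality. A minimal Python sketch, not the SVP64 code from the commit, with hypothetical function names:

```python
def max_with_last_index(values):
    """Return (max, index of the LAST element equal to the max)."""
    best, idx = values[0], 0
    for i in range(1, len(values)):
        if values[i] >= best:  # also fires on equality, so later ties win
            best, idx = values[i], i
    return best, idx


def max_with_first_index(values):
    """Return (max, index of the FIRST element equal to the max)."""
    best, idx = values[0], 0
    for i in range(1, len(values)):
        if values[i] > best:  # does not fire on equality, so the first tie wins
            best, idx = values[i], i
    return best, idx


data = [3, 9, 2, 9, 1]
print(max_with_last_index(data))
print(max_with_first_index(data))
```

Both functions assume a non-empty input.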

Comment 6 Luke Kenneth Casson Leighton 2023-06-13 20:11:20 BST

https://code.videolan.org/videolan/dav1d/-/blob/master/src/arm/64/cdef_tmpl.S