Bug 229 - AV1 optimizations
Summary: AV1 optimizations
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Konstantinos Margaritis
Depends on:
Blocks: 952 137
Reported: 2020-03-13 09:59 GMT by cand
Modified: 2022-10-25 11:01 BST
CC List: 2 users

See Also:
NLnet milestone: NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks: 4000
budget (EUR) for this task, excluding subtasks' budget: 4000
parent task for budget allocation: 137
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
markos={amount=3200, submitted=2022-10-14, paid=2022-10-20}
lkcl={amount=800, submitted=2022-10-14, paid=2022-10-20}


Description cand 2020-03-13 09:59:06 GMT
Optimizing AV1 code in dav1d with new instructions.
Comment 1 Luke Kenneth Casson Leighton 2022-09-28 18:14:58 BST

added a horizontal-or demo which is easily adapted to do
horizontal-add or horizontal-mul (both useful in VPUs)
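
For readers unfamiliar with the term, a horizontal operation folds all elements of one vector into a single scalar. The sketch below shows only those scalar semantics in Python; it is not the SVP64 demo itself, and the function names are made up for illustration.

```python
from functools import reduce
import operator


def horizontal_reduce(op, elements):
    """Fold a non-empty vector down to one scalar using a binary operator."""
    return reduce(op, elements)


def horizontal_or(elements):
    # Bitwise OR across all elements.
    return horizontal_reduce(operator.or_, elements)


def horizontal_add(elements):
    # Same fold, with addition swapped in.
    return horizontal_reduce(operator.add, elements)


def horizontal_mul(elements):
    # Same fold, with multiplication swapped in.
    return horizontal_reduce(operator.mul, elements)
```

Only the operator changes between the three variants, which is why adapting one into the others is a small change.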
Comment 2 Luke Kenneth Casson Leighton 2022-10-14 11:30:19 BST
first working version of AV1 assembler.
needs more work.

Comment 3 Konstantinos Margaritis 2022-10-14 11:44:09 BST
Using a method similar to the VP9 investigation, we wrote an SVP64 implementation of dav1d's cdef_find_dir function, which is included in src/cdef_tmpl.c.

The SVP64 function demonstrates using all the available registers to minimize loads (unfortunately we cannot do zero-loads at the moment, but we will be able to once elwidth/subvl are fully operational). The function loads an 8x8 array of pixels and processes it in several ways: horizontally, vertically, and along diagonals (normal and slanted), producing a "cost" array of 8 elements. The results from the C reference function and the SVP64 version are exactly the same:

C ref:
04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
SVP64 (register dump):
reg 24 04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
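
To illustrate the kind of processing described above, here is a simplified Python sketch of a per-direction cost calculation over an 8x8 block. It is not dav1d's code and not the SVP64 version: it covers only four directions (rows, columns and the two 45-degree diagonals) and leaves out the slanted ones, it assumes 8-bit pixels, and all names are hypothetical. Each line's squared sum is weighted by 840 divided by the line's pixel count, so lines of different lengths contribute on the same scale.

```python
def direction_costs(block):
    """Return a cost per direction for an 8x8 block of 8-bit pixels.

    Simplified illustration: four directions only, hypothetical names.
    """
    assert len(block) == 8 and all(len(row) == 8 for row in block)

    rows = [0] * 8    # one partial sum per row
    cols = [0] * 8    # one partial sum per column
    diag = [0] * 15   # lines where y + x is constant
    anti = [0] * 15   # lines where y - x is constant

    for y in range(8):
        for x in range(8):
            px = block[y][x] - 128  # centre 8-bit pixels around zero
            rows[y] += px
            cols[x] += px
            diag[y + x] += px
            anti[7 + y - x] += px

    def weighted(sums, lengths):
        # 840 is divisible by every length from 1 to 8, so the weights are exact.
        return sum(s * s * (840 // n) for s, n in zip(sums, lengths))

    full_len = [8] * 8
    diag_len = [min(k + 1, 15 - k) for k in range(15)]  # 1, 2, ..., 8, ..., 2, 1

    return {
        "rows": weighted(rows, full_len),
        "cols": weighted(cols, full_len),
        "diag": weighted(diag, diag_len),
        "anti": weighted(anti, diag_len),
    }
```

With this weighting a completely flat block scores the same in every direction, while a block of horizontal stripes scores highest for "rows".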

As a future improvement we could adopt elwidth=16 packed loads to reduce the number of registers used even further, so that the whole processing can be done without a single memory access apart from the initial buffer load!
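
As a rough picture of what packing buys, the Python sketch below models four unsigned 16-bit elements sharing one 64-bit register. The lane order (element 0 in the lowest 16 bits) and the function names are assumptions made for illustration, not a statement of how SVP64 elwidth overrides are implemented.

```python
def pack16(elements):
    """Pack four unsigned 16-bit values into one 64-bit integer.

    Assumed layout: element 0 occupies the lowest 16 bits.
    """
    assert len(elements) == 4
    word = 0
    for i, e in enumerate(elements):
        assert 0 <= e <= 0xFFFF
        word |= e << (16 * i)
    return word


def unpack16(word):
    """Split a 64-bit integer back into four unsigned 16-bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]
```

At four elements per register, the 64 values of an 8x8 block would occupy 16 registers instead of 64.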

This implementation demonstrates how complicated algorithms can be optimized with SVP64 and how the abundance of registers can almost eliminate memory access.
Comment 5 Luke Kenneth Casson Leighton 2022-10-14 15:02:17 BST


for max-of-array-plus-index-of-first a small tweak
will be needed, to make sure the cmp doesn't activate
when the element-being-compared is equal.
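
A minimal Python sketch of the behaviour wanted here, with a hypothetical function name: a strict greater-than compare means an equal element never replaces the current best, so the index returned is that of the first maximum.

```python
def max_and_first_index(values):
    """Return (maximum, index of its first occurrence) for a non-empty list."""
    best_idx = 0
    best = values[0]
    for i in range(1, len(values)):
        # Strict '>' so that a tie leaves the earlier index in place.
        if values[i] > best:
            best = values[i]
            best_idx = i
    return best, best_idx
```

Applied to the cost array quoted in comment 3, this gives 0x05cf5742 at index 1.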