Bug 137 - NLNet 2019 Video Acceleration Proposal 2019-10-031
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Milestones
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-riscv.org/vpu/ https://...
Depends on: 159 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
Blocks:
Reported: 2019-09-23 09:36 BST by Luke Kenneth Casson Leighton
Modified: 2020-04-04 11:57 BST (History)

See Also:
NLnet milestone: NLNet.2019.Video
total budget (EUR) for completion of task and all subtasks: 50000
budget (EUR) for completion of task (excludes budget allocated to subtasks): 0
parent task for budget allocation:
child tasks for budget allocation: 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235


Description Luke Kenneth Casson Leighton 2019-09-23 09:36:54 BST
Add video acceleration to the Libre RISC-V SoC, with the optimisations
upstreamed to ffmpeg, gstreamer, libswscale, libh264, libh265 and other libraries.
https://libre-riscv.org/nlnet_2019_video/

https://libre-riscv.org/vpu/

Audio
* bug #218, MP3
* bug #219, AC3
* bug #220, Vorbis
* bug #221, Opus

Video
* bug #222, MJPEG (JPEG)
* bug #223, MPEG1/2
* bug #224, MPEG4 ASP (xvid)
* bug #225, H.264
* bug #226, H.265
* bug #227, VP8
* bug #228, VP9
* bug #229, AV1

Opcodes: bug #234 implement opcodes in hardware
* rgb/bgr24 (TBD in 3D GPU or in this one?)
* rgbx/bgrx/xrgb/xbgr32 (TBD in 3D GPU or in this one?)
* nv12 (TBD in 3D GPU or in this one?)
* nv21 (TBD in 3D GPU or in this one?)

Simulator
* bug #230 discuss and add opcode(s) proposed by lauri
* bug #233 set up unit tests for opcodes under simulator

Standards Documentation: bug #231
* write up all opcodes (related to #230) as formal standards


note: this is where the iterative loop comes in.  there will be several rounds adding different opcodes to try out

FPGA - bug #235
* run unit tests under FPGA
* run full OS (VLC?) demo under FPGA

todo, edit this comment and list a series of tasks to assign budgets to.  then, create bugreports for each.  see bug #48 for a template

TODO, subdivide these down into smaller tasks (discuss below) so that reasonably accurate budgetary amounts can be assigned to them.  slight overestimation (10 to 15% or so) is recommended (and acceptable).
Comment 1 cand 2020-01-24 08:27:26 GMT
https://libre-riscv.org/vpu/

Audio
* 2 weeks MP3 EUR 750
* 2 weeks AC3 EUR 750
* 2 weeks Vorbis EUR 750
* 2 weeks Opus EUR 750

Video
* 4 weeks MJPEG (JPEG) EUR 1500
* 4 weeks MPEG1/2 EUR 1500
* 5 weeks MPEG4 ASP (xvid) EUR 2000
* 8 weeks H.264 EUR 3000
* 10 weeks H.265 EUR 4000
* 8 weeks VP8 EUR 3000
* 8 weeks VP9 EUR 3000
* 10 weeks AV1 EUR 4000

Total EUR 25000

* Opcodes development and discussion: EUR 4000
* Opcodes Standards writeup: EUR 2000
* Implementation of opcodes in simulator: EUR 5000
* Unit tests in simulator: EUR 3000
* Hardware implementation: EUR 9000
* FPGA tests: EUR 2000
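
As a quick sanity check on the figures above, the per-item amounts can be summed. This is only a sketch: the amounts are copied from the lists in this comment, and the variable names are illustrative rather than part of the proposal.

```python
# Budget figures (EUR) copied from the lists above.
# All names here are illustrative, not part of the proposal.
audio = {"MP3": 750, "AC3": 750, "Vorbis": 750, "Opus": 750}

video = {
    "MJPEG (JPEG)": 1500,
    "MPEG1/2": 1500,
    "MPEG4 ASP (xvid)": 2000,
    "H.264": 3000,
    "H.265": 4000,
    "VP8": 3000,
    "VP9": 3000,
    "AV1": 4000,
}

opcodes_and_hw = {
    "Opcodes development and discussion": 4000,
    "Opcodes Standards writeup": 2000,
    "Implementation of opcodes in simulator": 5000,
    "Unit tests in simulator": 3000,
    "Hardware implementation": 9000,
    "FPGA tests": 2000,
}

codec_total = sum(audio.values()) + sum(video.values())
other_total = sum(opcodes_and_hw.values())
grand_total = codec_total + other_total

print(codec_total, other_total, grand_total)  # 25000 25000 50000
```

The grand total matches the EUR 50000 recorded against this bug as the budget for the task and all subtasks.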
Comment 2 cand 2020-01-24 08:48:06 GMT
Each codec then has these phases:
- research
- for each hotspot, implementation
- for each target library, upstreaming

HW implementations of new instructions would be later, once the instructions are known.
Comment 3 Luke Kenneth Casson Leighton 2020-01-24 09:07:07 GMT
(In reply to cand from comment #2)
> Each codec then has these phases:
> - research
> - for each hotspot, implementation
> - for each target library, upstreaming

ok great, do you have an estimation of time (and budget you'd like to receive) for each? 1 week research, 2 week impl, 3 day upstream coordination, that sort of thing?

we can subdivide later (3 subbugs per each top bug) if you would like part-payment however that is for later.

the focus now is to identify toplevel and assign budgets. 

> HW implementations of new instructions would be later, once the instructions
> are known.

yes.  or, more to the point, you advise us what you would like, then we implement them in a simulator (we also have to budget for how to run code under it, btw - it may be that we only run a subset of the code, say, only the algorithm or a unit test rather than full VLC or something)

then after the cycles/sec is confirmed *then* we implement that opcode in hw and finally actually run under an FPGA.  this will be much later, at the end of the process.
Comment 4 cand 2020-01-24 09:22:26 GMT
Each codec is of different complexity. The audio codecs usually only have a single hotspot, while at the other end AV1 has several dozen. I'll do a quick pass later, to get rough figures on those.

I thought the simulator would be part of the implementation loop?
Comment 5 Luke Kenneth Casson Leighton 2020-01-24 10:06:38 GMT
(In reply to cand from comment #4)
> Each codec is of different complexity. The audio codecs usually only have a
> single hotspot, while at the other end AV1 has several dozen.

thought so.

> I'll do a
> quick pass later, to get rough figures on those.

great.
 
> I thought the simulator would be part of the implementation loop?

hmmm yes, however think about it: several CODECs will share the same opcodes.  you don't make a YUV2RGB opcode for VP9 and a different one for MPEG :)

so i was kinda leaning towards them being on their own (aggregated) iterative cycle, if you know what i mean.

if we can get a rough idea in advance of the sorts of opcodes needed, bear in mind that for the most part they need to be "scalar" in nature because the Vector System adds that hardware-for-loop on top *of* scalar operations, it would be very handy.

then those can also be analysed as to a simulation implementation timescale and hw timescale and budget as well.

we are not going to be able to predict exactly everything here, that is what the iterations are for.  we just need a start.
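
To illustrate the "scalar opcode plus hardware for-loop" idea above, here is a rough Python model. Everything in it is an assumption made for illustration: `yuv2rgb_scalar` and `vector_loop` are hypothetical names, the coefficients are the common full-range BT.601 (JFIF) ones, and none of this describes the semantics of an actual proposed opcode.

```python
def clamp8(x):
    """Round and saturate a value to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(x))))


def yuv2rgb_scalar(y, cb, cr):
    """Hypothetical scalar operation: convert ONE pixel.

    Coefficients are the full-range BT.601 (JFIF) ones, chosen here
    purely as an assumption for the sketch.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def vector_loop(scalar_op, *operand_vectors):
    """Software model of the 'hardware for-loop': apply the same
    scalar operation to each element of the operand vectors in turn."""
    return [scalar_op(*elems) for elems in zip(*operand_vectors)]


# Three grey pixels (Cb = Cr = 128), so each output is (Y, Y, Y).
pixels = vector_loop(yuv2rgb_scalar, [16, 128, 235], [128, 128, 128], [128, 128, 128])
print(pixels)  # [(16, 16, 16), (128, 128, 128), (235, 235, 235)]
```

The point of the sketch is that only the per-element operation needs defining; the same operation then serves any codec that needs the conversion, whichever vector length the loop runs over.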
Comment 6 cand 2020-01-24 11:23:26 GMT
Weren't the colorspace conversions part of the GPU milestone? That's what I understood from the ML earlier.
Comment 7 Luke Kenneth Casson Leighton 2020-01-24 11:29:50 GMT
(In reply to cand from comment #6)
> Weren't the colorspace conversions part of the GPU milestone? That's what I
> understood from the ML earlier.

yes good point, so we need to make sure not to double-allocate budget.
Comment 8 cand 2020-01-24 19:49:17 GMT
Rough relative complexities:

Codec                   Weight  Share
MP3                     1       1%
AC3                     1       1%
Vorbis                  1       1%
Opus                    1       1%

MJPEG (JPEG)            2       2%
MPEG1/2                 2       2%
MPEG4 ASP (xvid)        4       5%
H.264                   10      11%
H.265                   20      23%
VP8                     8       9%
VP9                     10      11%
AV1                     28      32%

This doesn't translate well to budget though, no sense in spending a third on AV1. Perhaps a more sensible goal would be to target the largest hot spots of each, with only smaller budget differences due to complexity.
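
For reference, the percentages in the table follow from the relative weights: each is the weight divided by the sum of all weights, rounded to the nearest whole percent. A minimal sketch, with illustrative variable names:

```python
# Relative complexity weights copied from the table above.
weights = {
    "MP3": 1, "AC3": 1, "Vorbis": 1, "Opus": 1,
    "MJPEG (JPEG)": 2, "MPEG1/2": 2, "MPEG4 ASP (xvid)": 4,
    "H.264": 10, "H.265": 20, "VP8": 8, "VP9": 10, "AV1": 28,
}

total = sum(weights.values())  # 88

# Share of the total, rounded to the nearest whole percent.
shares = {name: round(100 * w / total) for name, w in weights.items()}

for name, pct in shares.items():
    print(f"{name:<20} {weights[name]:>3} {pct:>3}%")
```

Because each share is rounded independently, the rounded percentages add up to 99 rather than 100.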

Another point to consider is that while ffmpeg is the prime lib, parts of accel code made for ffmpeg aren't really usable in the various standalone libs. Different structures, etc. In order to not write things twice, some decisions need to be made on which upstreams particularly matter.
Comment 9 Luke Kenneth Casson Leighton 2020-01-25 11:30:49 GMT
(In reply to cand from comment #8)

> This doesn't translate well to budget though, no sense in spending a third
> on AV1. Perhaps a more sensible goal would be to target the largest hot
> spots of each, with only smaller budget differences due to complexity.

yes.  and, during later iterations, do some more.
 
> Another point to consider is that while ffmpeg is the prime lib, parts of
> accel code made for ffmpeg aren't really usable in the various standalone
> libs. Different structures, etc. In order to not write things twice, some
> decisions need to be made on which upstreams particularly matter.

well, ultimately, gstreamer has an ffmpeg plugin, ffmpeg has a gstreamer plugin, vdpau has a vaa plugin, vaa has a vdpau plugin, it's all circular [1] and up its own backside [2], so whichever we pick is good :)

whichever route would be easiest for you, let's go with that.

[1] yes i managed to install both vdpau and vaa recursively, once, whoops...
[2] the beatles "yellow submarine" film demonstrates this well
Comment 10 cand 2020-01-25 18:24:13 GMT
Okay, then I'd say ffmpeg for everything else except av1 (dav1d) and jpeg (libjpeg-turbo).

Time and budget, your earlier comment on 1 week research, 2 week impl, 3 day upstream coordination is fairly on point, for one hotspot (or a couple smaller ones). For the later iterations only the impl phase would be budgeted.

I'd say 400e/wk, so 400 for research, 800 for one impl iteration, and 240 for the upstream part. I don't know how difficult the fpga side is, how much should be budgeted for that; IIRC you also said the entire amount should be used this year, or it'd be lost. Starting point for discussion anyway.
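
The per-hotspot arithmetic above can be written out as follows. This is a sketch only: it assumes a five-day working week (which is what makes 3 days come to 240), and the function name is illustrative.

```python
RATE_PER_WEEK = 400          # EUR, from the comment above
WORKING_DAYS_PER_WEEK = 5    # assumption: makes 3 days of upstreaming come to 240

research = 1 * RATE_PER_WEEK                          # 1 week  -> 400
impl = 2 * RATE_PER_WEEK                              # 2 weeks -> 800
upstream = RATE_PER_WEEK * 3 // WORKING_DAYS_PER_WEEK # 3 days  -> 240


def hotspot_cost(impl_iterations=1):
    """Cost of one hotspot: research and upstreaming are budgeted once,
    and each iteration adds only another implementation phase."""
    return research + upstream + impl * impl_iterations


print(hotspot_cost(1))  # 1440
print(hotspot_cost(3))  # 3040
```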
Comment 11 Luke Kenneth Casson Leighton 2020-02-23 18:35:24 GMT
> should be budgeted for that; IIRC you also said the entire amount should be
> used this year, or it'd be lost. Starting point for discussion anyway.

we have until mid 2021 so not as heavy there.
Comment 12 Luke Kenneth Casson Leighton 2020-02-23 18:52:38 GMT
lauri do the budgets look reasonable?
l.
Comment 13 cand 2020-02-23 18:58:46 GMT
Sorry, which ones?
Comment 14 cand 2020-02-23 19:09:00 GMT
Oh, you edited comment #1, emails don't go out for edits so didn't see it at first. Yes, they look ok, other than being off by 1k (51k total).
Comment 15 Luke Kenneth Casson Leighton 2020-02-23 19:43:16 GMT
adjusted thx. alain this one needs a writeup too when the time is right.
similar to http://bugs.libre-riscv.org/show_bug.cgi?id=158#c4
except we need to create the individual bugreports first (all 17 of them)
Comment 16 cand 2020-03-13 10:16:31 GMT
# Schedule A to be attached to MoU

List of tasks, plus description, bugtracker URL and budget

# MP3 optimizations
Optimizing MP3 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=218
Budget: EUR 750

# AC3 optimizations
Optimizing AC3 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=219
Budget: EUR 750

# Vorbis optimizations
Optimizing Vorbis code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=220
Budget: EUR 750

# Opus optimizations
Optimizing Opus code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=221
Budget: EUR 750

# JPEG optimizations
Optimizing JPEG code in libjpeg-turbo with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=222
Budget: EUR 1500

# MPEG1/2 optimizations
Optimizing MPEG1/2 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=223
Budget: EUR 1500

# MPEG4 ASP optimizations
Optimizing MPEG4 ASP (xvid) code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=224
Budget: EUR 2000

# H.264 optimizations
Optimizing H.264 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=225
Budget: EUR 3000

# H.265 optimizations
Optimizing H.265 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=226
Budget: EUR 4000

# VP8 optimizations
Optimizing VP8 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=227
Budget: EUR 3000

# VP9 optimizations
Optimizing VP9 code in ffmpeg with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=228
Budget: EUR 3000

# AV1 optimizations
Optimizing AV1 code in dav1d with new instructions.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=229
Budget: EUR 4000

# Video opcode development and discussion
Video opcode development and discussion is needed, as well as research
and informal write-up.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=230
Budget: EUR 4000

# Video Opcodes Standards "Formal" writeup
Video Opcodes Standards writeup is required, to a level that is acceptable
for formal proposal to the OpenPOWER Foundation.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=231
Budget: EUR 2000

# Implementation of video opcodes in simulator
Implementation of video opcodes in simulator is needed, so that the
effectiveness of the opcodes can be tested prior to implementing them
in hardware (which simulates 10,000 to 100,000 times slower)
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=232
Budget: EUR 5000

# Audio and Video unit tests in simulator
Audio and Video unit tests are needed, to be run in the simulator.
These are not the full GUI, just the core algorithm.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=233
Budget: EUR 3000

# Hardware implementation of video opcodes
Hardware implementation of video opcodes is needed, implementing
the instructions that were demonstrated to be effective from earlier
(software) simulations.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=234
Budget: EUR 9000

# Video opcode FPGA tests
Video opcode FPGA tests are needed, demonstrating the correctness
of the hardware implementation of the opcodes.
URL: http://bugs.libre-riscv.org/show_bug.cgi?id=235
Budget: EUR 2000
Comment 17 Luke Kenneth Casson Leighton 2020-04-04 11:57:23 BST
Summary sentence for MoU

Video acceleration is a necessary component for any modern CPU and GPU. Given the large amount of time the typical user spends on video and its applications, such as videoconferencing, a performant and power-efficient implementation is necessary for wide adoption. With Zoom (etc.) now being critical to our modern life, and full of security holes, full transparency in the video encode/decode algorithms is more important than ever.