Bug 91 - Design and implement texturing opcodes for 3D graphics
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on:
Blocks:
 
Reported: 2019-06-05 01:46 BST by Luke Kenneth Casson Leighton
Modified: 2020-12-15 18:53 GMT

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2019-06-05 01:46:04 BST
Texture maps need to be addressed by FP numbers and interpolated.

The maps are big and regularly arranged, so the load they place on memory is best handled by a separate instruction path.

Also, because interpolation is involved, texture lookups are best done as special instructions that perform the interpolation without extra CPU load.
Comment 1 Luke Kenneth Casson Leighton 2019-06-05 01:51:34 BST
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001657.html

Background: earlier in the thread there is discussion of how LD/ST memory AGEN misses are affected by the regularity of the texture maps.

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001629.html

(Edit 12aug2019) extra context:
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002469.html
Comment 2 Jacob Lifshay 2019-06-05 06:09:01 BST
This is one of the bigger tasks (several months of work) due to the sheer number of different texture formats that need to be implemented, so it will definitely need some budget, which I'll let Luke allocate.

Since I'm planning on implementing all the required formats in Rust as part of the Vulkan driver (since they need to work on x86 as well), we can split that out into a separate library to use for implementing the HW decoding as well. I think I could write it in such a way that we can generate both nmigen and LLVM IR from the same code. Additionally, the Rust code can be used as a Python library for testing purposes.

I think we should support YCbCr textures, since they are commonly used in video formats.

We will need to decide which compressed texture formats to support. Vulkan requires supporting at least one of these three:

* BC formats (also called S3TC and DXTn/DXTC)
    common on desktops
    effectively required for OpenGL
    oldest; I think the patents expired in 2018
* ETC2 and EAC formats 
    required for OpenGL
    not patented
    http://www.phoronix.com/vr.php?view=MTE1ODU
* ASTC LDR formats
    common on mobile devices
    best compression ratios
    royalty free
    http://www.phoronix.com/vr.php?view=MTE1NDk

ASTC also has support for HDR formats.

If we can, I'd like to support all four.
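
To give a feel for the per-format decode work, here is a minimal Python sketch of decoding one BC1 (DXT1) block. It is an illustration only, not code from the planned Rust library. The function names are made up for this example, and the palette interpolation uses plain integer division, whereas real decoders may round slightly differently.

```python
import struct


def rgb565_to_rgb888(c):
    """Expand a 16-bit RGB565 colour to 8 bits per channel by bit replication."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def decode_bc1_block(block):
    """Decode one 8-byte BC1 block into 4 rows of 4 (r, g, b, a) tuples."""
    if len(block) != 8:
        raise ValueError("a BC1 block is exactly 8 bytes")
    # Two little-endian RGB565 endpoints, then 16 two-bit palette indices.
    c0, c1, indices = struct.unpack("<HHI", block)
    p0 = rgb565_to_rgb888(c0)
    p1 = rgb565_to_rgb888(c1)
    if c0 > c1:
        # Four-colour mode: two interpolated colours, all opaque.
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
        palette = [p0 + (255,), p1 + (255,), p2 + (255,), p3 + (255,)]
    else:
        # Three-colour mode: one midpoint colour plus transparent black.
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        palette = [p0 + (255,), p1 + (255,), p2 + (255,), (0, 0, 0, 0)]
    rows = []
    for y in range(4):
        row = []
        for x in range(4):
            idx = (indices >> (2 * (4 * y + x))) & 0x3
            row.append(palette[idx])
        rows.append(row)
    return rows
```

Every block decodes independently of its neighbours, which is what makes these formats practical to decode in hardware on a cache-line basis.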
Comment 3 Luke Kenneth Casson Leighton 2019-06-09 01:45:09 BST
(In reply to Jacob Lifshay from comment #2)

> This is one of the bigger tasks (several months of work) due to the sheer
> number of different texture formats that need to be implemented, so it will
> definitely need some budget, which I'll let Luke allocate.

 it sounds... like, to be honest, we need to have a serious think about
 that.
 
> Since I'm planning on implementing all the required formats in Rust as part
> of the Vulkan driver (since they need to work on x86 as well), we can split
> that out into a separate library to use for implementing the HW decoding as
> well. I think I could write it in such a way that we can generate both
> nmigen and LLVM IR from the same code.

 that would be really good.

> Additionally, the Rust code can be
> used as a Python library for testing purposes.

 as long as doing so does not interfere with formal proofs (clearly yosys
 cannot be made to critically depend on either python or a rust library).

 we have sfpy as a link to softfloat: "parallel mirroring" during simulations
 and checking simulated results against outside resources *is* the entire
 point of working in python, it's extremely convenient.

> I think we should support YCbCr textures, since they are commonly used in
> video formats.

 as we're doing a VPU as well, yes.

 yes on supporting as many texture formats as possible: we may however
 need to seriously prioritise.
Comment 4 Luke Kenneth Casson Leighton 2019-08-05 07:21:41 BST
Do we have a link to the relevant Vulkan API for textures? So we know what to interpolate?
Comment 5 Jacob Lifshay 2019-08-05 07:25:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> Do we have a link to the relevant Vulkan API for textures? So we know what
> to interpolate?

it's somewhat spread throughout the vulkan spec, will find relevant links.
Comment 6 Luke Kenneth Casson Leighton 2019-08-12 11:47:56 BST
https://www.khronos.org/registry/DataFormat/specs/1.1/dataformat.1.1.pdf
Comment 7 Jacob Lifshay 2019-10-03 14:00:49 BST
one thing to watch out for:
apparently ASTC may not be open source: https://www.phoronix.com/scan.php?page=news_item&px=ASTC-Restrictive-License
Comment 8 Luke Kenneth Casson Leighton 2019-10-03 14:24:21 BST
Ok good find.
Comment 9 Jacob Lifshay 2019-10-16 06:20:49 BST
(In reply to Jacob Lifshay from comment #7)
> one thing to watch out for:
> apparently ASTC may not be open source:
> https://www.phoronix.com/scan.php?page=news_item&px=ASTC-Restrictive-License

will need to confirm openness, Khronos claims it's royalty-free:
https://www.khronos.org/news/press/khronos-releases-atsc-next-generation-texture-compression-specification
Comment 10 Jacob Lifshay 2020-02-17 17:48:59 GMT
(In reply to Jacob Lifshay from comment #9)
> (In reply to Jacob Lifshay from comment #7)
> > one thing to watch out for:
> > apparently ASTC may not be open source:
> > https://www.phoronix.com/scan.php?page=news_item&px=ASTC-Restrictive-License
> 
> will need to confirm openness, Khronos claims it's royalty-free:
> https://www.khronos.org/news/press/khronos-releases-atsc-next-generation-texture-compression-specification

ASTC effectively is now open: https://www.phoronix.com/scan.php?page=news_item&px=Arm-ASTC-Encoder-Apache-2.0
Comment 11 Luke Kenneth Casson Leighton 2020-09-20 10:43:04 BST
phoronix discussion idea from xfcemint https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1207878-libre-soc-still-persevering-to-be-a-hybrid-cpu-gpu-that-s-100-open-source?p=1208141#post1208141

[QUOTE=xfcemint;n1208141]The problem with adding a texture unit is that it is a lot of work.

It is much, much easier to just use a special instruction for bilinear filtering.

So, for a start, perhaps it is a better idea to not use a texture unit.[/QUOTE]

this sounds exactly like the kind of useful strategy that would get us some reasonable performance without a full-on approach, and, as a hybrid processor it would fit better and it's also much more along the RISC strategy.  thank you for the suggestion.
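
For reference, the arithmetic such a bilinear filtering instruction would perform can be sketched in Python as below. This is only a software model under assumed conventions: the names `lerp` and `bilinear` and the argument order are invented for the example, and nothing here reflects an agreed opcode encoding.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b, with t in [0, 1]."""
    return a + (b - a) * t


def bilinear(t00, t10, t01, t11, fx, fy):
    """Blend four neighbouring texels.

    t00/t10 are the top-left/top-right texels, t01/t11 the bottom-left/
    bottom-right ones, each a tuple of channel values. fx and fy are the
    fractional parts of the sample position inside the 2x2 footprint.
    """
    top = tuple(lerp(a, b, fx) for a, b in zip(t00, t10))
    bottom = tuple(lerp(a, b, fx) for a, b in zip(t01, t11))
    return tuple(lerp(a, b, fy) for a, b in zip(top, bottom))
```

That is three lerps per channel, so a hardware version is mostly a handful of multiply-add stages.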
Comment 12 Luke Kenneth Casson Leighton 2020-09-20 10:44:20 BST
xfcemint https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1207878-libre-soc-still-persevering-to-be-a-hybrid-cpu-gpu-that-s-100-open-source?p=1208162#post1208162

You can replace the entire functionality of a texture unit by a few special instructions in your CPU:

1. a custom instruction for bilinear filtering (you already have that)
2. an instruction for 2x2 matrix transform
3. an instruction to load a 2x2 pixel block from a texture.

About item 3, you can do some complex stuff there if you want. For example, you can postulate that textures are stored as aligned 4x4 blocks of pixels, and the instruction has to handle that. The additional complexity is that the instruction may need to load pixel data from multiple pixel blocks.
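
A rough Python model of item 3 shows where the extra complexity comes from. The layout here is an assumption for illustration only: 4 bytes per texel, texels row-major inside each 4x4 block, and blocks row-major across the texture. The helper names are hypothetical, and there is no handling of texture edges.

```python
TILE = 4             # texels per block edge (assumed)
BYTES_PER_TEXEL = 4  # e.g. RGBA8 (assumed)
BLOCK_BYTES = TILE * TILE * BYTES_PER_TEXEL


def texel_offset(x, y, width_in_blocks):
    """Byte offset of texel (x, y) in a texture stored as 4x4 blocks."""
    block_x, in_x = divmod(x, TILE)
    block_y, in_y = divmod(y, TILE)
    block_index = block_y * width_in_blocks + block_x
    return block_index * BLOCK_BYTES + (in_y * TILE + in_x) * BYTES_PER_TEXEL


def load_2x2(texture, x, y, width_in_blocks):
    """Load the 2x2 footprint whose top-left texel is (x, y).

    Returns the four texels as ((t00, t10), (t01, t11)) together with the
    set of block indices that had to be touched.
    """
    quad = []
    blocks = set()
    for dy in (0, 1):
        row = []
        for dx in (0, 1):
            off = texel_offset(x + dx, y + dy, width_in_blocks)
            row.append(bytes(texture[off:off + BYTES_PER_TEXEL]))
            blocks.add(off // BLOCK_BYTES)
        quad.append(tuple(row))
    return tuple(quad), blocks
```

With this layout a footprint starting at (1, 1) stays inside one block, while one starting at (3, 3) touches four different blocks, which is the multi-block case mentioned above.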


There is one more thing required to replace the functionality of a texture unit: a texture cache. A texture cache should be shared by all GPU cores on a die, the cache is read-only, and it has direct access to memory. Each CPU-GPU has special LOAD instruction(s) to load data from the texture cache.

A texture cache does not need to be very fast (as long as your OoO engine can find other stuff to do while waiting for data from the cache). A benefit of a texture cache is that it reduces the required bandwidth to main memory.
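
As a thinking aid, a read-only texture cache can be modelled in a few lines of Python. This is a sketch under assumptions that are not in the suggestion above: the cache is direct-mapped, lines are 64 bytes, and the class and method names are invented for the example.

```python
class TextureCache:
    """Direct-mapped, read-only cache in front of a bytes-like memory."""

    def __init__(self, memory, line_bytes=64, num_lines=256):
        self.memory = memory
        self.line_bytes = line_bytes
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.lines = [b""] * num_lines
        self.hits = 0
        self.misses = 0

    def load(self, address, size):
        """Return `size` bytes at `address`; the read must not cross a line."""
        line_addr, offset = divmod(address, self.line_bytes)
        if offset + size > self.line_bytes:
            raise ValueError("read crosses a cache line boundary")
        slot = line_addr % self.num_lines
        if self.tags[slot] == line_addr:
            self.hits += 1
        else:
            # Miss: fetch the whole line from memory. Nothing is ever
            # written back, because texture data is treated as constant.
            self.misses += 1
            start = line_addr * self.line_bytes
            self.lines[slot] = bytes(self.memory[start:start + self.line_bytes])
            self.tags[slot] = line_addr
        return self.lines[slot][offset:offset + size]
```

Counting `hits` and `misses` while replaying a shader's texel addresses gives a first estimate of how much main-memory bandwidth such a cache would save.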
Comment 13 Luke Kenneth Casson Leighton 2020-09-20 11:12:14 BST
xfcemint https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1207878-libre-soc-still-persevering-to-be-a-hybrid-cpu-gpu-that-s-100-open-source?p=1208137#post1208137

A texture unit has read-only access to the memory. All textures are basically just a huge array of constants. I think you don't need to monitor that by the OoO unit, because there is absolutely no need to monitor constants. Even if a texture gets corrupted, there is no problem: it's just one pixel having a wrong color. Nobody notices that.

You need to have at most one texture unit per core. In your case, it will be exactly one texture unit, for simplification.

A texture unit can be pipelined or not pipelined. A non-pipelined unit would accept a SIMD request to produce about 8 samples.

The inputs for a request are:
- texture address,
- texture width and height in pixels, pitch,
- pixel format,
- (x,y) sample coordinates, x8 for 8 samples
- optionally, a transformation matrix for x,y coordinates

In some texture units, all the samples in a single request must be from the same texture. I think that is not strictly necessary, but it probably reduces the complexity of the texture unit.

A texture unit usually stores a 4x4 block of pixel data in a single cache line. The textures in GPU memory use the same format: 4x4 pixel blocks. Textures might also use a Lebesgue (Z-order) curve. So there are 16 pixels in a block, but they don't have to be in RGBA format. "Pixel format" can be something really crazy. That's how texture compression works. It reduces memory bandwidth.
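
For anyone unfamiliar with the Lebesgue curve ordering, it amounts to interleaving the bits of x and y (a Morton index), so that texels close together in 2D tend to be close together in memory. A small Python sketch, assuming 16-bit coordinates and with hypothetical function names:

```python
def part1by1(n):
    """Spread the low 16 bits of n so that bit i moves to bit 2*i."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_index(x, y):
    """Interleave x (even bits) and y (odd bits) into a Z-order index."""
    return part1by1(x) | (part1by1(y) << 1)
```

With this ordering the 16 texels of an aligned 4x4 block occupy 16 consecutive indices, so a block still maps onto one contiguous cache line.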

The problem of adding a texture unit to your design is to figure out how to keep it utilized, because shaders don't do texture sampling all the time. When shaders are doing something else, the texture unit is doing nothing, wasting power.
What latency should the texture unit have? Should it be a low-latency, SIMD, non-pipelined design, or a high-latency, pipelined design?
Comment 14 Luke Kenneth Casson Leighton 2020-09-20 15:47:16 BST
Wait, I have an even better idea.

Instead of having three separate instructions to replace a texture unit (bilinear interpolation, coordinate transform, LOAD from texture), you would be better off with a single instruction.

You add a custom instruction SAMPLE which does all three of those things together. Perhaps you can split the coordinate transform out into a separate instruction, if you find it necessary.

So a SAMPLE instruction needs all the inputs that I mentioned previously. As a result, it produces a bilinearly interpolated RGB(A) sample from a texture. Such an instruction would be a great fit for your architecture. It does a lot of work in a single instruction, so that reduces the instruction issue bottleneck and the pressure on registers. It would also be beneficial if there are 2-6 units for handling SAMPLE instructions, because it will have long latency. That would enable several SAMPLE instructions to be in flight at the same time.

It would work great with a texture cache that I previously described.

You can add this to that bug tracker.

Here is an even better variation:

A SAMPLE instruction takes as inputs:
- a pixel format of the texture
- the address of the texture in memory
- texture pitch
- (x,y) sample coordinates, floating point, in the texture native coordinate space

The result is an RGB(A) sample.
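
A software reference model of such a SAMPLE instruction could look roughly like the Python below. It is a sketch under several assumptions that are not part of the proposal: memory is a flat bytes object, the texture is stored linearly (row-major) rather than in 4x4 blocks, only a made-up "RGBA8" format is handled, integer coordinates land exactly on texels, and the caller keeps the 2x2 footprint inside the texture.

```python
import math


def sample(memory, address, pitch, pixel_format, x, y):
    """Return a bilinearly interpolated (r, g, b, a) tuple of floats.

    memory       -- bytes-like object standing in for main memory
    address      -- byte offset of texel (0, 0)
    pitch        -- bytes between the starts of consecutive rows
    pixel_format -- only the hypothetical "RGBA8" is modelled here
    x, y         -- floating-point coordinates in texel units
    """
    if pixel_format != "RGBA8":
        raise NotImplementedError(pixel_format)
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(tx, ty):
        off = address + ty * pitch + tx * 4
        return memory[off:off + 4]

    t00, t10 = texel(x0, y0), texel(x0 + 1, y0)
    t01, t11 = texel(x0, y0 + 1), texel(x0 + 1, y0 + 1)
    out = []
    for c in range(4):
        top = t00[c] + (t10[c] - t00[c]) * fx
        bottom = t01[c] + (t11[c] - t01[c]) * fx
        out.append(top + (bottom - top) * fy)
    return tuple(out)
```

A model like this could be run alongside simulations to check the hardware's output, in the same way sfpy is used against softfloat.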

Then, you also need a separate instruction to help compute the (x,y) sample coordinates, because they likely need to be converted to texture coordinate space.

Some textures (maybe all textures) are tiled on triangles (the texture has finite size, but it is tiled to get an infinite texture along both the x and y axes).

To support that option, the instruction for transforming into texture coordinates can take as inputs the (logical) texture width and height, and then perform the modulo operation on final coordinates (texture space) to produce the tiling effect.
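
The wrapping step can be sketched in Python as follows. The input convention is an assumption for illustration: (u, v) are normalised coordinates where 1.0 spans one full copy of the texture, and the function name is hypothetical.

```python
def to_texture_space(u, v, width, height):
    """Scale normalised (u, v) to texel units and wrap for tiling.

    Python's % on floats takes the sign of the divisor, so negative
    inputs wrap around to the far edge of the texture.
    """
    x = (u * width) % width
    y = (v * height) % height
    return x, y
```

For example, with an 8x4 texture the input (1.25, -0.25) lands on texel coordinates (2.0, 3.0), the same place as (0.25, 0.75).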