Texture maps need to be addressed by floating-point coordinates and interpolated.
The maps are large and regularly arranged, and loading them from memory is best covered by a separate instruction path.
Also, since interpolation is involved, they are best handled by special instructions which perform the interpolation without adding extra CPU load.
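As a sketch of what that interpolation involves, here is a minimal bilinear filter in plain Python (the function name and the greyscale 2-D list layout are illustrative assumptions, not part of any proposed ISA):

```python
import math

def bilinear_sample(tex, u, v):
    """Bilinear filtering sketch: weight the 2x2 texel neighbourhood
    around the floating-point texel coordinate (u, v).
    tex is a 2-D list of greyscale texel values (hypothetical layout)."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0                 # fractional weights
    x1 = min(x0 + 1, len(tex[0]) - 1)       # clamp at the texture edge
    y1 = min(y0 + 1, len(tex) - 1)
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```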
Background: earlier in the thread there is a discussion of how LD/ST memory AGEN misses are affected by the regularity of texture maps.
(Edit 12aug2019) extra context:
This is one of the bigger tasks (several months of work) due to the sheer number of different texture formats that need to be implemented, so it will definitely need some budget, which I'll let Luke allocate.
Since I'm planning on implementing all the required formats in Rust as part of the Vulkan driver (since they need to work on x86 as well), we can split that out into a separate library to use for implementing the HW decoding as well. I think I could write it in such a way that we can generate both nmigen and LLVM IR from the same code. Additionally, the Rust code can be used as a Python library for testing purposes.
I think we should support YCbCr textures, since they are commonly used in video formats.
We will need to decide which compressed texture formats to support, Vulkan requires supporting at least one of these three:
* BC formats (also called S3TC and DXTn/DXTC)
  - common on desktops
  - effectively required for OpenGL
  - oldest; I think the patents expired in 2018
* ETC2 and EAC formats
  - required for OpenGL
* ASTC LDR formats
  - common on mobile devices
  - best compression ratios
ASTC also has support for HDR formats.
If we can, I'd like to support all four.
(In reply to Jacob Lifshay from comment #2)
> This is one of the bigger tasks (several months of work) due to the sheer
> number of different texture formats that need to be implemented, so it will
> definitely need some budget, which I'll let Luke allocate.
it sounds... like, to be honest, we need to have a serious think about the budget here.
> Since I'm planning on implementing all the required formats in Rust as part
> of the Vulkan driver (since they need to work on x86 as well), we can split
> that out into a separate library to use for implementing the HW decoding as
> well. I think I could write it in such a way that we can generate both
> nmigen and LLVM IR from the same code.
that would be really good.
> Additionally, the Rust code can be
> used as a Python library for testing purposes.
as long as doing so does not interfere with formal proofs (clearly yosys
cannot be made to critically depend on either python or a rust library).
we have sfpy as a link to softfloat: "parallel mirroring" during simulations
and checking simulated results against outside resources *is* the entire
point of working in python, it's extremely convenient.
> I think we should support YCbCr textures, since they are commonly used in
> video formats.
as we're doing a VPU as well, yes.
yes on supporting as many texture formats as possible: we may however
need to seriously prioritise.
Do we have a link to the relevant Vulkan API for textures? So we know what to interpolate?
(In reply to Luke Kenneth Casson Leighton from comment #4)
> Do we have a link to the relevant Vulkan API for textures? So we know what
> to interpolate?
it's somewhat spread throughout the vulkan spec, will find relevant links.
one thing to watch out for:
apparently ASTC may not be open source: https://www.phoronix.com/scan.php?page=news_item&px=ASTC-Restrictive-License
Ok good find.
(In reply to Jacob Lifshay from comment #7)
> one thing to watch out for:
> apparently ASTC may not be open source:
will need to confirm openness, Khronos claims it's royalty-free:
(In reply to Jacob Lifshay from comment #9)
> (In reply to Jacob Lifshay from comment #7)
> > one thing to watch out for:
> > apparently ASTC may not be open source:
> > https://www.phoronix.com/scan.php?page=news_item&px=ASTC-Restrictive-License
> will need to confirm openness, Khronos claims it's royalty-free:
ASTC effectively is now open: https://www.phoronix.com/scan.php?page=news_item&px=Arm-ASTC-Encoder-Apache-2.0
phoronix discussion idea from xfcemint https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1207878-libre-soc-still-persevering-to-be-a-hybrid-cpu-gpu-that-s-100-open-source?p=1208141#post1208141
[QUOTE=xfcemint;n1208141]The problem with adding a texture unit is that it is a lot of work.
It is much, much easier to just use a special instruction for bilinear filtering.
So, for start, perhaps it is a better idea to not use a texture unit.[/QUOTE]
this sounds exactly like the kind of useful strategy that would get us reasonable performance without a full-on approach; as a hybrid processor it fits better, and it is also much more in line with the RISC strategy. thank you for the suggestion.
You can replace the entire functionality of a texture unit by a few special instructions in your CPU:
1. a custom instruction for bilinear filtering (you already have that)
2. an instruction for 2x2 matrix transform
3. an instruction to load a 2x2 pixel block from a texture.
About item 3, you can do some complex stuff there if you want. For example, you can postulate that textures are stored as aligned 4x4 blocks of pixels, and the instruction has to handle that. The additional complexity is that the instruction may need to load pixel data from multiple pixel blocks.
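To illustrate that complexity, here is a small sketch (hypothetical helper, Python for clarity) of which aligned 4x4 block origins a 2x2 pixel fetch can touch; a fetch that crosses block boundaries in both axes touches four blocks:

```python
def blocks_for_2x2(x, y, block=4):
    """Return the set of aligned block origins (top-left corners) that a
    2x2 pixel fetch at pixel (x, y) touches.  Illustrative sketch only:
    real hardware would compute this with shifts and masks."""
    origins = set()
    for dy in (0, 1):
        for dx in (0, 1):
            px, py = x + dx, y + dy
            origins.add((px // block * block, py // block * block))
    return origins
```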
There is one more thing required to replace the functionality of a texture unit: a texture cache. A texture cache should be shared by all GPU cores on a die, the cache is read-only, and it has direct access to memory. Each CPU-GPU has special LOAD instruction(s) to load data from the texture cache.
A texture cache does not need to be very fast (as long as your OoO engine can find other stuff to do while waiting for data from the cache). A benefit of a texture cache is that it reduces the required bandwidth to main memory.
A texture unit has read-only access to memory. All textures are basically just a huge array of constants. I think the OoO unit doesn't need to monitor that, because there is absolutely no need to monitor constants. Even if a texture gets corrupted, there is no problem: it's just one pixel having a wrong color. Nobody notices that.
You need to have at most one texture unit per core. In your case, it will be exactly one texture unit, for simplification.
A texture unit can be pipelined or not pipelined. A non-pipelined unit would accept a SIMD request to produce about 8 samples.
The inputs for a request are:
- texture address,
- texture width and height in pixels, pitch,
- pixel format,
- (x,y) sample coordinates, x8 for 8 samples
- optionally, a transformation matrix for x,y coordinates
In some texture units, all the samples in a single request must be from the same texture. I think that is not strictly necessary, but it probably reduces the complexity of the texture unit.
A texture unit usually stores a block of 4x4 pixel data in a single cache line. The textures in GPU memory use the same format: 4x4 pixel blocks. Textures might also use a Lebesgue (Z-order) curve. So, there are 16 pixels in a block, but they don't have to be in RGBA format. "Pixel format" can be something really crazy. That's how texture compression works. It reduces memory bandwidth.
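The Lebesgue (Z-order, a.k.a. Morton) curve mentioned above amounts to interleaving coordinate bits, which keeps spatially-close texels close in memory and hence in the same cache lines. A minimal sketch (an illustrative model, not the actual layout of any particular GPU):

```python
def morton_interleave(x, y, bits=16):
    """Interleave the bits of x and y into a Morton (Z-order) index.
    x contributes the even bit positions, y the odd ones.
    Illustrative software model; hardware does this with pure wiring."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```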
The problem of adding a texture unit to your design is to figure out how to keep it utilized, because shaders don't do texture sampling all the time. When shaders are doing something else, the texture unit is doing nothing, wasting power.
What latency should the texture unit have? Should it be a low-latency, SIMD, non-pipelined design, or a high-latency, pipelined design?
Wait, I have an even better idea.
Instead of having three separate instructions to replace a texture unit (bilinear interpolation, coordinate transform, LOAD from texture), you would be better off with a single instruction.
You add a custom instruction SAMPLE which does all three of the mentioned things together. Perhaps you could still split the coordinate transform back out as a separate instruction, if you find that necessary.
So a SAMPLE instruction needs all the inputs that I mentioned previously. As a result, it produces a bilinearly interpolated RGB(A?) sample from a texture. Such an instruction would be a great fit for your architecture. It does a lot of work in a single instruction, which reduces the instruction issue bottleneck and the pressure on registers. It would also be beneficial to have 2-6 units for handling SAMPLE instructions, because it will have long latency. That would enable several SAMPLE instructions to be in flight at the same time.
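A software reference model of such a SAMPLE instruction might look like the following Python sketch. It assumes a flat row-major RGB texture and clamp-to-edge addressing; these are illustrative choices, not the proposed encoding or the full input list (pixel format, pitch etc. are omitted):

```python
import math

def sample(tex, width, height, u, v):
    """Reference-model sketch of a SAMPLE instruction: fetch the 2x2
    texel neighbourhood of (u, v) and bilinearly interpolate.
    tex is a flat row-major list of RGB tuples (illustrative layout)."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0

    def texel(x, y):
        # clamp-to-edge addressing (one of several possible modes)
        x = max(0, min(x, width - 1))
        y = max(0, min(y, height - 1))
        return tex[y * width + x]

    def lerp(a, b, t):
        return tuple(ac * (1 - t) + bc * t for ac, bc in zip(a, b))

    top = lerp(texel(x0, y0), texel(x0 + 1, y0), fx)
    bot = lerp(texel(x0, y0 + 1), texel(x0 + 1, y0 + 1), fx)
    return lerp(top, bot, fy)
```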
It would work great with the texture cache that I previously described.
You can add this to that bug tracker.
Here is an even better variation:
A SAMPLE instruction takes as inputs:
- a pixel format of the texture
- the address of the texture in memory
- texture pitch
- (x,y) sample coordinates, floating point, in the texture native coordinate space
The result is an RGB(A) sample.
Then, you also need a separate instruction to help compute the (x,y) sample coordinates, because they likely need to be converted to texture coordinate space.
Some textures (maybe all textures) are tiled onto triangles (the texture has finite size, but it is tiled to get an infinite texture along both the x and y axes).
To support that option, the instruction for transforming into texture coordinates can take as inputs the (logical) texture width and height, and then perform the modulo operation on final coordinates (texture space) to produce the tiling effect.
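That modulo wrapping can be sketched in a couple of lines (illustrative Python, assuming per-axis repeat tiling; the function name is hypothetical):

```python
def to_tiled_texel(u, v, tex_w, tex_h):
    """Wrap (u, v) into the texture's native coordinate space with a
    per-axis modulo, producing the repeating-tile effect.  Python's %
    already yields a non-negative result for positive divisors, so
    negative coordinates wrap correctly too."""
    return (u % tex_w, v % tex_h)
```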