https://bugs.libre-soc.org/show_bug.cgi?id=230#c30
https://libre-soc.org/openpower/sv/example_dep_matrices/
https://libre-soc.org/openpower/sv/mv.vec/

A type of Function Unit which sits within the Vector regfile space, yet has connections to all four lanes: i.e. where normally there would be four separate parallel Vector FUs with no interconnection, this one has operand ports for all of them. This would total a whopping 12 incoming 64-bit operands and 12 outgoing ones. In the middle is a massive shift register that takes in data from all four lanes and can perform the incoming and outgoing merging and splitting. Different elwidths result in incoming and outgoing data either being shifted at different rates or simply taking longer, depending on resources and gate count.
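The lane-crossing behaviour described above can be sketched as a behavioural model (not RTL): four 64-bit slots, one per lane, rotating by one position per clock so that any lane eventually "sees" data that entered on another lane. The class and method names here are purely illustrative, not from the actual design.

```python
class CyclicLaneBuffer:
    """Behavioural sketch of a 4-lane cyclic shift-register buffer."""
    def __init__(self, nlanes=4):
        self.nlanes = nlanes
        self.slots = [0] * nlanes      # one 64-bit slot per lane

    def write(self, lane, value):
        self.slots[lane] = value & 0xFFFF_FFFF_FFFF_FFFF

    def read(self, lane):
        return self.slots[lane]

    def tick(self):
        # rotate: the slot visible to lane N becomes visible to lane N+1
        self.slots = [self.slots[-1]] + self.slots[:-1]

buf = CyclicLaneBuffer()
buf.write(0, 0xDEADBEEF)   # data arrives on Lane0
buf.tick()
buf.tick()                 # two clock cycles later...
assert buf.read(2) == 0xDEADBEEF   # ...Lane2 can read it
```

This also illustrates the latency trade-off mentioned later in the thread: data entering on Lane0 takes two cycles to become visible to Lane2.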
moved from bug #230

(In reply to Jacob Lifshay from bug 230 comment #52)
> So, I think variable shifts will be needed for other stuff, but we should
> just use the dedicated texture read instructions here.

arg, ok. So, just to be clear about this: a large amount of lane-crossing is not ok. We are not going to be putting in QTY N massive 64-bit 16-to-16 multi-ported crossbars, for example. However, if we do *not* do so, then the power consumption of moving data in and out of registers will also hit us.

Large shift registers and cyclic buffers, on the other hand, are a compromise that trades latency yet still provides throughput. Particularly given that 4-4-4 exists in VK (Vulkan), and I assume that is packed, we need to be able to unpack large sequences of those in a sane fashion. That basically means a large shift register, with the data "passing by" the appropriate 64-bit "Lane", extracting the R or the G or the B at the appropriate time-slot: i.e. not involving crossbar routing to get at the data.

The input operands are organised as:

* Lane0 src dest
* Lane1 src dest
* Lane2 src dest
* Lane3 src dest

and consequently if the 4-4-4 data comes in on Lane0 and needs to be converted to FP64 it will be another 2 clock cycles to shift to where Lane2 can read the B value. This is going to be sufficiently complex that we do not want to be duplicating it or underestimating it.

So, to emphasise here: LD-textures needs to be fundamentally aware that it cannot naively lane-cross its data. We will almost certainly have to use micro-coding: perform some straight aligned 64-bit LDs, then back-end that into the Large ShiftReg FSM to perform the actual extraction. If that's not sequential data it's going to be hell, sigh.
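The nibble extraction being described can be sketched as follows, on the assumption that "4-4-4" means consecutively packed 12-bit RGB pixels (4 bits per component, R in the lowest nibble); the actual in-memory component order in Vulkan formats may differ.

```python
def unpack_444(word, npixels):
    """Extract (r, g, b) nibbles from 12-bit pixels packed consecutively
    into a 64-bit word, lowest pixel first. Assumes R in bits 0-3,
    G in bits 4-7, B in bits 8-11 of each pixel (an illustrative layout)."""
    pixels = []
    for i in range(npixels):
        p = (word >> (12 * i)) & 0xFFF
        r = p & 0xF
        g = (p >> 4) & 0xF
        b = (p >> 8) & 0xF
        pixels.append((r, g, b))
    return pixels

# one pixel with R=1, G=2, B=3 packed as 0x321
assert unpack_444(0x321, 1) == [(1, 2, 3)]
```

In the shift-register scheme above, each lane would perform one of these nibble extractions at the time-slot when the relevant pixel data passes by, rather than all lanes reading through a crossbar simultaneously.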
https://libre-soc.org/openpower/isa/fixedshift/

Fixed shift (1-in 1-out) can fit into the FSM as well.
(In reply to Luke Kenneth Casson Leighton from comment #1)
> and consequently if the 4-4-4 data comes in on Lane0 and needs to be
> converted to FP64 it will be another 2 clock cycles to shift to where Lane2
> can read the B value.

Texture read won't need to produce fp64 values unless the pixels are stored in fp64 (very uncommon).

Plan on texture reads being able to efficiently be random access, though in practice texture accesses will often be near previous accesses (though not sequential) so caching will be needed.
(In reply to Jacob Lifshay from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > and consequently if the 4-4-4 data comes in on Lane0 and needs to be
> > converted to FP64 it will be another 2 clock cycles to shift to where Lane2
> > can read the B value.
>
> Texture read won't need to produce fp64 values unless the pixels are stored
> in fp64 (very uncommon).

The common data types produced by a texture read are f32 and f16 (only if f16 arithmetic is supported), and signed/unsigned 32-bit integer types.
(In reply to Jacob Lifshay from comment #3)
> Texture read won't need to produce fp64 values unless the pixels are stored
> in fp64 (very uncommon).

whew.

> Plan on texture reads being able to efficiently be random access,

ok so if it's a single access, then that can go into the standard Vector FUs (as a single VL=1) or Scalar FUs, no problem.

a bit of resources will be wasted, you'll be reading say 16 bits and dropping them into the lower word of a 64 bit reg. byte-level masking on the regfile takes care of that.

so all no problem there: the reg #s can be allocated to make sure that RT%4 == RA%4 == RB%4 (the conditions for getting into the Vector FUs, even if VL=1). or, if that's not a priority, just be happy that the LD went into the scalar FUs and accept the performance hit.

the problem comes when reading a longer sequence (VL>1, SUBVL>1) where the operation crosses beyond a 64-bit (single reg). this may be more of an issue for Video/Audio than it is for Texturisation.

> though in
> practice texture accesses will often be near previous accesses (though not
> sequential) so caching will be needed.

yes.
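The lane-alignment condition quoted above can be expressed as a one-line check; this is an illustration of the condition only, not actual register-allocator code, and the function name is made up.

```python
def same_lane(rt, ra, rb, nlanes=4):
    """True if RT, RA and RB all map to the same 64-bit lane, i.e.
    RT % 4 == RA % 4 == RB % 4 -- the condition for issuing a VL=1
    operation to the per-lane Vector FUs instead of the scalar FUs."""
    return rt % nlanes == ra % nlanes == rb % nlanes

assert same_lane(4, 8, 12)       # r4, r8, r12 all land in lane 0: Vector FU ok
assert not same_lane(4, 9, 12)   # r9 lands in lane 1: fall back to scalar FU
```

An allocator that cannot satisfy this constraint simply issues the LD to the scalar FUs and accepts the performance hit, as described above.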
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #3)
>
> > Texture read won't need to produce fp64 values unless the pixels are stored
> > in fp64 (very uncommon).
>
> whew.
>
> > Plan on texture reads being able to efficiently be random access,
>
> ok so if it's a single access, then that can go into the standard Vector FUs
> (as a single VL=1) or Scalar FUs, no problem.
>
> a bit of resources will be wasted, you'll be reading say 16 bits and
> dropping them into the lower word of a 64 bit reg. byte-level masking on
> the regfile takes care of that.
>
> so all no problem there: the reg #s can be allocated to make sure that RT%4
> == RA%4 == RB%4 (the conditions for getting into the Vector FUs, even if
> VL=1). or, if that's not a priority, just be happy that the LD went into
> the scalar FUs and accept the performance hit.
>
> the problem comes when reading a longer sequence (VL>1, SUBVL>1) where the
> operation crosses beyond a 64-bit (single reg).

That's the least of your problems, since each texture access can access 4 or 8 (or even more -- think 128 for anisotropic filtering) texels and linearly interpolate between them. We will probably want a dedicated texture decode pipeline, since texture access is the most critical part for GPU speed since multiple of those full texture accesses can occur per drawn pixel.

Luckily texture interpolation only happens for fp16/fp32 results, so integer results are just 1 texel access.
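The 4-texel case described here is standard bilinear filtering: a single texture access reads a 2x2 quad of neighbouring texels and linearly interpolates between them. A minimal sketch, with fx and fy being the fractional sample coordinates within the quad:

```python
def bilerp(t00, t10, t01, t11, fx, fy):
    """Bilinearly interpolate a 2x2 texel quad.
    t00/t10 are the top row, t01/t11 the bottom row;
    fx, fy in [0, 1] select the sample position within the quad."""
    top = t00 * (1.0 - fx) + t10 * fx   # lerp along x, top row
    bot = t01 * (1.0 - fx) + t11 * fx   # lerp along x, bottom row
    return top * (1.0 - fy) + bot * fy  # lerp the two rows along y

assert bilerp(0.0, 1.0, 0.0, 1.0, 0.5, 0.5) == 0.5   # midway in x
assert bilerp(0.0, 0.0, 1.0, 1.0, 0.0, 1.0) == 1.0   # exactly bottom row
```

The 8-texel case (trilinear) adds one more lerp between two such quads from adjacent mipmap levels; anisotropic filtering repeats the whole thing across many sample positions, which is why texel counts per access can climb towards 128.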
(In reply to Jacob Lifshay from comment #6)
> That's the least of your problems, since each texture access can access 4 or
> 8 (or even more -- think 128 for anisotropic filtering) texels and linearly
> interpolate between them. We will probably want a dedicated texture decode
> pipeline, since texture access is the most critical part for GPU speed since
> multiple of those full texture accesses can occur per drawn pixel.

ok, let's discuss that under a separate bugreport... there's a texture one around somewhere... bug #91. keep an eye on this one from here.

> Luckily texture interpolation only happens for fp16/fp32 results, so integer
> results are just 1 texel access.

wheww