Bug 230 - Video opcode development and discussion
Summary: Video opcode development and discussion
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: cand
URL:
Depends on: 557
Blocks: 137 231
Reported: 2020-03-13 09:59 GMT by cand
Modified: 2021-01-18 15:22 GMT
CC List: 4 users

See Also:
NLnet milestone: NLNet.2019.Video
total budget (EUR) for completion of task and all subtasks: 4000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 137
child tasks for budget allocation: 557
The table of payments (in EUR) for this task; TOML format:


Description cand 2020-03-13 09:59:54 GMT
Video opcode development and discussion is needed, as well as research
and informal write-up.

https://libre-soc.org/openpower/sv/av_opcodes/
Comment 1 Luke Kenneth Casson Leighton 2020-12-08 14:12:49 GMT
Lauri we should kick this one into high gear and establish a wiki page for it, with sections for audio and video.

from when i looked closely at the AndesSTAR DSPs which are designed for audio processing i noticed a predominance of word-swapping (which can use swizzle in our case), clipping and saturation, as well as some operations which were effectively 17-bit ADD followed by shift-right by one:

    res = (x + y) // 2

fascinatingly the SIMD VSX openpower ISA also has similar operations.
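For reference, the `(x + y) // 2` averaging can be done in plain C without the 17-bit intermediate, using the well-known carry-save identity (a sketch only; `avg_u16` is an illustrative name, not an ISA op):

```c
#include <stdint.h>

/* floor((x + y) / 2) without widening past 16 bits:
   the shared bits (x & y) are the carry, the differing
   bits (x ^ y) contribute half their value. */
static inline uint16_t avg_u16(uint16_t x, uint16_t y) {
    return (x & y) + ((x ^ y) >> 1);
}
```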

i also noted that VSX has 16-bit (1555 RGB) to 32-bit (8888 RGB) bidirectional instructions.
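A scalar sketch of one direction of that conversion (the 1-5-5-5 field order and the bit-replication widening are assumptions for illustration, not the VSX definition):

```c
#include <stdint.h>

/* Assumed layout: bit 15 = A, bits 14-10 = R, 9-5 = G, 4-0 = B. */
static inline uint32_t rgb1555_to_8888(uint16_t p) {
    uint32_t a = (p >> 15) & 0x1;
    uint32_t r = (p >> 10) & 0x1F;
    uint32_t g = (p >> 5)  & 0x1F;
    uint32_t b =  p        & 0x1F;
    /* widen 5 -> 8 bits by replicating top bits so 31 maps to 255 */
    r = (r << 3) | (r >> 2);
    g = (g << 3) | (g >> 2);
    b = (b << 3) | (b >> 2);
    a = a ? 0xFFu : 0x00u;
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```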

can we start to go through these and establish, at the *scalar* level, what you will need.
Comment 2 cand 2020-12-08 15:07:28 GMT
Before the binutils and simulator support exist, I can't write the baseline simple-v code and so see which new instrs would be of most use. Otherwise it's just guesses, though I mentioned saturated math way back already. Or are you asking which parts of VSX are useful and which aren't?

The pixel VSX type is very rarely used. Personally I've never used it. Hardly anybody uses 16-bit color spaces nowadays; outputs are 24-bit (8-8-8) or even higher like 30-bit (10-10-10). It may not even have much use for textures: when a game needs to save bandwidth it probably uses 5-6-5, 8-8 or fp16, not 5-5-5-1.
Comment 3 Luke Kenneth Casson Leighton 2020-12-10 14:49:31 GMT
(In reply to cand from comment #2)
> Before the binutils and simulator support exist, can't write the baseline
> simple-v code and so see which new instrs would be of most use. 

well... given that effectively SV is a hardware for-loop around scalar
code, could we start with the scalar inner loops, compile them to standard scalar assembler, and take a look?

(more to the point: if that's *not* possible then it is a "red flag")

i was going to ask you if you could put the core inner loop of each algorithm onto wiki pages (in straight c) as a first step.

> Otherwise
> it's just guesses, though I mentioned saturated math way back already.

ok. yes. although it's now "time" to start listing them.

> Or are you asking which parts of VSX are useful and which aren't?

mmm in terms of extracting from VSX, we're going to get questions,
"why are you not just using VSX" and for that i'd like - at some point -
to have a list of "equivalent ops" where anyone familiar with VSX can
see the SV alternatives, as (a) an easy cheat-sheet (b) an insight showing
that there is in fact direct equivalence (even if macro-op fusion is needed)

> The pixel VSX type is very rarely used. Personally I've never used it.
> Hardly anybody uses 16-bit color spaces nowadays, outputs are 24 (8-8-8) or
> even higher like 30 bit (10-10-10). It may not even have much use for
> textures, when a game needs to save bw it probably uses 5-6-5, 8-8 or fp16,
> not 5-5-5-1.

yes i kinda got that impression, i was wondering "where the heck are all the
other encodings" and the original (VMX) was developed between 1996 and 1998 (!) with versions updating that in 2006 (v2.03) and again in 2009 (v2.06).

what i'd like to do here is to check that general-purpose bitmanip could be used to cover all these types of pixel encodings, and to do a quick check / estimate on the pixels/clock rate.
Comment 4 cand 2020-12-10 15:31:59 GMT
> given that effectively SV is a hardware for-loop around scalar
> code, could we start with the scalar inner loops, compile them to standard 
> scalar assembler, and take a look?
> i was going to ask you if you could put the core inner loop of each algorithm 
> onto wiki pages (in straight c) as a first step.

I don't think that's an efficient use of time.

The paradigms in the C and SIMD code for the loops often differ widely. The C is for understanding and correctness, SIMD for whatever is fastest for the platform. As such, basing the instruction guesses on the C could be suboptimal.

The other part is complexity. The more developed codecs are highly complex, and having to page the stuff in and out of my head, without being able to write the actual code how it'd be targeted (SV, not PPC scalar, VSX, etc), would not be efficient.

I can go over the VSX funcs and check whether the idea under each is useful.

> what i'd like to do here is to check that general-purpose bitmanip could be 
> used to cover all these types of pixel encodings, and to do a quick check / 
> estimate on the pixels/clock rate.

It's just shifts, ands and ors. Mul, madd, saturated add, narrowing and possibly LUTs if doing different colorspaces or gamma correction. All trivially parallelizable, so it depends on how many ops the unit can do in parallel.
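As a concrete instance of the "shifts, ands and ors" pattern, here is a scalar 8-8-8 to 5-6-5 pack (a sketch; the truncating conversion and the `pack_rgb565` name are illustrative):

```c
#include <stdint.h>

/* Pack 8-8-8 down to 5-6-5 by truncation:
   three shift/and field extracts joined by ORs. */
static inline uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```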
Comment 5 cand 2020-12-10 18:24:34 GMT
The useful VSX parts are now on the wiki.
Comment 6 Luke Kenneth Casson Leighton 2020-12-10 19:05:12 GMT
(In reply to cand from comment #4)
> > given that effectively SV is a 
> I don't think that's an efficient use of time.
> 
> The paradigms in the C and SIMD code for the loops often differ widely. The
> C is for understanding and correctness, SIMD for whatever is fastest for the
> platform. As such, basing the instruction guesses on the C could be
> suboptimal.

some SIMD ops which are highly weird yet do valuable tasks, i totally get.  others are simply "N scalar ops".

needs thought at my end.  gimme time.

 
> The other part is complexity. The more developed codecs are highly complex,
> and having to page the stuff in and out of my head, without being able to
> write the actual code how it'd be targeted (SV, not PPC scalar, VSX, etc),
> would not be efficient.
> 
> I can go over the VSX funcs and check whether the idea under each is useful.
> 
> > what i'd like to do here is to check that general-purpose bitmanip could be 
> > used to cover all these types of pixel encodings, and to do a quick check / 
> > estimate on the pixels/clock rate.
> 
> It's just shifts, ands and ors. 

my concern is: how many? if it's 4 instructions in an inner loop, the achievable pixel/clock rate could well be 25% of what a single dedicated op doing the same would reach.

but if pixel conversion is only say 1% of the total of an outer loop, that doesn't matter.  if however the pixel conversion is 30% of total time *now* it matters.


> Mul, madd, saturated add, narrowing and
> possibly LUTs if doing different colorspaces or gamma correction. All
> trivially parallelizable, so it depends on how many ops the unit can do in
> parallel.

this reminds me of the Khronos colour conversion which is nonlinear.  jacob mentioned creating a LUT operation to cover this.

(In reply to cand from comment #5)
> The useful VSX parts are now on the wiki.

saw. thank you.  will look at it later, got a meeting now
Comment 7 Jacob Lifshay 2020-12-10 19:17:29 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> this reminds me of the Khronos colour conversion which us nonlinear.  jacob
> mentioned creating a LUT operation to cover this.

yeah, 8-bit sRGB <-> linear
Comment 8 Jacob Lifshay 2020-12-10 19:20:18 GMT
(In reply to cand from comment #5)
> The useful VSX parts are now on the wiki.

Did you go through just the specifically VSX instructions or also the vector instructions? IIRC there are some important instructions that aren't listed under VSX -- mostly ones that use the full 128-bit register as a single value -- like AES encrypt step.
Comment 9 Luke Kenneth Casson Leighton 2020-12-11 00:48:46 GMT
(In reply to cand from comment #5)
> The useful VSX parts are now on the wiki.

done review, listed opportunities for macro op fusion, where scalar ops are missing, where SV covers them because the scalar op exists (madd, popcnt)

(In reply to Jacob Lifshay from comment #8)

> instructions? IIRC there are some important instructions that aren't listed
> under VSX -- mostly ones that use the full 128-bit register as a single
> value -- like AES encrypt step.

this bugreport is for evaluating audio and video, so i would not expect Lauri to review the Rijndael sub-functions.

at some point we will take what is learned here and merge it into a full ISA proposal page.
Comment 10 Jacob Lifshay 2020-12-11 01:04:01 GMT
(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #8)
> 
> > instructions? IIRC there are some important instructions that aren't listed
> > under VSX -- mostly ones that use the full 128-bit register as a single
> > value -- like AES encrypt step.
> 
> this bugreport is for evaluating audio and video, so i would not expect
> Lauri to review the Rijndael sub-functions.

yup, just couldn't think of a better example.

> at some point we will take what is learned here and merge it into a full ISA
> proposal page.

at which point AES step will be included.
Comment 11 cand 2020-12-11 07:53:46 GMT
> my concern is: how many? if that's 4 instructions in an inner loop then that 
> could well be 25% of the pixel/clock ratio that could be achieved if there was 
> a dedicated op that did the same.

Talking from ffmpeg VSX experience, if everything else is accelerated but not the final conversion, it does take 30-40%. But that's scalar. When SIMDed, it gets almost perfect scaling, say 7.8x when perfect is 8x. So in parallel, it's then reduced to ~5% with just 8 units.

The point: hardcoding the conversion may bring a constant speedup of say 4x. Parallelization brings a speedup of N, where N is how many ops the cpu can do in parallel. So for most formats, just have many units. (for specific heavy and very often used things, like yuv <-> rgb, we do want both)

> Did you go through just the specifically VSX instructions or also the vector 
> instructions? IIRC there are some important instructions that aren't listed 
> under VSX -- mostly ones that use the full 128-bit register as a single value -
> - like AES encrypt step.

I went through the full 3.0/power9 vector instr set (altivec + vsx + exts), but ignored crypto since it's not relevant to AV. AES/SHA/etc accel belongs to other parts. The power8 and power9 additions are mostly just new types to existing functions.
Comment 12 cand 2020-12-11 08:05:40 GMT
Commented on the horizontal mapreduce.

> That has the issue that's it's a massive PITA to code, plus it's slow. Plus 
> there's the "access to non-4-offset regs stalls". Even if there's no ready 
> operation, it should be made easier and faster than a manual mapreduce loop.
Comment 13 Luke Kenneth Casson Leighton 2020-12-11 16:24:20 GMT
(In reply to cand from comment #12)
> Commented on the horizontal mapreduce.
> 
> That has the issue that's it's a massive PITA to code, plus it's slow. Plus 
> there's the "access to non-4-offset regs stalls".

i realised i made an error in the assessment.  let us assume that there is a (very long) vector in memory that needs to be mapreduced/multiplied.  N=1000 for example.  the implementation would be a pair of while-loops, where the MVL (Max Vector Length) is set to a *fixed* amount, taking up a fixed range of the register file.

* the inner loop simply performs standard VL-driven multiplies, 8 at a time (or
  whatever MVL is set to) storing results in a secondary array

* at the end of that inner loop, the remaining (non-power-of-two) amount which
  can be anywhere between 0 and 7 results in a perfectly-acceptable SIMD
  auto-predicated operation.  no complexity at the source code level, here.

* the outer loop now kicks in, making the secondary array the primary array
  and entering the inner loop, once again, but with *half* the number of
  elements.

* finally, N (the length of our ever-halving array) reaches a value between
  0 and 7, at which point we end up with some rather small VLs.  this is
  the *only* point at which a little more thought is needed, to deal with
  things like 6 3 2 1.

* if N was in fact limited to e.g. 4 rather than 8, the "last cleanup" is
  extremely small.

* (actually for the very small end-of-loop it would probably be better to
  just do a series of accumulator multiplies and not worry about it, let
  the OoO issue engine deal with it).
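The while-loop structure described above can be sketched in plain C (a software sketch only: it reduces in place rather than using the secondary array, and the inner loop's trip count stands in for VL = MIN(MVL, remaining)):

```c
#include <stddef.h>

/* Halving map-reduce multiply: each pass does pairwise multiplies
   into the front of the array, halving n until one element remains.
   An odd leftover element is folded into the last pair's result. */
static double reduce_mul(double *a, size_t n) {
    if (n == 0) return 1.0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++)   /* the VL-driven inner loop */
            a[i] = a[2 * i] * a[2 * i + 1];
        if (n & 1)                          /* odd leftover element */
            a[half - 1] *= a[n - 1];
        n = half;                           /* outer loop: half the elements */
    }
    return a[0];
}
```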

coding that up at the *hardware* level fills me with some... trepidation
(see below).

if it is at the *software* level, note that there is *no* explicit decision
to reduce VL by half each time (this was a mistake to suggest doing that),
you use the fact that VL is set to MIN(MVL, RA) where RA is the "desired" (maximum) number of elements wished to have multiplies executed.


>  Even if there's no ready 
> operation, it should be made easier and faster than a manual mapreduce loop.

sigh :)  the issue is that we've got a hugely complex system already, where
SV implements, in between the issue and execution phase, the following loop:

function op_add(rd, rs1, rs2) # add not VADD!
  int i, id = 0, irs1 = 0, irs2 = 0;
  rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
  for (i = 0; i < VL; i++)
    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; # <- actual operation here
    if (int_vec[rd ].isvector)  { id += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }

applying a mapreduce algorithm *could* be done but it's yet another outer loop
even on top of that.

and, more than that: where do you store the intermediary results used in between levels in the mapreduce?  (they certainly can't go in the regfile)

how do you deal with the dependencies on those intermediaries?

what happens if there's an interrupt in the middle of that?  context-switching needs to swap *everything*... where do those intermediaries go?  (they can't go into the regfile)

so the answers to these questions - the complexity that's introduced when trying to implement it - is why i've stayed away from this one.

to put a ballpark figure on timescales for a proper implementation (start to finish, including Specifications) i'd put about 4-5 months on this one task alone.

for dotproduct we may *have* to implement something like it, but this is an outlier and, again, i'm inclined to defer it because of the complexity.
Comment 14 cand 2020-12-11 16:33:33 GMT
Consider how long it took for you to think it through and how long the text describing it is. Now consider it is a one-line operation in SIMD ISAs. This means nobody will bother coding it in SV, handicapping it.

Now, I'm not saying we have to have it fully in hw. A mid-step that just does it for 4 or 8 elements, in one clock, would be close enough. That would hopefully also side-step the slowness in non-4-reg-stalls. That's still a loop, but just a simple one-level loop, no complexity.
Comment 15 Jacob Lifshay 2020-12-11 17:10:38 GMT
If you're trying to do a giant sum-reduction and don't care that much about the exact order of add ops, the code I've seen that should be most efficient is:

const size_t HW_PAR = 32; // the number of add ops per inner loop that the hw needs to keep the pipeline full

using vec = sv_vec<float, HW_PAR>;

float reduce_add(float *in, size_t in_size) {
    // optionally do a single shorter horizontal_add for in_size < HW_PAR
    vec accumulator = splat(0.0f, VL=HW_PAR);
    while(in_size != 0) {
        size_t vl = min(HW_PAR, in_size);
        vec v = load(in, VL=vl);
        // elements with index >= vl are unmodified
        accumulator = add(accumulator, v, VL=vl);
        in += vl;
        in_size -= vl;
    }
    return horizontal_add(accumulator, VL=HW_PAR);
}
Comment 16 Luke Kenneth Casson Leighton 2020-12-11 21:13:00 GMT
(In reply to Jacob Lifshay from comment #15)
> If your trying to do a giant sum-reduction and don't care that much about
> the exact order of add ops, the code I've seen that should be most efficient
> is:

an algorithm that performs a fixed set of parallel laned ADDs.  duh.  so simple.

(In reply to cand from comment #14)
> Consider how long it took for you to think it through and how long the text
> describing it is.

don't tell no-one, i'm not actually very good at coming up with algorithms on-the-fly.  usually it takes several days to weeks of thought :)

> Now consider it is a one-line operation in SIMD ISAs.

SIMD ISAs have the "advantage" of being able to hard-code the mapreduce tree (or parallel accumulator algorithm) because the SIMD size is fixed and known.

> This
> means nobody will bother coding it in SV, handicapping it.

due to the underlying hardware complexity i am still inclined to defer it unless it can be shown to be a significant performance hindrance by not having it.

keep it in mind: if you find an algorithm that's severely compromised, performance-wise, we can look at it.

that's the whole development ethos here: focus first on optimising the low hanging fruit.


> Now, I'm not saying we have to have it fully in hw. A mid-step that just
> does it for 4 or 8 elements, in one clock, would be close enough.

ok so there *is* one possibility there: a special vec4-only operation that takes 4x elements and multiplies them all together, targeting a non-vec4 dest.

however these would be very specialist operations, that i would like to defer until "Phase 2" i.e. after the initial implementation of SV looping.
Comment 17 Jacob Lifshay 2020-12-11 21:24:57 GMT
(In reply to Luke Kenneth Casson Leighton from comment #16)
> SIMD ISAs have the "advantage" of being able to hard-code the mapreduce tree
> (or parallel accumulator algorithm) because the SIMD size is fixed and known.

That can be done in SV as well, since MAXVL is fixed and known, it's just less efficient unless VL is close to MAXVL.
Comment 18 Luke Kenneth Casson Leighton 2020-12-11 22:47:00 GMT
(In reply to Jacob Lifshay from comment #17)
> (In reply to Luke Kenneth Casson Leighton from comment #16)
> > SIMD ISAs have the "advantage" of being able to hard-code the mapreduce tree
> > (or parallel accumulator algorithm) because the SIMD size is fixed and known.
> 
> That can be done in SV as well, since MAXVL is fixed and known, 

in RVV, yes, MAXVL is hardcoded to the number of microarchitectural lanes.

in SV, MAXVL sets the maximum length of the allocation of the regfile that may be used for the vector operation *and may be set to anywhere between 1 and 64* at runtime, at any time.  MAXVL=1 says "all ops are flatlined to scalar".

consequently, no, MAXVL may not be assumed to be hard-coded to 4, 8, or 16, as would be the case for RVV or any SIMD microarchitecture.

thus we cannot hardcode a series of mapreduce crossbars or allocate some hardcoded lane paths (called "wires"), like a SIMD microarchitecture would.

instead we need a micro-coded approach, and that's the CISC path which i would prefer we deprioritise until we get more information.
Comment 19 Jacob Lifshay 2020-12-11 23:16:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #18)
> (In reply to Jacob Lifshay from comment #17)
> > (In reply to Luke Kenneth Casson Leighton from comment #16)
> > > SIMD ISAs have the "advantage" of being able to hard-code the mapreduce tree
> > > (or parallel accumulator algorithm) because the SIMD size is fixed and known.
> > 
> > That can be done in SV as well, since MAXVL is fixed and known, 
> 
> in RVV, yes, MAXVL is hardcoded to the number of microarchitectural lanes.
> 
> in SV, MAXVL sets the maximum length of the allocation of the regfile that
> may be used for the vector operation *and may be set to anywhere between 1
> and 64* at runtime, at any time.  MAXVL=1 says "all ops are flatlined to
> scalar".

MAXVL is fixed and known at compile time, not at cpu-design-time. The register allocator has to allocate the backing registers after all. And, no, changing maxvl at runtime is *not* correct, since the compiler expects it to be set to a decided-by-the-compiler constant for every instruction in the program, where it can be a different constant for different instructions.

since the compiler knows the maxvl, it *can* generate a reduction tree at compile time. it doesn't need to be a power-of-2.

Adding a microcoded reduce op will allow optimizing for VL which is a runtime-variable instead of maxvl, as well as reducing extraneous moves and reducing power. Also, a hw-level reduce op could use a 4-way add-reduce ALU, kinda like the adder tree part of a wallace tree multiplier.
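The compile-time reduction tree described above, using 4-way adds per level (mirroring a 4-way add-reduce ALU), might look like this in plain C. `MAXVL` here is an illustrative constant standing in for the compiler-decided value, deliberately not a power of 2:

```c
#include <stddef.h>

enum { MAXVL = 10 };   /* stand-in for the compile-time-known maxvl */

/* Level-by-level tree reduction: each level sums groups of up to 4
   elements (one 4-way add per group) until one value remains. */
static float tree_reduce_add(const float *x) {
    float buf[MAXVL];
    size_t n = MAXVL;
    for (size_t i = 0; i < n; i++) buf[i] = x[i];
    while (n > 1) {
        size_t out = 0;
        for (size_t i = 0; i < n; i += 4) {
            float s = buf[i];
            for (size_t j = 1; j < 4 && i + j < n; j++)
                s += buf[i + j];
            buf[out++] = s;
        }
        n = out;
    }
    return buf[0];
}
```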
Comment 20 Luke Kenneth Casson Leighton 2020-12-11 23:51:17 GMT
(In reply to Jacob Lifshay from comment #19)

> MAXVL is fixed and known at compile time, not at cpu-design-time.

sorry, we're talking at cross-purposes.  at compile-time, yes, however there will be different MAXVLs for different programs.  i was talking in general about the fact that SIMD is effectively a hard-fixed MAXVL.

if we were to say "hmm we expect that *most* operations will have a MAXVL of 16 therefore we can create a hard-coded mapreduce network of wires to use for performing hard-coded 16-long mapreduces" this is not going to fly.

> Adding a microcoded reduce op will allow optimizing for VL which is a
> runtime-variable instead of maxvl, as well as reducing extraneous moves and
> reducing power. Also, a hw-level reduce op could use a 4-way add-reduce ALU,
> kinda like the adder tree part of a wallace tree multiplier.

indeed... and, again: back to comment #13 this requires that the micro-coded intermediate ops have a storage location for the temporary results.

(putting the temporary results in the regfile is not acceptable)

thinking that through is an exercise that i have done many times.  all designs involving temporary result storage require far more complexity than i am comfortable with committing to within an already heavily pressured timeframe where we are at least 8 months behind where we should be.
Comment 21 Cole Poirier 2020-12-12 00:03:47 GMT
(In reply to Luke Kenneth Casson Leighton from comment #20)
>
> indeed... and, again: back to comment #13 this requires that the micro-coded
> intermediate ops have a storage location for the temporary results.
> 
> (putting the temporary results in the regfile is not acceptable)
> 
> thinking that through is an exercise that i have done many times.  all
> designs involving temporary result storage require far more complexity than
> i am comfortable with committing to within an already heavily pressured
> timeframe where we are at least 8 months behind where we should be.

Would scratch-pad memories be applicable in this instance?
Comment 22 Luke Kenneth Casson Leighton 2020-12-12 01:23:53 GMT
(In reply to Cole Poirier from comment #21)

> > thinking that through is an exercise that i have done many times.  all
> > designs involving temporary result storage require far more complexity than
> > i am comfortable with committing to within an already heavily pressured
> > timeframe where we are at least 8 months behind where we should be.
> 
> Would scratch-pad memories be applicable in this instance?

probably depends on the context for which you're thinking scratchpad memories might be useful.

if they are for a single dedicated purpose, for use by one thread and one thread only, then they need not be protected by context-switching or anything at all.

any such scratchpad memory, if considered for use for the explicit purpose of context-switching (to save for example the above-mentioned "temporary registers") then its designation actually switches from "scratchpad memory" to "a type of register file".  which we already said isn't ok to add, because of the complexity it introduces in its management during context-switching.

the other option is to throw away the entirety of the micro-op up until that point, and restart it once the context-switch returns.  i dislike these types of wasteful "solutions".... and, again, they actually have to have resources dedicated to implementing them.

*whatever* is proposed, there will be an evaluation cost, a design cost, an implementation cost and a testing cost.

we do not have time for that on something that we have no input yet as to whether it is 1% used, 10% used or 30% used.
Comment 23 Cole Poirier 2020-12-12 04:19:04 GMT
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Cole Poirier from comment #21)
> 
> > > thinking that through is an exercise that i have done many times.  all
> > > designs involving temporary result storage require far more complexity than
> > > i am comfortable with committing to within an already heavily pressured
> > > timeframe where we are at least 8 months behind where we should be.
> > 
> > Would scratch-pad memories be applicable in this instance?
> 
> probably depends on the context for which you're thinking scratchpad
> memories might be useful.
> 
> if they are for a single dedicated purpose, for use by one thread and one
> thread only, then they need not be protected by context-switching or
> anything at all.

This is indeed what I was thinking of.


> any such scratchpad memory, if considered for use for the explicit purpose
> of context-switching (to save for example the above-mentioned "temporary
> registers") then its designation actually switches from "scratchpad memory"
> to "a type of register file".  which we already said isn't ok to add,
> because of the complexity it introduces in its management during
> context-switching.

Was not thinking of this; I don't have an idea of how to solve the context-switching state-preservation issue of 1000-long operations. Yes indeed, preservation across context switch is reg file, not scratchpad.

> the other option is to throw away the entirety of the micro-op up until that
> point, and restart it once the context-switch returns.  i dislike these
> types of wasteful "solutions".... and, again, they actually have to have
> resources dedicated to implementing them.

In a multi-core system would it be possible to assign such long-running operations to a single core non-preemptably except for interrupts? That would possibly have such operations run to completion on a majority of attempts, but still respect system interrupt priority.
 
> *whatever* is proposed, there will be an evaluation cost, a design cost, an
> implementation cost and a testing cost.
>
> we do not have time for that on something that we have no input yet as to
> whether it is 1% used, 10% used or 30% used.

Ahh, went and re-read the bug report. I thought this was something that needed a solution in order to proceed; after refreshing my memory I see it is essentially an enhancement with great cost associated with it. Discard my comments.
Comment 24 cand 2020-12-12 07:49:55 GMT
Yeah, the full arbitrary-width is not viable. And I know you find fixed-width ops ugly, but in this case it may be necessary.

> ok so there *is* one possibility, there: a special vec4 only operation that takes 4x elements and multiplies them all together, targetting a non-vec4 dest.
>
> however these would be very specialist operations, that i would like to defer until "Phase 2" i.e. after the initial implementation of SV looping.

I realized a horizontal 4-element add would also be useful for the generic pixel pack case, since | and + are the same op when no bits overlap. It would replace the last three ORs, speeding it up by 1-2 clocks per pixel, plus the avoided stall (I'm assuming the horizontal 4-op can avoid the normal non-4-offset reg stall).

This then led to the opposite operation too, a 1-to-4 bit scatter with shift and AND. For pixels this is mainly in video encoding, but being a generic bit op it should find use in decompression too.

If these are too special for phase 1, that's fine, since it doesn't require any changes at the SV loop level. However the speed implications are great, and I do think we need some accel for this case.
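The "| and + are the same op when no bits overlap" observation can be checked directly in scalar C (the 8-8-8-8 field layout and function names are illustrative; lanes are assumed pre-shifted into disjoint bit positions):

```c
#include <stdint.h>

/* The conventional pack: three ORs over disjoint fields. */
static uint32_t pack_or(uint32_t a, uint32_t r, uint32_t g, uint32_t b) {
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* What a horizontal 4-element add over the same pre-shifted lanes
   computes: identical, because no carries cross field boundaries
   while each input fits in 8 bits. */
static uint32_t pack_add(uint32_t a, uint32_t r, uint32_t g, uint32_t b) {
    return (a << 24) + (r << 16) + (g << 8) + b;
}
```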
Comment 25 Luke Kenneth Casson Leighton 2020-12-12 11:35:24 GMT
(In reply to cand from comment #24)
> Yeah, the full arbitrary-width is not viable. And I know you find
> fixed-width ops ugly, but in this case it may be necessary.
> 
> > ok so there *is* one possibility, there: a special vec4 only operation that takes 4x elements and multiplies them all together, targetting a non-vec4 dest.
> >
> > however these would be very specialist operations, that i would like to defer until "Phase 2" i.e. after the initial implementation of SV looping.
> 
> I realized a horizontal 4-element add would also be useful for the generic
> pixel pack case, since | and + are the same op when no bits overlap. It
> would replace the last three ORs, speeding it up by 1-2 clocks per pixel,
> plus the avoided stall (I'm assuming the horizontal 4-op can avoid the
> normal non-4-offset reg stall).

ok, 64-bit down to 16 or even 8, no problem.  ish.  4x 64-bit down to 1x 64-bit, though, needs a micro-coding FSM to cover multiple operation generation.

it's... doable.

can you add this to the list of ops in the av_opcodes page? spec it up with some simple pseudocode even if it is 3 lines.


> This then lead to the opposite operation too, a 1-to-4 bit scatter with
> shift and AND.

oof.  ah.  this *might* be covered with bitmanip but again can you spec it up, add some pseudocode
Comment 26 cand 2020-12-12 16:58:49 GMT
Added under the msum point.
Comment 27 Luke Kenneth Casson Leighton 2020-12-12 17:15:36 GMT
(In reply to cand from comment #26)
> Added under the msum point.

star.

ok so *sigh* there's a couple of options here:

1) we actually add a 4-wide (actual SIMD) instruction.  yuk.
2) we add opcodes "horizontal-add", "horizontal-mul" which work *only* on
   vec2, vec3 and vec4.

so when you ask, "But can the SV loop increment the src reg # by 4? Hmm."
the answer is: yes... as long as it's a vec2, vec3 or vec4 as the src
regs.

the dest would remain non-vec2/3/4 which is what makes the operation extremely unusual.

the other one involves 32-bit src and 128-bit dest so i'll need to think about it.  

    rd   = (rs >> (0 * 8)) & ((1 << 8) - 1)
    rd+1 = (rs >> (1 * 8)) & ((1 << 8) - 1)
    rd+2 = (rs >> (2 * 8)) & ((1 << 8) - 1)
    rd+3 = (rs >> (3 * 8)) & ((1 << 8) - 1)

this particular pattern, it's actually using the SUBVL as the immediate in the shift.
Comment 28 cand 2020-12-13 07:45:29 GMT
> the other one involves 32-bit src and 128-bit dest so i'll need to think about it.

Please tell me which part is the hangup, so I'll know for future instrs.
- implementation complexity
- complexity when trying to write 128b in one clock (so do it over more clocks: <= 64: one clock, 128: two clocks, 256: four clocks)
- writing that much in one instr in general

For this specific case, 32-to-32 would have use, but not the majority. Handwaving 40% would be 32-to-32, where most would be 32-to-16 (aka read 32, write 64). I'd still like to have 32-to-32, but depends on what the tradeoff is.
Comment 29 Luke Kenneth Casson Leighton 2020-12-13 15:43:04 GMT
(In reply to cand from comment #28)
> > the other one involves 32-bit src and 128-bit dest so i'll need to think about it.
> 
> Please tell me which part is the hangup, so I'll know for future instrs.

the routing, because of the different sizes of src and dest, when it comes to getting data from different parts of the register file and to Function Units

right now the plan is to have 4-way stratification of the Function Units.  so anything that comes to *and from* a register number modulo 4, no problem.

thus there will be four *separate* sets of Function Units that cannot talk to each other.

https://libre-soc.org/openpower/sv/example_dep_matrices/

anything *not* matching that pattern has to go through *another* (completely separate) set of Function Units that do not have that same restriction on the reg numbering.

but

the routing in and out of that last set is going to be absolutely massive crossbars (4x4 64-bit, and we'll need up to *FIVE* of those as inputs in some cases)

by having this type of "stratified" Function Units the register file porting is kept down to sane numbers (4R1W) where if we didn't do that, we'd need insane numbers of regfile ports (12R8W) which requires huge amounts of specialist expertise to design and consume vast amounts of power.

plus: everything revolves around 64-bit maximum.

so because the numbering of src and dest do not match "modulo 4", the operations will have to go through the (bottlenecked) FUs.
Comment 30 Luke Kenneth Casson Leighton 2020-12-13 16:35:48 GMT
argh, argh, ok, i have an idea.  some of VSX's instructions are 2-in 1-out "merging" instructions, others are 1-in 2-out "splitters".

if those were added as actual instructions, which took:

* an array of vec2/3/4 as input/output
* an array of elements as output/input

and allowed this to be the *only* Function Unit that had inputs and outputs across all 4 of those "lanes" that you can see here:

   https://libre-soc.org/openpower/sv/example_dep_matrices/

and if it was done as an FSM (not a pipeline), then it could "accumulate" its inputs and push out outputs, without overwhelming the regfiles or the DMs, and not take up too much space.

it would cover the type of operation from comment #27 without affecting the scalar FU paths.

basically, operations from section 6.8.1 and 6.8.2 p254 and around there, v3.0B, but of the form:

   mv r3.vec2, r4, VL=5
   mv r4, r2.vec4, VL=8

elwidths still settable on both src and dest.
Comment 31 Luke Kenneth Casson Leighton 2020-12-13 16:54:39 GMT
something like this:

   https://libre-soc.org/openpower/sv/mv.vec/

the mapping in EXT19, when including swizzle, takes up the *entirety* of p1157 and half of p1159 v3.0B (!!)
Comment 32 Luke Kenneth Casson Leighton 2020-12-13 18:48:04 GMT
(In reply to Luke Kenneth Casson Leighton from comment #31)
> something like this:
> 
>    https://libre-soc.org/openpower/sv/mv.vec/

wow. in combination with swizzle, twin-predication, elwidth overrides, and saturation i had no idea until i looked at it how stupidly powerful and flexible this instruction is.

the only thing i just realised is, damn, it really does need to be joined by a two-in and two-out variant.

the reason is that you may have some left audio in one vector and right in another.  they need zipping up.  likewise for pixels.

l.
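The left/right "zipping" (and its two-out inverse) can be sketched in plain Python, with list indices standing in for registers; this is only an illustration of the intent, not the actual instruction semantics:

```python
def zip_vec2(left, right):
    """Interleave two equal-length channel vectors into vec2 pairs
    [L0, R0, L1, R1, ...]: the 'zipping up' of left/right audio."""
    out = []
    for l, r in zip(left, right):
        out += [l, r]
    return out

def unzip_vec2(interleaved):
    """The inverse: split an interleaved vector back into channels."""
    return interleaved[0::2], interleaved[1::2]
```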
Comment 33 Luke Kenneth Casson Leighton 2020-12-14 06:14:57 GMT
(implementation discussion and questions at #546, this bugreport is for spec discussion) 

Lauri, Jacob, see bug #546 i think we can do a Monster FSM based FU centred around a Shift Register that handles every variety of mv by bringing in 3 of every lane's regfile read ports and, depending on resources and implications for the DMs, 3 lots of all the write ports as well.

it'll be a handful, but also quite something to behold, in a scarily fascinating kind of way.  in principle though it is not that hard, just some timings to think through and make sure that regs are dropped off into output ports at the right time.

point is: i believe we can get the fan in/out data moving ops you were looking for, Lauri.
Comment 34 cand 2020-12-14 07:13:28 GMT
That's great and definitely useful. I'm just not sure what it means if additional functionality is desired for such ops, is that allowed or a no-no?
Comment 35 Luke Kenneth Casson Leighton 2020-12-14 07:39:59 GMT
(In reply to cand from comment #34)
> That's great and definitely useful. I'm just not sure what it means if
> additional functionality is desired for such ops, is that allowed or a no-no?

well... maybe.  saturate definitely.

as long as it is 1-in 1-out (not trying to zip or unzip) because that really is a new op; ordinary 1-in 1-out ops should actually fit, and not need new opcodes.

by that i mean, it is perfectly fine for an FU to do double duty as long as the regfile (operand) profile matches.

there is a bugreport outstanding to add OR, AND and XOR etc. to the ALU pipeline for example, but not the other way round because Logical does not have XER carry.

in this case: logical not, exts*, FP2INT, FPCVT, fabs, those all "fit" because they are 1in 1out and not a lot in the way of gates, each.

however anything that is arithmetic i am wary of due to the effect it would have on data routing.  arithmetic which takes 2 inputs and *reduces* that to 1 out is very different from something that takes 2 inputs and splices them to create the same number of bits in the output, if you know what i mean.

i am also tempted but also wary of anything that would take large immediates, although in theory it should be fine.

was there anything you had in mind?
Comment 36 cand 2020-12-14 07:43:03 GMT
I was just considering this for the previously mentioned pixel pack and unpack. For 8-bit packed mv.zip works as is, but for 5 or 10-bit, not so. Those would need the horizontal add for zip and shift-and for unzip.
Comment 37 Luke Kenneth Casson Leighton 2020-12-14 08:25:40 GMT
(In reply to cand from comment #36)
> I was just considering this for the previously mentioned pixel pack and
> unpack. For 8-bit packed mv.zip works as is, but for 5 or 10-bit, not so.
> Those would need the horizontal add for zip and shift-and for unzip.

well, something like bitmanip that for example takes the SUBVL subindex as one of the inputs: if you can write up some pseudocode then it would qualify

example:

   for i in range(SUBVL):
       dest[rt+i] = bitmanip(i, immed)
Comment 38 Luke Kenneth Casson Leighton 2020-12-14 15:21:16 GMT
right. ok.  so if there's this Monster FSM with 12 incoming operands and 12 outgoing ones (minimum) and at its core is a Shift Register then performing *basic* expansion and selection is not unreasonable.  addition and ORing is probably ok (multiply is almost certainly not).

if you can come up with some sort of algorithm, perhaps involving shift-and-mask that also takes the current SUBVL index into account, then it may be worthwhile deploying.  the algorithm should involve the *one* pixel (but take into account SUBVL)

one idea to get encoding space down: specify whether the selected bits are shifted up to the MSB or left in the LSB.

the basic idea here is to see what it takes to efficiently cover as wide a range of bit-based pack/unpack operations.  if we absolutely have to then the pack/unpack data should be in another register (like vector-shift left-variable, p265 v3.0B)
Comment 39 cand 2020-12-14 16:12:17 GMT
Is the formulation in comment #27 not enough? To have it in BE, the mask needs to be shifted to the other end, in addition to adjusting the shift.

The example from 3.0b page 266 has 128 bits of 7-bit data, and their vslv has this weird byte-short focus. It's unlikely to find such a stream of data, but to unpack that in our scheme I would pre-process it to vec4 units, so that one 32bit reg has 4*7=28 bits, suitably aligned.

There's basically two types of variable bit width data: pixel formats like 5-6-5 and compressed data ala Huffman. The former is rare, and the latter would need sequential parsing to know their lengths, aka no point in vectorizing that. So I don't think variable shift/mask amounts per-element are necessary (but do point out if there's some use I missed).
Comment 40 cand 2020-12-14 17:27:31 GMT
I remembered where IBM's example is probably from, databases. There you could find a stream of 7-bit ints. VSX would need adjusting between every 128-bit block, less often than our setup, but the three vector-variable-shift instrs there have a dependency on each other. Our unpack phase can go in parallel. So we won't lose in speed per clock too much, even if databases aren't a major usage point for libresoc.
Comment 41 Luke Kenneth Casson Leighton 2020-12-14 17:35:54 GMT
(In reply to cand from comment #39)
> Is the formulation in comment #27 not enough? 

reminder:

    rd = (rs >> 0 * 8) & (2^8 - 1)
    rd+1 = (rs >> 1 * 8) & (2^8 - 1)
    rd+2 = (rs >> 2 * 8) & (2^8 - 1)
    rd+3 = (rs >> 3 * 8) & (2^8 - 1)

answer: not entirely, i was thinking along the lines of a shift-immediate-with-mask that could be specified anywhere from say 4 to 15 bits, or perhaps even encoded in fax machine format (huffman encoding) to say which bits are extracted, which are skipped.

example: from LSB0 of input:

* skip N bits select M bits place in dest1
* skip O bits select P bits place in dest2
* ....

is this practical, useful, and general-purpose? i don't know.


> To have it in BE, the mask
> needs to be shifted to the other end, in addition to adjusting the shift.
> 
> The example from 3.0b page 266 has 128 bits of 7-bit data, and their vslv
> has this weird byte-short focus. It's unlikely to find such a stream of
> data, but to unpack that in our scheme I would pre-process it to vec4 units,
> so that one 32bit reg has 4*7=28 bits, suitably aligned.
> 
> There's basically two types of variable bit width data: pixel formats like
> 5-6-5 and compressed data ala Huffman. The former is rare, and the latter
> would need sequential parsing to know their lengths, aka no point in
> vectorizing that.

well that's the beauty of having a large shift register to work on, within the FSM.  we *could* conceivably do huffman encoding or other schemes, on input from up to 12 (!) 64 bit registers.

bear in mind, it's going to be a BIG shift register, minimum 4x 64 bits possibly longer, because it needs to be able to take in inputs from up to 4x 64 bit registers at the same time.  it also needs to output a minimum of 4x 64 bit operands per clock as well.

as long as we do not go completely mad, and also let the processing be "local", it should be fine.  by "local" i mean that any kind of processing is limited to a 64 bit "window" onto the 256+ bit shift register.  if that "local" processing needs to get 128 bits from the SR it *must* wait for another clock cycle as the data shifts along to it.

the alternative would be to allow the "local" processing to reach out, with MUXes, to take in 128 bits or above and if there are 4 of those local 64bit units that's way too many gates.

hm just looking at this
https://en.m.wikipedia.org/wiki/Huffman_coding

it requires a lookup table.  hmm reluctant to include that. 

> So I don't think variable shift/mask amounts per-element
> are necessary (but do point out if there's some use I missed).

i honestly don't know, i'm throwing ideas out at the moment and floundering a bit :)
Comment 42 Luke Kenneth Casson Leighton 2020-12-14 17:58:14 GMT
(In reply to cand from comment #40)
> I remembered where IBM's example is probably from, databases. There you
> could find a stream of 7-bit ints. VSX would need adjusting between every

interesting.

i must apologise, i think i was looking for a different VSX instruction, one where, unlike rldimi, the shift and mask data were taken from a register:

   sh = RA[0..7]
   mask = (1<<RA[8..15]) -1

   RT |= (RB & mask) << sh

that sort of thing.  that's a last resort, i'd prefer an immediate but if it needs greater than say 12 bits then a register it has to be.
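A runnable model of that register-sourced shift/mask insert (LSB0 field numbering assumed purely for illustration):

```python
def insert_field(rt, ra, rb):
    """RA carries the shift amount in bits 0-7 and the mask width in
    bits 8-15; the masked field of RB is OR-inserted into RT."""
    sh = ra & 0xFF
    mask = (1 << ((ra >> 8) & 0xFF)) - 1
    return rt | ((rb & mask) << sh)
```

e.g. with shift 4 and width 8 encoded in RA, the low byte of RB lands at bits 4-11 of the result.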

if however you are saying that even trying general purpose pixel-extraction is not worth it (to support arbitrary future pixel encodings like 10 10 10, or... i don't know) then i am happy to drop this.

one possibility though, just specify the start points (in bits) relative to each other, then follow up with a separate mask and shift op

    vec.extract(5 6 5)
    AND with immediate
    shift

the idea being here, the vec.extract takes:

* first 5 bits and places it in ra
* next 6 bits and places in ra+1
* next 5 into ra+2

a followup shift will move those up into the MSBs, the 5 shifted by 3, 6 by 2.

then if you want 10 10 10 the same thing must be done, except there it would be 10 10 10 discard2
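For concreteness, a Python sketch of what the hypothetical vec.extract would do to one packed pixel (the name and LSB-first field order are assumptions, not a spec):

```python
def vec_extract(pixel, widths):
    """Split `pixel` into fields of the given bit widths, starting
    from the LSB: widths (5, 6, 5) pulls B, G, R out of a 565 pixel.
    Each field lands in the LSBs of its own destination element;
    a follow-up shift would move them up to the MSBs as described."""
    fields = []
    for w in widths:
        fields.append(pixel & ((1 << w) - 1))
        pixel >>= w
    return fields
```

For 2-10-10-10 the same call with widths (10, 10, 10) naturally discards the top 2 bits.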

now, as long as the massive 256+ bit SR can be told, "move everything along by 8 16 24 32 48 64" and still be useful, then i'm ok with that.

i *believe* it would do the job, to be able to have arbitrary sized data dropped in and out of it, but if we try going to a shift granularity of 1 bit on something that long it's going to get hairy.
Comment 43 Jacob Lifshay 2020-12-14 18:00:22 GMT
(In reply to cand from comment #39)
> There's basically two types of variable bit width data: pixel formats like
> 5-6-5 and compressed data ala Huffman. The former is rare, and the latter
> would need sequential parsing to know their lengths, aka no point in
> vectorizing that. So I don't think variable shift/mask amounts per-element
> are necessary (but do point out if there's some use I missed).

We will need variable shift-left/bitwise-shift-right/arithmetic-shift-right and maybe rotates too for GPU applications, so since we will already have them, you might as well use them.
Comment 44 Luke Kenneth Casson Leighton 2020-12-14 18:33:25 GMT
(In reply to Jacob Lifshay from comment #43)

> We will need variable shift-left/bitwise-shift-right/arithmetic-shift-right
> and maybe rotates too for GPU applications, so since we will already have
> them, you might as well use them.

indeed: however there are two different places.

1)  srcelwidth == destelwidth.

the normal place is in fixed-width Vector inputs.  64 bit RA/RB in or 2x32-bit SIMD RA/RB in, 64 bit out or 2x32-bit SIMD out.

applying standard variable shift-right/etc. at that point - when srcwidth == destwidth - i have no problem with putting down large blocks, QTY 4 or 8 of these, when the input data RA % 4 == RB % 4 == RT %4.

2)  srcelwidth != destelwidth

this is where the Monster FSM with the 256+ bit Shift Register comes into its own.  here it's *not* ok to drop a massive set of gates down without some Serious Thought going into it.

the idea here is to basically have some sort of subset of shift/mask that can cope with and adapt to varying pixel in/out and audio in/out patterns, but *without* having to design a full suite of operations that basically duplicates the entire OpenPOWER rotate ISA.


but, that aside: yes if (2) doesn't work out then the fallback option would be to at least get the data widened/expanded, use parallel vector shift/mask (same src/dest width)
Comment 45 cand 2020-12-14 18:39:20 GMT
> example: from LBS0 of input:
> 
> * skip N bits select M bits place in dest1
> * skip O bits select P bits place in dest2

I can't think of a good use for this capability. That's approaching office document parsing, highly complex and in little need of vector speedup.

> it requires a lookup table.  hmm reluctant to include that. 

The Huffman tree is data-dependent, but not necessarily a big table. Canonical Huffman can be encoded with just 4*maxlen aux data. If you were encoding bytes, 4*16*2 = 128 bits of aux data would let you do that. Canonical decoding has similar low requirements.

A huffman encoding instruction wouldn't be that useful though. Media uses more advanced methods, varying by codec. POWER has its gzip accelerator that packs and unpacks with ludicrous speed, but it's gzip, not huffman.

> if however you are saying that even trying general purpose pixel-extraction is 
> not worth it (to support arbitrary future pixel encodings like 10 10 10, or... 
> i don't know) then i am happy to drop this.

Useful future formats are likely to be same-width, like 10-10-10. My scheme would handle those. It's the per-element shift-mask, for different-width elements, that may not have much use.

If it's easy for you to implement and doesn't take much hw space, the cost is just one more opcode. I'm not against it per se, I just see little use for it.
Comment 46 Jacob Lifshay 2020-12-14 18:43:37 GMT
Another note, there are highly likely to be pixel decoding/encoding load/stores since those are needed for GPU framebuffer read/write, texture read, and format interconversion, see the Vulkan spec for the full list:
https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap43.html#features-required-format-support

We are also planning on supporting planar and packed YUV formats.
Comment 47 Jacob Lifshay 2020-12-14 18:44:58 GMT
For texture decoding, we need to be able to dynamically select at run-time between formats.
Comment 48 Jacob Lifshay 2020-12-14 18:46:21 GMT
Note that 1555 and 565 are both in the list of formats required by Vulkan.
Comment 49 Luke Kenneth Casson Leighton 2020-12-14 19:28:04 GMT
(In reply to Jacob Lifshay from comment #46)
> Another note, there are highly likely to be pixel decoding/encoding
> load/stores

eek! really? :)  well... like in the original POWER1 i don't mind too much if that's done by micro-coding (LD going straight into this new SR FSM), although honestly the thought of fitting both LDST semantics and encoding schemes for pixels *and* getting that into SV does not fill me with great glee :)

if they can be kept separate without there being a huge penalty for needing to use large chunks of the regfile for the raw data that would be a relief.

> since those are needed for GPU framebuffer read/write, texture
> read, and format interconversion, see the Vulkan spec for the full list:
> https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap43.
> html#features-required-format-support
> 
> We are also planning on supporting planar and packed YUV formats.

okaay.  so that's a hell of a long list, including 4-bit RGB.  i think this is where the shift-range-extract idea was stemming from: being able to get at 4-4-4 or 5-6-5 or 2-10-10-10 and throw them into  the individual elements of a vec3 or vec4 (and back again).
Comment 50 cand 2020-12-15 07:40:45 GMT
> Note that 1555 and 565 are both in the list of formats required by Vulkan.

I think you misread the tables, like I did at first. Only the ones with a tickmark, cross or double-cross are required, the formats with nothing are optional. So 5-6-5 is mandatory, 5-5-5-1 not.

> 2-10-10-10

This does not require the variable widths; as you use a 32b register and shift out 10*3 bits, the last two are naturally handled.

I suppose the question now is whether 5-6-5 (and possible other different-width formats required) should be hardcoded or used through the variable-width instr.
Comment 51 cand 2020-12-15 08:00:27 GMT
Since we only do up to vec4, the variable widths can be encoded in a 16-bit immediate, 4-4-4-4, if limited to <= 16. Alternatively a 32-bit immediate would allow the full 64 range.
Comment 52 Jacob Lifshay 2020-12-15 08:01:40 GMT
(In reply to cand from comment #50)
> > Note that 1555 and 565 are both in the list of formats required by Vulkan.
> 
> I think you misread the tables, like I did at first. Only the ones with a
> tickmark, cross or double-cross are required, the formats with nothing are
> optional. So 5-6-5 is mandatory, 5-5-5-1 not.

VK_FORMAT_A1R5G5B5_UNORM_PACK16 is mandatory. I just re-checked.

> > 2-10-10-10
> 
> This does not require the variable widths; as you use a 32b register and
> shift out 10*3 bits, the last two are naturally handled.
> 
> I suppose the question now is whether 5-6-5 (and possible other
> different-width formats required) should be hardcoded or used through the
> variable-width instr.

Since we need formats to be dynamically-selectable (effectively required by Vulkan since we want to avoid recompiling shaders or having too many variant shaders) the texture load instruction(s) will need to support all the different formats, so, we might as well hard-wire the decoders for those formats at least for the texture load instruction, maybe also for framebuffer read/write and copy ops too. Those can probably share instructions.

So, I think variable shifts will be needed for other stuff, but we should just use the dedicated texture read instructions here.
Comment 53 cand 2020-12-15 08:25:07 GMT
Oh, they have 5-5-5-1 and 1-5-5-5 separately listed. My bad.
Comment 54 Luke Kenneth Casson Leighton 2020-12-15 14:30:43 GMT
reply moved to bug #546
Comment 55 Luke Kenneth Casson Leighton 2020-12-15 14:35:46 GMT
(In reply to cand from comment #51)
> Since we only do up to vec4, the variable widths can be encoded in a 16-bit
> immediate, 4-4-4-4, if limited to <= 16. Alternatively a 32-bit immediate
> would allow the full 64 range.

ok i am glad you said this: we don't have space in the ISA for 16 bit immediates.  6 bits EXTNNN plus 16 bits imm is 22, leaving only 10 bits: enough for 2x 5-bit regs, but an *entire* precious Major Opcode would be taken up.

want another op, take another Major Op.

32 bit immediates unfortunately require a 64 bit instruction and there is no (sane) way to include SV on top of that.  or, there might be but i don't want to explore that rabbithole unless we are forced to.
Comment 56 cand 2020-12-15 15:32:56 GMT
The immediate required for same-width is just 5/7 bits. 1 bit for BE/LE, 4 bits for 1-16, 6 for full 64 range.
Comment 57 Luke Kenneth Casson Leighton 2020-12-15 17:08:11 GMT
(In reply to cand from comment #56)
> The immediate required for same-width is just 5/7 bits. 1 bit for BE/LE, 4
> bits for 1-16, 6 for full 64 range.

yeah i think going beyond 1-16 is not sane to try to fit, here.

... click... ah HA!

   rlwimi RA,RS,SH,MB,ME (Rc=0)
   rlwimi. RA,RS,SH,MB,ME (Rc=1)

https://libre-soc.org/openpower/isa/fixedshift/

half of those are 1-in 1-out! that makes them candidates for adding into the Monster Shift Register FSM after all! now we are cooking with gas.

right.

so two things:

1) if the shift-mask schedule for each Vulkan Texture can be encoded into an algorithm, sequentially programming different values of sh, mb and me into the shifters, we have all VK Texture ops covered in micro-coding style.

   (the only complexity here is 12bit
    but this is not insurmountable)

2) if we can work out how to augment the shift-mask schedule by taking the SUBVL index as some sort of augmenter, we can probably cover the majority of pixel/audio formats there.

methods include:

2a) macro op fusion.  R gets its own rlwimi, G gets its own rlwimi, B likewise, A likewise

    (that's a big macro op sequence though)

2b) an SPR specifying the augmentation schedule

    (an SPR as input to the Monster FSM
     is kinda ok)

2c) somehow polluting or giving alternative meaning to the 24 SV prefix bits.

    (you can tell i don't like this one
     it means the SV Prefix Decoder needs
     to know the operation)

2d) an algorithmic augmentation which otherwise keeps the exact same sh, mb, me constants
 

    for j in range(SUBVL):
        n <- SH * (j*something)
        r <- ROTL32((RS)[32:63], n)
        m <- MASK(MB+32, ME+32)
        RA <- r&m | (RA) & ¬m

something that offsets sh, mb and me by some amount.

OR

2e) because we know that the pixels are in 16 bit (1555) then from the elwidth we *know* that high bits of sh, mb and me will not be used.  it *might* be possible to use those to provide the "augmentation"

    (i like this route although it will
     take time to analyse. sigh)
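The ROTL32/MASK primitives used in the 2d sketch can be modelled in Python; note this deliberately simplifies the Power ISA MSB0 bit numbering down to a plain pre-computed LSB0 mask, so it is an illustration of the rotate-and-insert step only:

```python
def rotl32(x, n):
    """32-bit rotate left (model of ROTL32)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rlwimi_like(ra, rs, sh, mask):
    """Rotate RS left by sh, then insert the masked bits into RA,
    leaving the unmasked bits of RA untouched: r&m | RA & ~m."""
    r = rotl32(rs & 0xFFFFFFFF, sh)
    return (r & mask) | (ra & ~mask & 0xFFFFFFFF)
```

e.g. inserting a 5-bit field at bit 11 is `rlwimi_like(ra, rs, 11, 0xF800)`: exactly the kind of per-channel schedule a 1555/565 unpack would program sequentially.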
Comment 58 Luke Kenneth Casson Leighton 2020-12-19 17:37:32 GMT
(In reply to cand from comment #12)
> Commented on the horizontal mapreduce.

i got it.  i gotitigotit.

ok so if it is acceptable to use a destination *vector* for the mapreduce, then all elements above 0 may be used to compute and store intermediary results.

this gets rid of the huge problem associated with performing large numbers of non-reentrant iterations, discarding work that was half way completed just because of an interrupt.

also, initially i thought it would be a problem to store sufficient state to manage this, but then i realised that in the STATE SPR there are *two* offsets, srcoffs and destoffs, the 2nd normally being used for twin predication.

of course, 2pred can't have mapreduce done on it because it is 1 src 1 dest.

therefore destoffs is free for use.

ta-daaa :)
Comment 59 Luke Kenneth Casson Leighton 2020-12-19 18:16:50 GMT
question: there are two modes for mapreduce.  RA==RB, and RA!=RB.

RA==RB can be taken to mean

   RT=add(RA[0], add(RA[1], ..,  add(RA[VL-2], RA[VL-1]))))

but what about when RA!=RB?
Comment 60 cand 2020-12-20 07:16:06 GMT
The most useful interpretation would be for RA to be the giant vector, and RB to be a single vec4 (or whatever size is used).
Comment 61 Luke Kenneth Casson Leighton 2020-12-20 12:49:15 GMT
(In reply to cand from comment #60)
> The most useful interpretation would be for RA to be the giant vector, and
> RB to be a single vec4 (or whatever size is used).

this might be the case already (without needing to add any extra instructions).  here's what the regfiles, conceptually, look like:

    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];

let us say that RA is set to elwidth=32 and RB is set to elwidth=8.
then this will be the VL loop:

    for i in range(VL):
         int_regfile[RA].i[i] = some_op(int_regfile[RB].b[i])

where some_op may involve clamping/saturation.  you see here that
elwidth=8 refers to the "b" (byte) array of the union, and that elwidth=32
refers to the "i" (word) array of the union?

the result would be:

RA+0 32 bit:   RB+0 byte     0x00     0x00      0x00
RA+1 32 bit:   RB+1 byte     0x00     0x00      0x00
RA+2 32 bit:   RB+2 byte     0x00     0x00      0x00



where vec4 is involved it's a little more like this (RA elwidth=32,
RB elwidth=8):

    for i in range(VL):
         if predicate_bit_not_set(i) continue
         uint8_t *start_point = (uint8_t*)&int_regfile[RA].i[i]
         for j in range(SUBVL): # vec4
              start_point[j] = some_op(int_regfile[RB].b[i*SUBVL + j])

this would produce:

RA+0 32 bit:   RB+0 byte     RB+1 byte     RB+2 byte   RB+3 byte
RA+1 32 bit:   RB+4 byte     RB+5 byte     RB+6 byte   RB+7 byte

where each RB+N byte would have been run through "some_op()".


if that were a byte-array-vec3 into a word-array, you end up with the 4th byte zero'd out in the 32-bit RA destination:

RA+0 32 bit:   RB+0 byte     RB+1 byte     RB+2 byte   0x00
RA+1 32 bit:   RB+3 byte     RB+4 byte     RB+5 byte   0x00
RA+2 32 bit:   RB+6 byte     RB+7 byte     RB+8 byte   0x00




is that clear at all?  if not i can write it out or find... i'm sure
i've done this before... ah! here:

   https://libre-soc.org/simple_v_extension/appendix/#load_example

although that's a "LOAD" it's the same principle (because we are
referring to MVs, which are also twin-predicated and have both
a src elwidth over-ride and a dest elwidth override).
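The byte-to-word tables above can be condensed into a small Python model (plain lists stand in for the regfile union; a sketch of the concept, not the official pseudocode):

```python
def mv_widen(src_bytes, vl, subvl=4):
    """Model of the elwidth-override mv: read vl groups of subvl
    source bytes (elwidth=8) and pack each group into one 32-bit
    destination element (elwidth=32), little-endian within the
    element, zero-padding the top byte when subvl == 3 (vec3)."""
    out = []
    for i in range(vl):
        word = 0
        for j in range(subvl):
            word |= src_bytes[i * subvl + j] << (8 * j)
        out.append(word)
    return out
```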
Comment 62 cand 2020-12-20 17:28:03 GMT
That's the normal SV setup, while we were discussing horizontal add. For the horz-add case, VL would be big for RA and small for RB.
Comment 63 cand 2020-12-20 17:28:23 GMT
And RT is the dest usually, not RA?
Comment 64 Luke Kenneth Casson Leighton 2020-12-20 20:09:45 GMT
(In reply to cand from comment #62)
> That's the normal SV setup, while we were discussing horizontal add. For the
> horz-add case, VL would be big for RA and small for RB.

.... *click*.  you mean map-reduce.  the word "horizontal" confused me, different term.  got it.  yes, sorry, i completely lost the context, sorry.  (small request: could you use the "reply" button, it will do the usual email ">" thing. i would probably still have got it wrong, with the word "horizontal", but hey... :)

ok so let's go over this.  firstly: VL is the outer for-loop, it doesn't change for RA, RB or RT.  did you mean "big elwidth for RA, small elwidth for RB" instead?

to cover vec_msum, i've added a "reduce" mode to SV.  this:

     mul RT, RA, RB reducemode,VL=3

would result in:

     RT = (RA[0]*RB[0]) * (RA[1]*RB[1]) * (RA[2]*RB[2])

but... you're looking for a *different* elwidth on RA from RB, is that right?
if so, that's.... yeah, very tough to fit into the SV paradigm, because
elwidths apply in "arithmetic" cases to the whole operation, and in "MV"
cases you have a src elwidth and a dest elwidth.

but, arithmetic operations with different elwidths? this requires three:

* src1 elwidth
* src2 elwidth
* dest elwidth

and that's too much.

it *might* however be ok to jam in that gather-add you suggested:

     gather-add: d = a.x + a.y + a.z + a.w

this will be hair-raising but i think it's doable, based on a bit in the Mode
to say "vec2/3/4 reduce is in effect".

would this work?

     for i in range(VL):
          iregs[RT+i] = 0
          for j in range(SUBVL):
              iregs[RT+i] += iregs[RA+i*SUBVL+j]
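That loop pasted into runnable Python (lists standing in for the integer regfile) behaves as intended:

```python
def gather_add(iregs, rt, ra, vl, subvl):
    """SUBVL-reduce ('gather-add'): each destination element RT+i
    receives the sum of one whole vec2/3/4 group of RA."""
    for i in range(vl):
        iregs[rt + i] = 0
        for j in range(subvl):
            iregs[rt + i] += iregs[ra + i * subvl + j]
    return iregs
```

with VL=2, SUBVL=4 and RA pointing at 8 consecutive elements, RT+0 and RT+1 each get one 4-element sum.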
Comment 65 cand 2020-12-21 07:49:05 GMT
(In reply to Luke Kenneth Casson Leighton from comment #64)
> but... you're looking for a *different* elwidth on RA from RB, is that right?
> if so, that's.... yeah, very tough to fit into the SV paradigm, because
> elwidths apply in "arithmetic" cases to the whole operation, and in "MV"
> cases you have a src elwidth and a dest elwidth.

Nono, different VL, or rather mod-VL for RB. But that was only "what would be the most useful interpretation for when RA != RB"; if it's not worth it, then we can outlaw them being different in reduce mode.

RA=RB
Just gather-add or gather-mul the elements together. Not twice.

RA!=RB
RA is gathered, RB is added/muled on top as a single vec4, not an array of vec4s like RA. If too much trouble, then disallow RA!=RB.

RT+0 = (RA+0 + RA+4 + RA+8 ... RA+(VL-4)) + RB+0
RT+1 = (RA+1 + RA+5 + RA+9 ... RA+(VL-3)) + RB+1
RT+2 = (RA+2 + RA+6 + RA+10 ... RA+(VL-2)) + RB+2
RT+3 = (RA+3 + RA+7 + RA+11 ... RA+(VL-1)) + RB+3
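Sketching that interpretation in Python (lists for registers, lane count fixed at 4 for illustration):

```python
def strided_reduce_add(ra_vec, rb_vec4):
    """RA!=RB reduce interpretation: sum RA lane-wise with stride 4
    (one running sum per vec4 lane), then add the single vec4 RB
    on top of the four lane totals."""
    out = []
    for lane in range(4):
        out.append(sum(ra_vec[lane::4]) + rb_vec4[lane])
    return out
```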

> would this work?
> 
>      for i in range(VL):
>           iregs[RT+i] = 0
>           for j in range(SUBVL):
>               iregs[RT+i] += iregs[RA+i*SUBVL+j]

Yes, looks like it'd result in near the same op as horz-add.
Comment 66 Luke Kenneth Casson Leighton 2020-12-21 13:23:54 GMT
(In reply to cand from comment #65)

> 
> Nono, different VL, or rather mod-VL for RB. But that was only "what would
> be the most useful interpretation for when RA != RB, if it's not worth it,
> then we can outlaw them being different in reduce mode.

VL definitely cannot be different for RA RB or RT, it is a global for-loop.  the closest we can get away with is (and for mv this needed new instructions) allowing one src to be vec2/3/4 and the other to be vec1.
 
> RA=RB
> Just gather-add or gather-mul the elements together. Not twice.

yehyeh.  res = RA[0]+RA[1]...RA[VL-1]

> RA!=RB
> RA is gathered, RB is added/muled on top as a single vec4, not an array of
> vec4s like RA. If too much trouble, then disallow RA!=RB.
> 
> RT+0 = (RA+0 + RA+4 + RA+8 ... RA+(VL-4)) + RB+0
> RT+1 = (RA+1 + RA+5 + RA+9 ... RA+(VL-3)) + RB+1
> RT+2 = (RA+2 + RA+6 + RA+10 ... RA+(VL-2)) + RB+2
> RT+3 = (RA+3 + RA+7 + RA+11 ... RA+(VL-1)) + RB+3

mmm it would be easier just to split this into 2 separate adds.  the followup involves RB.  although, because the result is now a scalar vec4, adding RB.vec4 to RT.vec4 would need an extra instruction to change VL.  however this is probably needed anyway.



> > would this work?
> > 
> >      for i in range(VL):
> >           iregs[RT+i] = 0
> >           for j in range(SUBVL):
> >               iregs[RT+i] += iregs[RA+i*SUBVL+j]
> 
> Yes, looks like it'd result in near the same op as horz-add.

ok great.  it's a general purpose way to express that SIMD-horizontal-add discussed earlier.  horiz-vec3 of course will be a pain but horiz-vec2 and vec4 should fit cleanly.
Comment 67 Luke Kenneth Casson Leighton 2020-12-26 18:41:45 GMT
lauri i found the comment about IEEE754 saturate as an option.  yes it is, however there does actually already exist an FP SPR flag dedicated to "less accuracy".


Float estimates

vec_expte - float 2^x
vec_loge - float log2(x)
vec_re - float 1/x
vec_rsqrte - float 1/sqrt(x)
The spec says the max relative inaccuracy is 1/4096.

In conjunction with the FP SPR "accuracy" bit, these could be done by assigning meaning to the "sat mode" SVP64 bits in an FP context: 0b00 is IEEE754 FP, 0b01 is 2^-12 (1/4096) accuracy for FP32. These can be applied to standard scalar FP ops.
Comment 68 Luke Kenneth Casson Leighton 2021-01-18 15:22:28 GMT
i added a summary at the top of the wiki page, fascinatingly it is pretty minimal.

also i noted that FP estimates can be done by changing the meaning of "single precision" v3.0B to be "half of bitwidth precision".  then when elwidth=32 using single^H^H^H^H^H half precision operations these are effectively FP "estimates" because FP16 accuracy operations are stored in FP32 format.