several different types of Reduce Modes need to be defined and written up:

* Scalar
* Tree-reduction
* Sub-Vector Horizontal
see https://bugs.libre-soc.org/show_bug.cgi?id=213#c111
depending on how it goes, we may want to reverse the order of the tree-reduction to match what rust might define as the canonical tree-reduction order -- it seems to work better on arm. https://github.com/rust-lang/portable-simd/issues/235#issuecomment-1027331064
there is already a "reverse gear" bit in SVP64
(In reply to Luke Kenneth Casson Leighton from comment #3)
> there is already a "reverse gear" bit in SVP64

i don't think that does what we might need here... reversed means it
reduces like so (@ are adds; this one is more efficient on arm):

    0    1    2    3    4    5    6    7
    |    |    |    |    |    |    |    |
    @----|----|----|----/    |    |    |
    |    @----|----|---------/    |    |   distance=4
    |    |    @----|--------------/    |
    |    |    |    @-------------------/
    |    |    |    |
    @----|----/    |                       distance=2
    |    @---------/
    |    |
    @----/                                 distance=1
    |

the original (current SimpleV) reduce goes distance=1,2,4...

    0    1    2    3    4    5    6    7
    |    |    |    |    |    |    |    |
    @----/    @----/    @----/    @----/   distance=1
    |    .    |    .    |    .    |
    @---------/    .    @---------/        distance=2
    |    .    .    .    |
    @-------------------/                  distance=4
    |
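[editor's note: a small Python sketch, illustrative only, of the two orderings
in the diagrams above: both combine the same elements but group them
differently, which is exactly what matters for floating-point accuracy and
reproducibility.]

    def reduce_simplev(vec):
        # original SimpleV order: distance = 1, 2, 4, ...
        v = list(vec)
        d = 1
        while d < len(v):
            for i in range(0, len(v), 2 * d):
                if i + d < len(v):
                    v[i] = v[i] + v[i + d]   # the '@' in the diagrams
            d *= 2
        return v[0]

    def reduce_halving(vec):
        # arm-style order: distance = vl/2, vl/4, ..., 1
        v = list(vec)
        d = len(v) // 2
        while d >= 1:
            for i in range(d):
                v[i] = v[i] + v[i + d]
            d //= 2
        return v[0]

    vec = [1, 2, 3, 4, 5, 6, 7, 8]
    print(reduce_simplev(vec))   # 36, grouped ((1+2)+(3+4))+((5+6)+(7+8))
    print(reduce_halving(vec))   # 36, grouped ((1+5)+(3+7))+((2+6)+(4+8))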
additional conversation here: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice/near/270315967
public archive: https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#270315967
me on github:

> turns out Arm's faddv instruction which does reduction uses a pattern
> like the current SimpleV and what I originally proposed in this thread
> (distance=1,2,4 from diagram) -- except that it only matches when the
> vector length is a power of 2.
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > there is already a "reverse gear" bit in SVP64
>
> i don't think that does what we might need here...

the bit is available so can be made to mean anything that is necessary
or needed. https://libre-soc.org/openpower/sv/normal/

    0-1  2  3   4   description
    00   0  dz  sz  normal mode
    00   1  0   RG  scalar reduce mode (mapreduce), SUBVL=1
    00   1  1   /   parallel reduce mode (mapreduce), SUBVL=1

so there's one bit spare which can still be made available for a
parallel-reduce reverse-gear mode.

> reversed means it reduces like so (@ are adds;
> this one is more efficient on arm):

the scalar-reduce is a misnomer: it basically - note the careful/clumsy
negation - "prevents the vector-loop from stopping just because the
result is a scalar".

normally when a result is a scalar, the vector-loop terminates at the
very first element (because it's a scalar, duh). however in
scalar-reduce mode the looping does *not* stop, and so you can do this:

    ADD r3.s, r3.s, r10.v

and that will do:

    for i in range(VL):
        iregs[3+i] = iregs[3+i] + iregs[10+i]

the scalar-reduce bit tells the hardware-loop *NOT* to stop just because
there has been one result dropped into r3 already. something like this
would cover both non-reduce and reduce:

    for i in range(VL):
        iregs[3+i] = iregs[3+i] + iregs[10+i]
        if not scalar_reduce_mode:
            if RT is scalar:
                break

the other trick is partial sum: by specifying the same register offset
by one in src and dest you get the *effect* of a reduction:

    ADD r4.v, r3.v, r10.v

    for i in range(VL):
        iregs[4+i] = iregs[3+i] + iregs[10+i]

thus you end up with *partial results* (partial sum) and, oh look, the
last element happens to be the full reduced sum.

reverse-gear is *needed* here because you might want the result to be at
the start:

    for i in reversed(range(VL)):
        iregs[4+i] = iregs[3+i] + iregs[10+i]

also you might want the reverse-gear even when the result is a scalar
because, by starting at the opposite end of the results, FMACs (or
FADDs) may end up producing different results depending on the accuracy
*and* order of the numbers being operated on.

note that the pseudocode above says how things must *look* as far as
*programmers* are concerned. it does *NOT* dictate how the hardware is
actually implemented, just that it must have the same net effect *as*
the pseudocode. if hardware implementors discover that the above
pseudo-code algorithms can be implemented as parallel tree-reduce
*great* [but what they can't do is implement a hardware algorithm that,
under certain circumstances, produces a different result from that which
would be produced by the pseudo-code]

so that's scalar-reduce. moving on to parallel-reduce... here's the
algorithm (bottom of the page):
https://libre-soc.org/openpower/sv/svp64/appendix/

with that bit spare (bit 4) it can be allocated for
anything-that-is-needed: reverse-gear, reverse-mapping, or
banana-fishing. something useful preferably.

my concern with that algorithm is that it critically relies on creating
state (the vi nonpredicated array) where there's actually no space to
put that state if there is an interrupt to be serviced in the middle of
the reduction operation (all SVP64 operations are currently designed and
required to be re-entrant). with vi being indices, that's a hell of a
lot of state.

HOWEVER i *believe* it may be possible to shoe-horn this algorithm into
the SVREMAP system.
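[editor's note: a minimal executable sketch of the two tricks described above,
modelling the register file as a plain Python list. VL and the register
numbers are illustrative only; the source vector is placed at r20 purely to
avoid overlapping the partial-sum destination.]

    VL = 8
    iregs = [0] * 32
    iregs[20:20 + VL] = [1, 2, 3, 4, 5, 6, 7, 8]    # r20.v example source

    # trick 1: scalar accumulator - sv.add/mr r3.s, r3.s, r20.v
    # r3 is scalar so it does not step; /mr keeps the loop going anyway
    iregs[3] = 0
    for i in range(VL):
        iregs[3] = iregs[3] + iregs[20 + i]
    print(iregs[3])            # 36, the full reduction

    # trick 2: partial sums - sv.add r4.v, r3.v, r20.v
    # dest is offset by one from src1, so each element picks up the
    # previous partial result; the last element is the full sum
    iregs[3] = 0
    for i in range(VL):
        iregs[4 + i] = iregs[3 + i] + iregs[20 + i]
    print(iregs[4:4 + VL])     # [1, 3, 6, 10, 15, 21, 28, 36]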
jacob could you please try morphing the algorithm into something that
looks pretty much exactly like this:

    def reduction_yielder_for_RA():
        for something in something:
            yield offset_idx_for_RA

    def reduction_yielder_for_RB():
        for something in something:
            yield offset_idx_for_RB

    def reduction_yielder_for_RT():
        for something in something:
            yield offset_idx_for_RT

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()
        RB_offs = yield reduction_yielder_for_RB()
        RT_offs = yield reduction_yielder_for_RT()
        regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]

note that that is quite literally the standard SVP64 vector loop, but oh
look! there happen to be some offsets added, how did those get there? :)

the separation of the offsets from the computation itself will fit
directly into the SVREMAP system, and allow the front-end multi-issue
engine to run (independently) creating multiple sequences of offsets in
one clock cycle, to hit the back-end parallel SIMD ALUs with as many
operations as they can handle.

SVREMAP *can* store state (and is saved on an interrupt context-switch)
so you *can* do analysis of the predicate bits and
do-something-else-because-of-them().

but please please note jacob that what we *cannot* have is *two*
different operations (unless one of them can be "faked-up" by one of the
operands being zero or one). what we *cannot* have is:

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()
        RB_offs = yield reduction_yielder_for_RB()
        RT_offs = yield reduction_yielder_for_RT()
        if (some_condition)
            regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]
        else
            regs[RT+RT_offs] = regs[RA+RA_offs]

the *only* way that could conceivably be done - and even this is a bit
iffy:

    if (some_condition)
        src2 = regs[RB+RB_offs]
    else
        src2 = 0
    regs[RT+RT_offs] = regs[RA+RA_offs] + src2

i.e. use zeroing (which would have to be assumed/implied, because
there's just no room in the SV RM Mode table for another bit). it would
still be iffy for multiply-reduce because the definition of zeroing is,
well, to put a zero in, not a one.
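[editor's note: a runnable sketch of the "offsets separated from the
computation" pattern. it collapses the three per-operand yielders into one
generator returning (RT, RA, RB) element offsets, and uses the simple
distance=1,2,4 tree schedule as a stand-in for whatever REMAP would actually
generate; the function names are hypothetical.]

    def tree_schedule(vl):
        """yield (rt_offs, ra_offs, rb_offs) for a power-of-two vl,
        using the simple distance=1,2,4 schedule."""
        step = 1
        while step < vl:
            for i in range(0, vl, 2 * step):
                if i + step < vl:
                    yield (i, i, i + step)
            step *= 2

    def sv_add_with_remap(regs, RT, RA, RB, vl):
        # the standard element loop: only the offsets differ per step
        for rt_offs, ra_offs, rb_offs in tree_schedule(vl):
            regs[RT + rt_offs] = regs[RA + ra_offs] + regs[RB + rb_offs]

    regs = [0] * 32
    regs[8:16] = [1, 2, 3, 4, 5, 6, 7, 8]
    sv_add_with_remap(regs, RT=8, RA=8, RB=8, vl=8)
    print(regs[8])    # 36: the reduction result lands in element 0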
(In reply to Jacob Lifshay from comment #5)
> additional conversation here:
> https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice/near/270315967

please do post relevant parts of that into here, or come up with a way
for the discussion there to be automatically archived under Libre-SOC
resources, in order to satisfy the Transparency obligations we're under.
likewise anything on "github" or other external resource not under our
control [or responsibility]
    def reduce(vl, vec, pred):
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    vec[i] += vec[other]
                pred[i] |= other_pred
            step *= 2

would turn into something like this:

    def i_yielder(....)
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    yield i
                pred[i] |= other_pred
            step *= 2

    def other_yielder(....)
        j = 0
        vi = []    # array of lookup indices to skip nonpredicated
        for i, pbit in enumerate(pred):
            if pbit:
                vi[j] = i
                j += 1
        step = 2
        while step <= vl
            halfstep = step // 2
            for i in (0..vl).step_by(step)
                other = vi[i + halfstep]
                i = vi[i]
                other_pred = other < vl && pred[other]
                if pred[i] && other_pred
                    yield other
                pred[i] |= other_pred
            step *= 2

and now that fits the SVREMAP pattern:

    for i in range(some_function_of(VL)):
        RA_offs = yield reduction_yielder_for_RA()    # other
        RB_offs = yield reduction_yielder_for_RB()    # i
        RT_offs = yield reduction_yielder_for_RT()    # also i
        regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]

except for the bit about pred[i] |= other_pred which i will leave you to
work out.

bear in mind that it's fine to be [much] slower restarting in the middle
of a context-switch. so if the pred[i] has to be stored in throw-away
state which is re-computed, that's perfectly fine. [where overloading
some arbitrary register or adding yet another SPR to store it is not
fine]

(matrix-multiply REMAP has to take a long time to work out where it was
interrupted, for non-power-two, because you can't exactly do a cascading
set of MOD/DIV operations in one clock cycle)

also bear in mind that in Rc=1 mode if the predicate being relied on by
the algorithm is overwritten by Rc=1 then you may end up with corrupted
results (even without an interrupt in the middle)
the algorithm we'd want to use is based on the algorithm without
indirection through the vi table:

    def reduce(vl, vec, pred):
        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    vec[i] += vec[other];
                else if other_pred
                    vec[i] = vec[other];
                pred[i] |= other_pred;
        # scalar result is now in vec[0]

idk why the version with vi was added to the wiki, but it destroys most
of the benefits of the above version. the above version specifically
reduces in a tree pattern where changes in vl/pred don't affect where in
the tree add/copy are performed, allowing a cpu to efficiently implement
tree reduction without needing a full crossbar on the alu inputs or
outputs. when some part of pred is 0, it changes the adds to copies, but
doesn't change the position or order of any other operations.
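[editor's note: a direct Python translation of the rust-flavoured pseudocode
above, just to show what it computes for a masked vector; illustrative only,
not a normative SVP64 definition.]

    def reduce(vl, vec, pred):
        step = 1
        while step < vl:
            step *= 2
            for i in range(0, vl, step):
                other = i + step // 2
                other_pred = other < vl and pred[other]
                if pred[i] and other_pred:
                    vec[i] += vec[other]     # both live: add
                elif other_pred:
                    vec[i] = vec[other]      # only 'other' live: copy
                pred[i] = pred[i] or other_pred
        # scalar result is now in vec[0]

    vec = [1, 2, 3, 4, 5, 6, 7, 8]
    pred = [0, 1, 0, 1, 1, 0, 1, 1]
    reduce(8, vec, pred)
    print(vec[0])    # 26 == 2+4+5+7+8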
(In reply to Jacob Lifshay from comment #11)
> the algorithm we'd want to use is based on the algorithm without
> indirection through the vi table:
>
> def reduce(vl, vec, pred):
>     step = 1;
>     while step < vl
>         step *= 2;
>         for i in (0..vl).step_by(step)
>             other = i + step / 2;
>             other_pred = other < vl && pred[other];
>             if pred[i] && other_pred
>                 vec[i] += vec[other];
>             else if other_pred
>                 vec[i] = vec[other];
>             pred[i] |= other_pred;
>     # scalar result is now in vec[0]
>
> idk why the version with vi was added to the wiki,

because it is not possible to have two completely separate types of
operations (one a MV, the other arithmetic), and because internal state
is not permitted (or practical) except that which will fit into SVREMAP
SPRs.

if it is wrong, and you want this in SVP64, it is your responsibility to
work through the full implementation details.

> but it destroys most of
> the benefits of the above version. the above version specifically reduces in
> a tree pattern where changes in vl/pred don't affect where in the tree
> add/copy are performed, allowing a cpu to efficiently implement tree
> reduction without needing a full crossbar on the alu inputs or outputs. when
> some part of pred is 0, it changes the adds to copies, but doesn't change
> the position or order of any other operations.

"changing adds to MVs" is not permitted. changing the input parameter on
an add so that it is *equivalent* to a MV is fine [edit: actually it
isn't]. however, imposing on the decoder to drop 1 (or 1.0) into the
appropriate operation, to make any given operation a MV, is pretty
dicey, especially when it comes to reduce on CR operations. therefore,
on balance, it's not happening (too complex).

an algorithm which tracks where the operand is and therefore *does not
need the MV at all* would on the other hand be perfectly fine, as long
as, again, it fits into SVREMAP. (yes, really, the DCT REMAP does some
incredibly strange-looking non-linear index remapping, don't ask me to
explain it here, it took me 6+ weeks to work out)

so, please can you rewrite the algorithm in a form that uses yield
iterators and does not require a MV. yes this means having a special
index array that allows de-referencing of non-masked elements, and i
suspect that will be perfectly fine.

will post separately with the basic principle
    def reduce(vl, vec, pred):
        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    vec[i] += vec[other];
                else if other_pred
                    vec[i] = vec[other];
                pred[i] |= other_pred;

transform ==>

    def reduce(vl, vec, pred):
        step = 1;
        indices = []
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                other_pred = other < vl && pred[other];
                if pred[i] && other_pred
                    indices[something] = other
                else if other_pred
                    indices[something] = other
                pred[i] |= other_pred;

        step = 1;
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                vec[i] += vec[indices[somethingother]];

and that to then be further morphed to a REMAP yield form
followup: now i think about it, index-redirection will be needed on
*both* i *and* other, such that

    reg[redirect1[i]] += reg[redirect2[other]]

and then, obviously, the yield on RT to return redirect1[i] NOT i, and
the yield on RA to return redirect2[other] NOT other.

the first stage of the remaps will therefore compute the tables
redirect1 and redirect2 such that the MV operation is entirely
redundant.

one critically important reason for this is Vertical-First Mode [yes,
parallel reduce will work in Vertical-First Mode!!!]. if there is a MV
operation substitution then any given ADD (or whatever) will SILENTLY be
replaced with a MV and the programmer will have absolutely no idea why.

yes, really, parallel-reduce in VF mode. i can already see some useful
things there, involving multiple operations tracking the exact same
reduction schedule: doing a divide on the src operand or computing the
inputs to the schedule on-the-fly, or even (this would be horrific)
exiting the reduction loop early depending on partial results, or
"fixing" the values if they happen to overflow. lots of possibilities,
and they are all actively interfered with if a MV is involved
    def reduce_yield(vl, vec, pred):
        step = 1;
        redirects = list(range(vl))
        while step < vl
            step *= 2;
            for i in (0..vl).step_by(step)
                other = i + step / 2;
                ri = redirects[i]
                ro = redirects[other]
                other_pred = other < vl && pred[ro];
                if pred[ri] && other_pred
                    yield (ri, ro)
                else if other_pred
                    redirects[ri] = redirects[ro]
                pred[ri] |= other_pred;

i think you'll find that the above eliminates the need for the MV.

it works by having an array of indices which initially start off 0..VL,
and whenever you would like to *have had* a mv, instead of performing an
actual mv, the data is left *in place* but the "redirect" array is
adjusted dynamically such that a subsequent ADD will take the data from
the in-place location.

there are probably some issues to work through (what if no ADD
operations actually take place, what circumstances cause that, and is it
really a problem)

required state for reentrancy: the predicate and the redirects array.
the predicate can be re-read, no problem, as long as Rc=1 hasn't damaged
it. the redirects array can be either cached or it can be restored
(slowly) by re-running up to SVSTATE.srcstep-1 iterations WITHOUT doing
any ADDs, at which point redirects should be back to where it should be.
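[editor's note: a runnable sketch of the redirect idea. the per-slot liveness
array and the choice to redirect slot i (rather than element ri) are
adjustments made here to get it executable, so treat it as one possible
reading rather than the definitive algorithm. note how, with element 0 masked
out, the result lands in element 1, illustrating the result-location caveat
discussed below.]

    def reduce_yield(vl, pred):
        # yield (dst, src) element pairs for an in-place tree-reduction
        # with no MV: masked-out slots are handled by redirecting
        # indices instead of copying data
        redirects = list(range(vl))   # which element each slot points at
        live = list(pred)             # does slot i currently hold a value?
        step = 1
        while step < vl:
            step *= 2
            for i in range(0, vl, step):
                other = i + step // 2
                if other >= vl or not live[other]:
                    continue                      # nothing to fold in
                if live[i]:
                    yield (redirects[i], redirects[other])   # real add
                else:
                    redirects[i] = redirects[other]          # redirect
                    live[i] = True

    vec  = [1, 2, 3, 4, 5, 6, 7, 8]
    pred = [0, 1, 0, 1, 1, 0, 1, 1]
    for dst, src in reduce_yield(8, pred):
        vec[dst] += vec[src]
    print(vec[1])    # 26 == 2+4+5+7+8: with element 0 masked out the
                     # result lands in element 1, not element 0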
(for people from the portable-simd link, firstly hello! secondly, here is a link explaining Vertical-First Mode https://m.youtube.com/watch?v=fn2KJvWyBKg)
(In reply to Luke Kenneth Casson Leighton from comment #15)
> it works by having an array of indices which initially start off 0..VL
> and whenever you would like to *have had* a mv, instead of performing
> an actual mv, the data is left *in place* but the "redirect" array
> adjusted dynamically such that a subsequent ADD will take the data from
> the in-place location

this btw was exactly how i worked out the in-place DCT algorithm (which
is a world-first achievement, to have in-place DCT, i.e. without having
to use double the number of vector registers).

in that case it turned out that the algorithm used for the index
remapping was incredibly simple and elegant (idx XOR (idx>>1)), but to
get to that stage took a lot of incremental steps (6 weeks worth), and
one of those steps was to rewrite the algorithm as a "yield"er.

i have a sneaking suspicion that even in the case where there is only
one ADD needed, because all but 2 elements are predicated out, the
remapping will still work, because the remapping will identify
(eventually) the rightmost element.

the only case i think not covered would be when *all* elements except
*one* are predicated out. this would be expected to perform a straight
MV, copying the element into element 0, and that will not happen. it can
be "fixed" by adding one extra element (a zero).

one unit test that will definitely be needed and quite interesting is
when the predicated elements are all in the top half (covered by
"other").
(In reply to Jacob Lifshay from comment #4)
> 0    1    2    3    4    5    6    7
> |    |    |    |    |    |    |    |
> @----|----|----|----/    |    |    |
> |    @----|----|---------/    |    |   distance=4
> |    |    @----|--------------/    |
> |    |    |    @-------------------/
> |    |    |    |
> @----|----/    |                       distance=2
> |    @---------/
> |    |
> @----/                                 distance=1
> |
>
> the original (current SimpleV) reduce goes distance=1,2,4...
>
> 0    1    2    3    4    5    6    7
> |    |    |    |    |    |    |    |
> @----/    @----/    @----/    @----/   distance=1
> |    .    |    .    |    .    |
> @---------/    .    @---------/        distance=2
> |    .    .    .    |
> @-------------------/                  distance=4
> |

once converted to REMAP Form there are plenty of bits available that
will not only allow the step to go 1 2 4 8 instead of 8 4 2 1 but also
to invert the order of the inner loop as well, such that the reduction
result ends up in the *last* element not the first. (again this is
already done in DCT/FFT REMAP, for iDCT and iFFT)

the current SVP64 RM Mode fields can then become a shortcut for setting
up the most popular reduction REMAPs
I think what happened is the discussion on Zulip mostly stalled without us deciding on anything other than something like: fast reproducible reductions would be nice, serial reductions are terrible
the crucial thing that's needed, to make this algorithm work, is that it
mustn't include or depend on a "MV" instruction (even as a micro-coded
op).

the trick that i did in FFT and DCT to avoid the need for MVs was to
have an array-of-indices where the *indices* were altered (swapped
around) if you needed to perform a MV. the actual "MV" did *not*
actually occur but instead, because the index had been altered, the next
ADD/MUL/WHATEVER would use the [index-redirected] element as one of its
sources.

it will be... hair-raising but not insurmountable.

the only "problem" being that if there is only one element (somewhere
half-way down the array), then no ADD/MUL/WHATEVER gets applied, and
consequently the result is effectively invalid because no data actually
got "reduced". if a MV was used then obviously that one element would be
MVed down into the lowest-element spot in the result.
imho moving is absolutely necessary if you want to have tree-reductions
run quickly without needing a full-crossbar somewhere on the ALU output
-> other ALU input path.

here, i define running quickly to mean not needing to delay extra clock
cycles because you're running register values through a separate
slow-but-general lane-crossing mechanism.

basically, a tree-reduction only needs to move data in a few set
inter-lane paths, but if you skip moving, then you need to be able to
read from any high-index element when combining with lower index
elements. e.g. if you have registers organized so every 8th element is
matched with each ALU, then tree-reduction with moving allows getting
away with only having the fast inter-lane paths for a few paths:

needed paths with moving (assuming reducing into r0):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--|--|--|--|--|--|--|
    r0.el1/r4.el1/r8.el1/...--X--X--|--|--|--|--|--|
    r1.el0/r5.el0/r9.el0/...--X--|--X--|--|--|--|--|
    r1.el1/r5.el1/r9.el1/...--|--|--X--X--|--|--|--|
    r2.el0/r6.el0/r10.el0/...-X--|--|--|--X--|--|--|
    r2.el1/r6.el1/r10.el1/...-|--|--|--|--X--X--|--|
    r3.el0/r7.el0/r11.el0/...-|--|--|--|--X--|--X--|
    r3.el1/r7.el1/r11.el1/...-|--|--|--|--|--|--X--X

needed paths with moving (reducing into any reg):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--|--|--|--X--|--X--|
    r0.el1/r4.el1/r8.el1/...--X--X--|--|--|--|--|--|
    r1.el0/r5.el0/r9.el0/...--X--|--X--|--|--|--X--|
    r1.el1/r5.el1/r9.el1/...--|--|--X--X--|--|--|--|
    r2.el0/r6.el0/r10.el0/...-X--|--X--|--X--|--|--|
    r2.el1/r6.el1/r10.el1/...-|--|--|--|--X--X--|--|
    r3.el0/r7.el0/r11.el0/...-|--|--X--|--X--|--X--|
    r3.el1/r7.el1/r11.el1/...-|--|--|--|--|--|--X--X

needed paths without moving and remapping instead (assuming reducing
into r0):

    ALU lane:                 0  1  2  3  4  5  6  7
                              |  |  |  |  |  |  |  |
    r0.el0/r4.el0/r8.el0/...--X--X--X--X--X--X--X--X
    r0.el1/r4.el1/r8.el1/...--X--X--X--X--X--X--X--X
    r1.el0/r5.el0/r9.el0/...--X--X--X--X--X--X--X--X
    r1.el1/r5.el1/r9.el1/...--X--X--X--X--X--X--X--X
    r2.el0/r6.el0/r10.el0/...-X--X--X--X--X--X--X--X
    r2.el1/r6.el1/r10.el1/...-X--X--X--X--X--X--X--X
    r3.el0/r7.el0/r11.el0/...-X--X--X--X--X--X--X--X
    r3.el1/r7.el1/r11.el1/...-X--X--X--X--X--X--X--X

the top right triangle of the remapping crossbar is needed for
reductions with more than 8 elements, since they start needing to read
from registers r4..r7 and r8..r11 and ...
(In reply to Jacob Lifshay from comment #21)
> imho moving is absolutely necessary if you want to have tree-reductions
> run quickly without needing a full-crossbar somewhere on the ALU output
> -> other ALU input path.

not the ISA-design-level's problem and therefore invalid.

> here, i define running quickly to mean not needing to delay extra clock
> cycles because you're running register values through a separate
> slow-but-general lane-crossing mechanism.

this is an architectural designer's decision that should in absolutely
no way cause us, with "ISA Design Hats" on, to base our decisions on it.

> basically, a tree-reduction only needs to move data in a few set
> inter-lane paths, but if you skip moving, then you need to be able to
> read from any high-index element when combining with lower index
> elements.

you are conflating architectural internals with the responsibility of
ISA designers.

the architectural designer may choose to spot the patterns of ADDs (or
whatever) and insert the prerequisite MVs as micro-ops *at their
discretion*.

a FSM architecture (TestIssuer) has no problems of any kind.

another architectural designer may have 12R8W register files and
therefore have no lanes of any kind.

another designer may have cyclic shift buffers which substitute for full
crossbars.

etc etc etc etc.

none of these architectural decisions have anything *at all* to do with
ISA-level design except inasmuch that they all have to be considered
(and to some extent "implementation advice hints" given).

second, bear in mind that the schedule of ops is entirely deterministic
based on having read the predicate mask. ops may be issued (scheduled)
and analysed long before actual execution takes place.

no MVs.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #21)
> > imho moving is absolutely necessary if you want to have tree-reductions run
> > quickly without needing a full-crossbar somewhere on the ALU output -> other
> > ALU input path.
>
> not the ISA-design-level's problem and therefore invalid.
>
> > here, i define running quickly to mean not needing to delay extra clock
> > cycles because you're running register values through a separate
> > slow-but-general lane-crossing mechanism.
>
> this is an architectural designers decision that should in absolutely
> no way cause us with "ISA Design Hats" on to base our decisions on.
>
> > basically, a tree-reduction only needs to move data in a few set inter-lane
> > paths, but if you skip moving, then you need to be able to read from any
> > high-index element when combining with lower index elements.
>
> you are conflating architectural internals with the responsibility of ISA
> designers.
>
> the architectural designer may choose to spot the patterns of ADDs (or
> whatever) and insert the prerequisite MVs as micro-ops *at their discretion*
>
> a FSM architecture (TestIssuer) has no problems of any kind
>
> another architectural designer may have 12R8W register files
> and therefore have no lanes of any kind.
>
> another designer may have cyclic shift buffers which substitute for
> full crossbars.
>
> etc etc etc etc.
>
> none of these architectural decisions have anything *at all* to do with
> ISA-level design except inasmuch that they all have to be considered
> (and to some extent "implementation advice hints" given)

well, your argument for *not* having moves is based off of mostly the
same things iirc: having alus also be able to do moves is *also* only a
microarchitectural decision.

> second, bear in mind, that the schedule of ops is entirely deterministic
> based on having read the predicate mask. ops may be issued (scheduled)
> and analysed long before actual execution takes place.

that's true, it applies to the case with moves too, so isn't a good
argument to choose either moves or no-moves.

> no MVs.

having moves allows a much simpler algorithm imho, as well as having a
consistent place where the reduction result goes (element 0, rather than
whatever random spot the remap algorithm happens to leave it).

I can work out the HDL details if you like. Would creating a fsm that
creates the tree-reduction ops be sufficient for now?
(In reply to Jacob Lifshay from comment #23)
> well, your argument for *not* having moves is based off of mostly the same
> things iirc:
> having alus also be able to do moves is *also* only a microarchitectural
> decision.

ALUs never do MVs. ok, very basic ones would. usually a pseudo-op:
add rt, ra, 0

> > second, bear in mind, that the schedule of ops is entirely deterministic
> > based on having read the predicate mask. ops may be issued (scheduled)
> > and analysed long before actual execution takes place.
>
> that's true, it applies to the case with moves too, so isn't a good argument
> to choose either moves or no-moves.

when combined with Vertical First mode it will melt people's brains, and
also complicate ExtraV coherence, to have the decision in the separated
logic to suddenly do a 2-op MV rather than a 3-op or 4-op
ADD/MUL/whatever. too much.

> > no MVs.
>
> having moves allows a much simpler algorithm imho,

probably, but where's the fun in that? :) also, retrospectively, i found
that DCT REMAP turned out to use Gray Coding. that was... unexpected and
beautiful, and i would never have noticed the simplicity of it if i had
thought "let's take the simple route"

> as well as having a
> consistent place where the reduction result goes (element 0, rather than
> whatever random spot the remap algorithm happens to leave it).

i know. i worked through the caveats: only a single element (all others
predicated out) would be the one that remained invalid. even 2 elements
should / would target the correct result-indexed destination element
regardless of the positions of 2 active bits of predicate.

> I can work out the HDL details if you like. Would creating a fsm that
> creates the tree-reduction ops be sufficient for now?

given that this will end up being a type of REMAP can i recommend
following the path i did there with MATRIX and FFT DCT, which was:

* a braindead obvious python demo algorithm
* conversion to a yield generator which purely returns indices that
* get blatted on top of a scalar op then
* integrate the yield generator into ISACaller
* and then implement the HDL FSM

yes it is a lot of work, it is why MM FFT DCT took me about 8 weeks or
more. ISACaller integration will be essential so might as well do it
incrementally.

the first priority would therefore be to do the braindead demo. i think
the first version i did for MATRIX REMAP didn't even do the mmult
itself, just printed out the indices.

also i would like to see what the algorithm generates (the indices) to
see if it is in fact workable. DCT/FFT REMAP has caveats: power-of-two.
no point spending time doing the HDL FSM if the algorithm turns out to
be borked. need to find that out early.

will find links later
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD

the function there is the earliest REMAP algorithm first implemented,
done as a generator. it is used in ISACaller directly, which is a bundle
of fun because ISACaller itself uses yield. there is also an actual
matrix mult demo/test somewhere.
added python generator for tree-reduce with remap (passed in as a dict or something else indexable) https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89
(In reply to Jacob Lifshay from comment #26)
> added python generator for tree-reduce with remap (passed in as a dict or
> something else indexable)
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89

I got thoroughly distracted trying to make pretty ascii-art tree graphs
for use by the above generator, WIP code here:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f684acfa32ba1ef8c52abc5876da2b73696862dd

(lkcl, please don't remove all the type annotations or refactor all the
code in text_tree_graph.py, it makes it waay easier for me to figure
out, i'll remove them when I'm done... unless you want to finish getting
it working? we should probably put this off for several months at
least...)
(In reply to Jacob Lifshay from comment #26)
> added python generator for tree-reduce with remap (passed in as a dict or
> something else indexable)
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=2fe0ce6285864927127d8226171c566886b87e89

ok good. next step is to remove the MV, this is a hard requirement not
up for debate. see multiple answers and comments above already given,
and see comment #15 for the basic direction of the required final
algorithm.

i thought of a potential solution for the (single, only) case where
there is only one predicate bit in the source, and that is to output a
co-result (new predicate, likely have to be CRs, or r30) containing an
indicator of which elements are valid. in the case where there was more
than 1 src bit the co-result would be 000000001, but in the 1-src-bit
case the co-result would be an *exact* copy of that very same mask.

this solution would only work if the src vector was also the dest
vector, which is not unreasonable, and the final algorithm should take
that into account. i have a suspicion it will be a hard requirement
anyway: there is no way that an algorithm which requires a vector of
temporary registers will be acceptable or viable. all REMAP algorithms
MUST be re-entrant without requiring storage of inflight intermediary
computations; only the existing regfile can (theoretically) be used for
that.

(In reply to Jacob Lifshay from comment #27)
> I got thoroughly distracted trying to make pretty ascii-art tree graphs for
> use by the above generator, WIP code here:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f684acfa32ba1ef8c52abc5876da2b73696862dd

nice

> (lkcl, please don't remove all the type annotations or refactor all the code
> in text_tree_graph.py, it makes it waay easier for me to figure out, i'll
> remove them when I'm done...unless you want to finish getting it working? we
> should probably put this off for several months at least...)

no problem with that
drat. when looking at viota i realised that for scalar reduction, when
predicate masks are involved, everything goes to hell.

the current definition of scalar reduction is one that disables
"stopping" of SV loops if the destination is a scalar, because the
destination can be used as an accumulator.

    sv.add/mr r0, r3.v, r0

this will use r0 as a scalar accumulator.

    sv.madd/mr r0, r8.v, r16.v, r0

this is basically dot-product. a prefix-sum (fibonacci) can be done as:

    sv.addi/mr r1.v, r1.v, r2.v, 1

but - and this is the problem - if predication is enabled on that
sv.addi it all goes to hell.

first question: can a useful schedule be created which skips over
masked-out elements

second question: can the registers be easily determined (which one is
the accumulator)

third question: are there enough bits in MODE to include this.
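[editor's note: a small sketch of why predication breaks the chained
(prefix-sum style) form: each element depends on the previous one having been
written, so naively skipping a masked-out element leaves a hole in the chain.
register numbers, VL and values are illustrative only.]

    VL = 6
    regs = [0] * 32
    regs[8:8 + VL] = [1, 2, 3, 4, 5, 6]   # r8.v source vector
    pred = [1, 1, 0, 1, 1, 1]             # element 2 masked out

    # unpredicated prefix-sum style: dest r2.v overlaps src1 r1.v
    # offset by one; r1 (the chain seed) starts at 0
    for i in range(VL):
        regs[2 + i] = regs[1 + i] + regs[8 + i]
    print(regs[2 + VL - 1])               # 21: full sum in the last element

    # same thing, naively skipping masked-out elements: the chain breaks,
    # because element 3 reads a partial sum that element 2 never wrote
    regs[1:2 + VL] = [0] * (VL + 1)       # clear r1..r7 again
    for i in range(VL):
        if pred[i]:
            regs[2 + i] = regs[1 + i] + regs[8 + i]
    print(regs[2 + VL - 1])               # 15, but the unmasked elements
                                          # sum to 18: result is corrupted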
something along these lines:

* RB-RA (the register *numbers*, not GPR(RB)-GPR(RA)) is computed and
  stored, call it offs
* RT-RA likewise, call it roffs
* a sequence of offsets is computed from the predicate mask. 0b1001101
  would create indices [0, 2, 3, 6], call it idxs

the for-loop therefore goes:

    for i in range(len(idxs)-MAX(roffs, offs)):
        ra = RA + idxs[i]
        rb = RA + idxs[i+offs]
        result = OPERATION(GPR(ra), GPR(rb))
        rt = RA + idxs[i+roffs]
        GPR(rt) = result

taking twin-predication into account would involve using the dest
predicate to create a separate list of indices.

if RT != RA then this could be used to e.g. create a vector sequence of
differences but e.g. skip certain elements.

if RT == RA then as usual for mr and mr/RG (reverse gear) mode this
becomes a cumulative sum.

if RT < RA then the first few starting elements are skipped to always
make sure idxs[] referencing is within range.

the bit that's conceptually new is that the actual register numbers are
computed from the difference to RA as a baseline register.
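[editor's note: a runnable sketch of the scalar-relative idea above. the
reverse-gear option and the example register numbers are additions for
illustration (matching the mr/RG remark); everything else follows the
pseudocode.]

    def pred_to_idxs(mask, vl):
        # 0b1001101 -> [0, 2, 3, 6]
        return [i for i in range(vl) if (mask >> i) & 1]

    def scalar_relative_reduce(GPR, RT, RA, RB, mask, vl, reverse=False):
        offs = RB - RA               # register-number difference, not values
        roffs = RT - RA
        idxs = pred_to_idxs(mask, vl)
        steps = range(len(idxs) - max(roffs, offs))
        for i in (reversed(steps) if reverse else steps):
            ra = RA + idxs[i]
            rb = RA + idxs[i + offs]
            result = GPR[ra] + GPR[rb]
            rt = RA + idxs[i + roffs]
            GPR[rt] = result

    GPR = [0] * 32
    GPR[8:8 + 7] = [1, 2, 3, 4, 5, 6, 7]     # r8.v, VL=7
    # RT == RA == r8, RB == r9 (roffs=0, offs=1), mask 0b1001101 selects
    # elements 0,2,3,6.  with reverse-gear the accumulation chains, so
    # the first unmasked element ends up holding 1+3+4+7 = 15.
    scalar_relative_reduce(GPR, RT=8, RA=8, RB=9, mask=0b1001101, vl=7,
                           reverse=True)
    print(GPR[8])    # 15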
    00  0  dz   sz  normal mode
    00  1  0    RG  scalar reduce mode (mapreduce), SUBVL=1
    00  1  1    /   parallel reduce mode (mapreduce), SUBVL=1
    00  1  SVM  RG  subvector reduce mode, SUBVL>1

add:

    00  1  1    0   parallel reduce mode (mapreduce), SUBVL=1
    00  1  1    1   scalar relative reduce mode