option 1: expand svoffset and maybe add an instruction for dynamically
loading it from a register: the instruction for dynamically loading the
offset should have an option for setting offset = RA >> 6 or -(RA >> 6)
(negate after shift), needed for bigint shifts. there should also be a
plain offset = small_imm +- (RA) option.

proposed pseudo-code:

    v = (RA)
    if sh6:
        v >>= 6  # RA counts bits to shift, we need 64-bit words
    if negate:
        v = -v  # must happen after shift
    offset = v + small_imm  # small_imm could be like a 5-bit signed int
    if shrink_vl:
        VL -= abs(v)  # needed to avoid reading/writing outside the vector

option 2: use twin-predication with special masks -- likely slow or needs
extra hardware to recognize those masks.

https://bugs.libre-soc.org/show_bug.cgi?id=973#c2
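to make the intent of option 1 concrete, here is a minimal Python sketch of
the proposed offset computation. the field names (sh6, negate, small_imm,
shrink_vl) are taken from the pseudo-code above; the actual encoding and
semantics are not decided, this is illustrative only.

    # minimal sketch of the proposed dynamic-svoffset computation (option 1);
    # flag/field names follow the pseudo-code above, nothing here is final
    def dynamic_svoffset(RA, VL, *, sh6=False, negate=False,
                         small_imm=0, shrink_vl=False):
        v = RA
        if sh6:
            v >>= 6          # RA counts bits, convert to 64-bit words
        if negate:
            v = -v           # negate after the shift, as proposed
        offset = v + small_imm
        if shrink_vl:
            VL -= abs(v)     # avoid reading/writing past the end of the vector
        return offset, VL

    # example: shifting a bigint right by r5 = 200 bits with VL = 8
    # gives a word offset of 3 and leaves 5 elements to process
    print(dynamic_svoffset(200, 8, sh6=True, shrink_vl=True))  # (3, 5)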
bug reports like this need to be in the form of software-engineering-style
"requirements", rather than going straight to "this is how it's going to be
done": "a problem is identified" rather than deciding - without a discussion -
what the solution is. particularly when it's part of the spec, we need to
*have* the discussion and evaluate all options before making a decision, in
case there are questions during review.

bigint shifting is the canonical (obvious) case where moving registers up and
down dynamically is needed. the easiest way that's achieved right now is with
twin-predication, setting only the source or dest mask. a source mask with
zeros at the front results in "registers shifted right" (down); a destination
mask with zeros at the front gives "registers shifted left" (up). an
instruction which makes it easy to create masks from integers is therefore a
strong candidate (one that was already under consideration in bitmanip, and
this is a good use-case for it)
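a minimal sketch of the mask behaviour described above, assuming the simplest
twin-predication stepping (masked-out source and destination elements are
skipped independently, remaining ones are paired in order). this is not a
reference model of SVP64, just the idea:

    # twin-predicated element copy (sv.mv-style), simplified
    def twin_pred_copy(src, dst, srcmask, dstmask, VL):
        s = d = 0
        while s < VL and d < VL:
            if not (srcmask >> s) & 1:
                s += 1          # skip masked-out source element
                continue
            if not (dstmask >> d) & 1:
                d += 1          # skip masked-out destination element
                continue
            dst[d] = src[s]
            s += 1
            d += 1
        return dst

    v = [10, 11, 12, 13, 14, 15, 16, 17]

    # source mask with zeros at the front: first 3 source elements skipped,
    # i.e. the register group is "shifted right" (down) by 3
    print(twin_pred_copy(v, [0]*8, 0b11111000, 0b11111111, 8))
    # [13, 14, 15, 16, 17, 0, 0, 0]

    # destination mask with zeros at the front: first 3 destinations left
    # alone, i.e. the register group is "shifted left" (up) by 3
    print(twin_pred_copy(v, [0]*8, 0b11111111, 0b11111000, 8))
    # [0, 0, 0, 10, 11, 12, 13, 14]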
(In reply to Luke Kenneth Casson Leighton from comment #1)
> the easiest way that's achieved
> right now is with twin-predication, setting only source or dest mask.

The problem with using twin-predication is that we would need special
hardware to recognize those special masks and to optimize it to use
slideup/down hardware (where elements can be grouped into large chunks),
rather than general permutation hardware (where each element is processed
independently -- likely just as slow as svindex).

> source mask with zeros at the front would result in "registers shifted
> right" (down). destination mask with zeros at the front is "registers
> shifted left" (up). an instruction which makes it easy to create
> masks from integers is therefore a strong candidate (one that was
> already under consideration in bitmanip, and this is a good use-case
> for it)

well, we already have instructions that would work relatively well for
setting masks:

    # r5 is shift amount in bits, r32..32+VL is the little-endian bigint to shift
    srdi r4, r5, 6  # convert from bits to 64-bit words
    # a bitmanip op would only replace the li/sld pair, the rest of the insns
    # are still needed
    li r3, -1
    sld r3, r3, r4
    # initialize carry-in to the sign bit; use li r6, 0 for an unsigned shift
    sv.sradi r6, r32, 63
    sv.dsrd/mrr *r16, *r32, r5, r6
    # copy back to r32..32+VL, shifting right by r4 words
    sv.addi/srcmask=r3 *r32, *r16, 0
    # final result is in r32..32+VL-r4

for shift left, just set destmask=r3 instead of srcmask=r3. the copy would
need to be preceded by clearing r32..
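for reference, the end result the listing above is aiming for, expressed at
word granularity in plain Python (a sketch of the intended result only, not a
model of sv.dsrd / sv.addi or of the mask mechanics):

    # arithmetic right shift of a little-endian bigint stored as 64-bit words
    def bigint_shr_arith(words, shift_bits):
        n = len(words)
        # reconstruct the signed value from little-endian 64-bit words
        val = sum(w << (64 * i) for i, w in enumerate(words))
        if words[-1] >> 63:                   # sign bit set
            val -= 1 << (64 * n)
        val >>= shift_bits                    # Python >> is arithmetic
        word_shift = shift_bits >> 6          # same as "srdi r4, r5, 6"
        # the result occupies n - word_shift words (r32..32+VL-r4 above)
        return [(val >> (64 * i)) & (2**64 - 1) for i in range(n - word_shift)]

    # example: 4-word bigint shifted right arithmetically by 70 bits
    words = [0x0123456789abcdef, 0xfedcba9876543210,
             0x0f0f0f0f0f0f0f0f, 0x7777777777777777]
    print([hex(w) for w in bigint_shr_arith(words, 70)])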
(In reply to Jacob Lifshay from comment #2)
> The problem with using twin-predication is that we would need special
> hardware to recognize those special masks and to optimize it to use
> slideup/down hardware (where elements can be grouped into large chunks),

... and? that is neither a problem nor is it "our" problem. it is exactly the
kind of thing that goes into "Architectural Notes", advising implementors to
do precisely and exactly that, should they want high performance. failure to
provide optimal hardware is not a good reason to either limit or add further
complexity to an ISA. SIMD is at the total opposite end of the spectrum from
Scalable Vector ISAs: the entire premise of Scalable Vectors is precisely that
the hardware is more intelligent, where SIMD imposes internal microarchitecture
on the programmer. but we cannot keep adding more complexity to the REMAP side
of the ISA, it is already causing me some concern.

> rather than general permutation hardware (where each element is processed
> independently -- likely just as slow as svindex).

cyclic buffer. discussed and described many times at least 18 months ago.
introduces latency but is far lower cost. svindex is not "automatically and
inherently slow", i have no idea why you preserve that false notion in your
mind. it requires extra reads (cached) on the GPR file, and that does not
*automatically* make it "slow" under all possible circumstances and all
possible implementations.

> li r3, -1
> sld r3, r3, r4

hmm hmm i thought it would be more complex than that. the plan/thought was to
have general-purpose (RA, RB) direct access to MASK(), able to specify the
start as well as the end. the above is equivalent (i think?) to MASK(0, 63-r4)
-- is that right?
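a quick way to sanity-check the MASK() question above is to compare the two
definitions directly. a minimal sketch, assuming the Power ISA MSB0 convention
for MASK(mb, me) (ones from bit mb through bit me inclusive, bit 0 being the
MSB) and a shift amount r4 in the range 0..63:

    # ones from MSB0 bit mb through MSB0 bit me (mb <= me assumed here)
    def MASK(mb, me, width=64):
        return ((1 << (me - mb + 1)) - 1) << (width - 1 - me)

    # li r3, -1 ; sld r3, r3, r4  (truncated to 64 bits)
    def li_neg1_sld(r4):
        return ((2**64 - 1) << r4) & (2**64 - 1)

    assert all(MASK(0, 63 - r4) == li_neg1_sld(r4) for r4 in range(64))
    print("MASK(0, 63-r4) == li -1; sld  for r4 in 0..63")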
turns out bigint divmod using knuth's algorithm d most likely needs dynamic
svoffset too, since the outer loop offsets where some of the stuff in the
inner loop is read/written. see bug #1044 comment #35

simplified loop is like:

    for j in ...:
        qhat = divrem(un[j], vn[n - 1])
        qhat = fixup(qhat, un[j - 1], vn[n - 2])
        for i in ...:
            un[i + j] -= vn[i] * qhat  # plus some carry stuff
        if borrowed:
            qhat -= 1
            for i in ...:
                un[i + j] += vn[i]  # plus some carry stuff
        q[j] = qhat
(In reply to Jacob Lifshay from comment #4)
> turns out bigint divmod using knuth's algorithm d most likely needs dynamic
> svoffset too,

use a 1<<r3 predicate mask with a vector reg, or in Vertical-First use a
scalar reg with sv.xxxx. those are the two main ways to do dynamic single-reg
targeting.

l.
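for illustration, the "1<<r3 predicate" technique boils down to selecting a
single dynamically-chosen element of a vector operation. a minimal Python
sketch of the idea (not SVP64 semantics):

    # a single-bit mask selects exactly one element of a vector op,
    # giving dynamically-indexed scalar access into a vector reg group
    def masked_add_imm(vec, mask, imm):
        return [v + imm if (mask >> i) & 1 else v for i, v in enumerate(vec)]

    r3 = 2                       # dynamically computed element index
    vec = [100, 101, 102, 103]
    print(masked_add_imm(vec, 1 << r3, 5))   # [100, 101, 107, 103]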
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> > turns out bigint divmod using knuth's algorithm d most likely needs dynamic
> > svoffset too,
>
> use a 1<<r3 predicate mask with a vector reg,
> or in Vertical-First use a scalar reg with sv.xxxx.
>
> those are the two main ways to do dynamic single-reg
> targeting.

that's fine for scalar, but bigint divmod most likely needs something like
the following:

    for j in ...:
        # more stuff in j loop
        # j is dynamic and would be nice to use svoffset
        for i in range(VL):
            prod[i] = maddedu(vn[i], qhat, ...)
        for i in range(VL):
            # different inputs use/don't-use j, so even
            # twin-predication can't do this:
            un[i + j] = subfe(prod[i], un[i + j])
        # more stuff in j loop
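a small sketch of why the second inner loop above wants a dynamic offset:
prod[i] is read at a fixed base, while un[i + j] is both read and written at
a base that moves with the outer-loop index j. plain Python stand-in, ignoring
the carry chain (subfe) for clarity:

    # inner step: subtract the partial products from a window of the running
    # remainder un, where the window slides by j words each outer iteration
    def inner_step(un, prod, j, VL):
        # with dynamic svoffset this whole loop would be one vector subtract
        # whose un operands are offset by j words
        for i in range(VL):
            un[i + j] = (un[i + j] - prod[i]) % 2**64
        return un

    un = list(range(100, 108))       # 8-word running remainder
    prod = [1, 2, 3]                 # VL = 3 partial products
    for j in range(3):               # outer loop: window into un slides by j
        inner_step(un, prod, j, VL=3)
    print(un)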