in store or load operations, it's unclear to me whether(/when?) elwidth (dest or src, respectively) affects the use of the corresponding register operand, and of the scalar or vector referenced by it.
to keep the question and the answer simpler, I'll make this specifically about the ld instruction and elwidth_src, as in bug 570 (it was mostly unrelated to the main theme of that bug, so I guess I should have filed a separate bug to begin with). So consider:
svp64 elwidth_src=<nondefault> ld rD.v, iOff(rS.[vs])
- the register specifying the address to be loaded from can be scalar or vector. it's not clear how the use of the address register, and of the memory location/s named in it, relates to elwidth_src. I suppose the question could be rephrased as "[when] does the .[vs] (and other source-related parts of the prefix) apply to the register [scalar/vector] proper, or to the memory [scalar/vector] referenced by it/them?"
-- if the load address register is a vector, is it the case that:
--- elwidth_src specifies the address width, and we take consecutive elwidth_src-wide addresses from the address vector, and load full dwords (or elwidth[_dest]-sized objects?) from each such (extended) address? (this appears to be the case for the pseudo-code given under 4.4)
--- or does elwidth_src specify the width of each load, and we take consecutive dword-wide addresses from the address vector for each elwidth_src-wide load?
-- if the load address register is a scalar, is it the case that:
--- elwidth_src specifies the width of each load, and we take consecutive elwidth_src-wide elements starting from the address given by the full address register?
--- or does elwidth_src narrow the address register, and we load full dwords starting at that narrowed and re-widened address?
if it were up to me, I'd treat rS in ld as address-wide, vector or scalar, and use elwidth_src to override exclusively the memory access, uniformly. but I'm not pushing for this, just asking what the intended (specified?) semantics is supposed to be, and for that to be made clear in the specification.
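to make the two vector-address readings above concrete, here is a minimal sketch in Python. it is only a model of the question, not of the specification: the flat little-endian register file, the helper names (read_elem, load_reading_a, load_reading_b) and the zero-extension of narrow addresses are all assumptions made for illustration.

```python
# Sketch only: models the two candidate readings, not the SVP64 spec.

def read_elem(buf, byte_offset, width_bytes):
    """Read one little-endian unsigned element from a flat byte buffer."""
    return int.from_bytes(buf[byte_offset:byte_offset + width_bytes], "little")

def load_reading_a(regfile, mem, rs, vl, src_bytes, imm=0):
    """Reading A: elwidth_src narrows the *addresses*, which are packed
    consecutively starting at register rs; every access is a full dword."""
    out = []
    for i in range(vl):
        addr = read_elem(regfile, rs * 8 + i * src_bytes, src_bytes)
        out.append(read_elem(mem, addr + imm, 8))
    return out

def load_reading_b(regfile, mem, rs, vl, src_bytes, imm=0):
    """Reading B: addresses stay 64-bit, one per register (rs, rs+1, ...);
    elwidth_src narrows each *memory access* instead."""
    out = []
    for i in range(vl):
        addr = read_elem(regfile, (rs + i) * 8, 8)
        out.append(read_elem(mem, addr + imm, src_bytes))
    return out
```

with the same register contents the two readings fetch different addresses and different amounts of data, which is why the specification needs to pick one explicitly.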
A symmetrical question/request for clarification and (pointers to?) documentation is implied WRT stores and elwidth[_dest]
I envision the possibility of additional cases and elaboration of the answer when subvl>1.
I've just realized that the phrase "two vectors" in the subject may be both inaccurate and misleading.
so, to try to be abundantly clear, I'm mainly talking about the (potential) vector of addresses, and the (potential) vector of objects it/they refer to, NOT about the vector register that will hold the loaded values.
also, I am mostly sure that in the end only one of the (potential) vectors ends up being an actual vector, though subvl>1 might actually turn out to make both of them actual vectors.
it also occurs to me now to wonder whether there is any case (or way to express) that both are scalars, as in, load this single value from memory, and then place it in all elements of the destination vector.
(In reply to Alexandre Oliva from comment #1)
> I've just realized that the phrase "two vectors" in the subject may be both
> inaccurate and misleading.
> so, to try to be abundantly clear, I'm mainly talking about the (potential)
> vector of addresses, and the (potential) vector of objects it/they refer to,
> NOT about the vector register that will hold the loaded values.
you are therefore probably talking about indexed mode.
i removed indexed mode when illustrating the pseudocode for you because you asked about what is termed "unit stride" mode.
> also, I am mostly sure that in the end only one of the (potential) vectors
> ends up being an actual vector, though subvl>1 might actually turn out to
> make both of them actual vectors.
remember SUBVL is effectively simply a multiplier (number of actual elements = VL*SUBVL) and that SV is never actually switched off: scalars are just "when SUBVL=1 and VL=1"
> it also occurs to me now to wonder now whether there is a any case (or way
> to express) that both are scalars, as in, load this single value from
> memory, and then place it in all elements of the destination vector.
yyyepp. that's standard twin predication VSPLAT behaviour on top of a LDST "thing".
although i think i see where you're going with this: i will have to check.
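a rough sketch of the splat-on-load behaviour described above, assuming a scalar source (the address does not step) and a vector destination under a destination predicate. the function name splat_load, the use of None for untouched elements and the little-endian byte model are illustrative assumptions, not taken from the spec.

```python
# Sketch only: "load one value, place it in every enabled destination element".

def splat_load(mem, addr, vl, width_bytes=8, dest_mask=None):
    """Load a single element from mem[addr] and copy it to each destination
    element whose predicate bit is set; masked-out elements are left as None."""
    value = int.from_bytes(mem[addr:addr + width_bytes], "little")
    if dest_mask is None:
        dest_mask = (1 << vl) - 1  # assume all elements enabled by default
    return [value if (dest_mask >> i) & 1 else None for i in range(vl)]
```

whether hardware performs one memory access or VL identical ones is not modelled here; only the architectural result is.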
idea: re-purpose the 2 bits from src width to specify mode:
* unit strided
* element strided
* imm(r) - straight load
* r(r) - INDEXED load
so, for `ld reg, imm(reg)`, the src elwidth specifies:
0 -- unit stride -- loads from reg + imm + load_size * element_index
1 -- strided with stride of imm -- loads from reg + imm * element_index
written: ld reg, (reg), stride=imm
2, 3 -- reserved -- maybe split imm bits between offset and stride?
written: ld reg, offset_imm(reg), stride=stride_imm
for `ld reg, (base_reg + index_reg)`, the src elwidth specifies the elwidth of index_reg, base_reg is always 64-bit.
similarly for store.
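the immediate-form address calculation proposed above can be sketched as follows. the function name ea_immediate and its argument order are made up for illustration; only the two formulas (modes 0 and 1) come from the proposal, and modes 2 and 3 are left unimplemented because they are reserved.

```python
# Sketch only: effective address per element for `ld reg, imm(reg)`
# under the proposed 2-bit mode field.

def ea_immediate(base, imm, element_index, mode, load_size):
    """Return the effective address of one element.

    mode 0: unit stride    -> base + imm + load_size * element_index
    mode 1: element stride -> base + imm * element_index
    """
    if mode == 0:
        return base + imm + load_size * element_index
    if mode == 1:
        return base + imm * element_index
    raise NotImplementedError("modes 2 and 3 are reserved in the proposal")
```

for example, with base=0x1000, imm=8 and an 8-byte load, element 3 lands at 0x1020 in mode 0 and at 0x1018 in mode 1.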
(In reply to Jacob Lifshay from comment #6)
> so, for `ld reg, imm(reg)`, the src elwidth specifies:
> 0 -- unit stride -- loads from reg + imm + load_size * element_index
> 1 -- strided with stride of imm -- loads from reg + imm * element_index
> written: ld reg, (reg), stride=imm
> 2, 3 -- reserved -- maybe split imm bits between offset and stride?
mmm there's only 16 bits (signed/unsigned), not too keen on limiting expectations (and altering compiler from scalar behaviour)
RVV sets an "ordered/unordered" mode, which is interesting.
other options: select to use the *dest* elwidth as the unit stride multiplier. this will give some weird overlaps when using e.g. ld (64 bit) with dest elwidth=8, and some stranger overlaps for ST.
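to illustrate the overlap concern, here is a small sketch that lists the byte ranges touched when the stride is taken from the dest elwidth rather than from the operation size. the helper names and the half-open range convention are assumptions for illustration only.

```python
# Sketch only: which byte ranges does each element's access touch?

def access_ranges(base, imm, vl, op_bytes, stride_bytes):
    """Half-open [start, end) byte range touched by each element's access."""
    starts = (base + imm + i * stride_bytes for i in range(vl))
    return [(s, s + op_bytes) for s in starts]

def overlapping_pairs(ranges):
    """Index pairs (i, j), i < j, whose byte ranges intersect."""
    return [(i, j)
            for i in range(len(ranges))
            for j in range(i + 1, len(ranges))
            if ranges[i][0] < ranges[j][1] and ranges[j][0] < ranges[i][1]]
```

with ld (op_bytes=8) and a dest elwidth of 8 bits (stride_bytes=1), consecutive accesses start one byte apart and so share seven bytes; with stride_bytes=8 no pair overlaps.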
also, it turns out that when RA is vectorised, unit stride is absurd nonsense.
EA = iregs[RA+i] + i*imm
naah. so i disable unit stride there and make it just:
EA = iregs[RA+i] + i*imm
this leaves the "mode" bits doing nothing. what to do there? can we do anything with the 2 bits? put them back to src elwidth?
> written: ld reg, offset_imm(reg), stride=stride_imm
> for `ld reg, (base_reg + index_reg)`, the src elwidth specifies the elwidth
> of index_reg, base_reg is always 64-bit.
yes, this is just necessary.
get_polymorphic_reg(RA, elwidth=64, i)
rather than elwidth=op_wid.
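a sketch of what that means for the indexed form, assuming the register file is modelled as a flat little-endian byte array and that narrow index elements are zero-extended (the thread does not say whether they are signed). get_polymorphic_reg is named above, but this body, its extra regfile argument, and the indexed_ea helper with its ra_is_vector flag are illustrative guesses.

```python
# Sketch only: base register always read at 64 bits, index register read at
# the src elwidth.

def get_polymorphic_reg(regfile, reg, elwidth, i):
    """Element i, `elwidth` bits wide, counted from the start of 64-bit
    register `reg` in a flat little-endian register file."""
    nbytes = elwidth // 8
    start = reg * 8 + i * nbytes
    return int.from_bytes(regfile[start:start + nbytes], "little")

def indexed_ea(regfile, ra, rb, i, index_elwidth, ra_is_vector=False):
    """EA for element i of an indexed load: 64-bit base plus narrow index."""
    base = get_polymorphic_reg(regfile, ra, 64, i if ra_is_vector else 0)
    index = get_polymorphic_reg(regfile, rb, index_elwidth, i)
    return base + index
```

reading the base at op_wid instead of 64 would truncate the address, which is the failure the fixed elwidth=64 avoids.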
> similarly for store.
yes. it's the EA (effective address)
question is, does "mode" do anything useful? 2 bits, 4 options... i'm honestly not thinking of anything that really stands out except perhaps ordered/unordered
(In reply to Luke Kenneth Casson Leighton from comment #7)
> RVV sets an "ordered/unordered" mode, which is interesting.
and also breaks the expectation of compliance with scalar "Program Order".
so, that's out.
after some thought, the only place i'm seeing it necessary to add a different
mode is on the immediate-version, when the source RA is scalar. funnily
enough this meshes with the fail-first idea which we saw back in bug #561
there's enough bits there to do strange things. just needs properly going
arg the entire table "mode" makes no sense. reduce on LD/ST? err...
the whole thing needs shuffling.