Bug 570 - svp64 vector loads: sub-dword selection before or after byte-reversal
Status: RESOLVED INVALID
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Other
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2021-01-06 21:58 GMT by Alexandre Oliva
Modified: 2021-01-07 00:20 GMT
Description Alexandre Oliva 2021-01-06 21:58:34 GMT
Last night, while going over https://libre-soc.org/simple_v_extension/appendix/ with a particular focus on ld's operation with an elwidth override for the src, I found various details missing from the specification.

- 3.5 depicts loading full words (on ppc64, presumably dwords), and you said (I suppose it's written somewhere) that BE loads undergo byte-reversal as quickly as possible.  Then, as we get on to 4 (not yet its subsections), there's pseudocode for polymorphism that deals with accessing parts of a whole register.  This suggested to me that the sub-indexing of the value loaded from memory would take place after byte-reversal.  That is probably not the case: it would mess up the order of loading sub-register vector elements in BE mode even when using an elwidth_src that matched the vector element size (as opposed to wider loads).

- even 4.4 doesn't specify when byte-reversal is to take place when accessing sub-words.  Normally, sub-word offsetting in BE counts from the opposite end to LE.  If we're departing from such fundamental assumptions about endianness even when dealing with memory, as we seem to be doing, we have to go way out of our way to make this abundantly clear: specifying explicitly how wide memory fetches are (4.4 seems to do that); stating explicitly when the expected and usually-implied BE transformations are NOT to be made (e.g. when computing the offs modulo in 4.4); and stating at which point the byte-reversal of loaded dwords or sub-dwords is to take place.

- likewise, when we use an array of sub-dword types as a model, even if you state somewhere that the register holds data that has been byte-swapped into LE mode, there must be explicit warnings that indexing in that model does not meet the normal expectations of CPU data endianness.  Specifically, even if the CPU is in BE mode, vector element [0] is to be at the sub-dword holding bit 2^0, not the one holding bit 2^63 as would normally be the case in BE mode.  E.g., loading byte vectors in BE mode in wider-than-byte loads requires undoing the byte-reversal, so that the first element lands around bit 2^0 rather than 2^63.  For sub-dword types wider than a byte, there is no simple way to shuffle the elements into place after a wide BE load; they *have* to be loaded individually to fall into place (assuming the previous point doesn't invalidate even this way of loading sub-dword BE vectors).
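
The element-ordering concern in the three points above can be illustrated with a small Python sketch. It is only a model of the question being raised, not of the SVP64 specification: the helper `elements` is hypothetical, and it assumes that element [0] always sits at the least significant bits of the register, as described above.

```python
def elements(reg, elwidth_bits, count):
    """Split a register value into elements, element 0 at the least significant bits.

    Hypothetical helper: models the 'array of sub-dword types' view of a register.
    """
    mask = (1 << elwidth_bits) - 1
    return [(reg >> (elwidth_bits * i)) & mask for i in range(count)]

# Eight consecutive bytes in memory: 0x10 at the lowest address, 0x17 at the highest.
mem = bytes(range(0x10, 0x18))

le_reg = int.from_bytes(mem, "little")  # dword load without byte-reversal
be_reg = int.from_bytes(mem, "big")     # dword load as a BE CPU would see it

# Byte elements: the LE load keeps memory order, the BE dword load reverses it.
le_bytes = elements(le_reg, 8, 8)
be_bytes = elements(be_reg, 8, 8)

# 16-bit elements stored big-endian in memory (0x1011, 0x1213, 0x1415, 0x1617).
be_halves = elements(be_reg, 16, 4)  # correct values, reversed element order
le_halves = elements(le_reg, 16, 4)  # correct element order, bytes swapped inside each
# Loading each element individually as BE gives both the right order and values.
per_element = [int.from_bytes(mem[2 * i:2 * i + 2], "big") for i in range(4)]
```

In this model neither whole-dword load yields the 16-bit elements in place: one reverses the element order and the other swaps the bytes within each element, which is the situation the third point describes.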


- the register specifying the address to be loaded from can be scalar or vector.  It's not clear how the use of the address register and of the memory location/s named in it relate to elwidth_src.

-- if the load address register is a vector, is it the case that:

--- elwidth_src specifies the address width, and we take consecutive elwidth_src-wide addresses from the address vector, and load full dwords (or elwidth[_dest]-sized objects?) from each such (extended) address?  (this appears to be the case for the pseudo-code given under 4.4)

--- or does elwidth_src specify the width of each load, and we take consecutive dword-wide addresses from the address vector for each elwidth_src-wide load?

-- if the load address register is a scalar, is it the case that:

--- elwidth_src specifies the width of each load, and we take consecutive elwidth_src-wide elements starting from the address given by the full address register?

--- or does elwidth_src narrow the address register, and we load full dwords starting at that narrowed and re-widened address?
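
To make the alternatives above concrete, here is a rough Python sketch of the two readings for a vector address register, plus the first reading for a scalar one. Everything in it is an assumption made for illustration: the toy memory `MEM`, the helpers `load` and `split`, the choice of 16 bits for elwidth_src, and the little-endian packing of the register file. It does not say which reading the specification intends.

```python
MEM = bytes(range(256))  # toy memory: the byte at address a holds the value a

def load(addr, width_bytes):
    """Little-endian load of width_bytes bytes from the toy memory."""
    return int.from_bytes(MEM[addr:addr + width_bytes], "little")

def split(regs, width_bytes, count):
    """Read `count` consecutive width_bytes-wide elements from a packed register image."""
    return [int.from_bytes(regs[i * width_bytes:(i + 1) * width_bytes], "little")
            for i in range(count)]

addresses = (0x10, 0x20, 0x30, 0x40)

# Reading A: elwidth_src (here 16 bits) is the width of each address element;
# every address is extended and a full dword is loaded from it.
regs_a = b"".join(a.to_bytes(2, "little") for a in addresses)
reading_a = [load(a, 8) for a in split(regs_a, 2, 4)]

# Reading B: addresses stay dword-wide; elwidth_src is the width of each load.
regs_b = b"".join(a.to_bytes(8, "little") for a in addresses)
reading_b = [load(a, 2) for a in split(regs_b, 8, 4)]

# Scalar address, first reading: consecutive elwidth_src-wide elements
# starting at the address held in the full register.
base = 0x10
scalar_consecutive = [load(base + 2 * i, 2) for i in range(4)]
```

Reading A consumes four addresses from 8 bytes of register space and returns dwords, while reading B needs 32 bytes of register space and returns 16-bit values, so the two are easy to tell apart once the specification states which one applies.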
Comment 1 Luke Kenneth Casson Leighton 2021-01-07 00:17:10 GMT
(In reply to Alexandre Oliva from comment #0)
> Last night, while going over
> https://libre-soc.org/simple_v_extension/appendix/ with a particular focus
> on ld's operation with an elwidth override for the src, I found various
> details missing from the specification.

apologies, i should have mentioned: that was the older version of the spec, relevant exclusively to RISC-V. given that RV had BE removed at around version 3 (RISC-III), it was not discussed at all.

the reason i referred to that older spec was to illustrate to Cole that there did exist walkthroughs for twin element width overrides.



> whole register.  This suggested to me that the sub-indexing of the value
> loaded from memory would take place after byte-reversal.

the pseudocode as modified and derived in the other bugreport will be the correct pseudocode.

what you are looking at in the RV variant is unfortunately not relevant as far as bytereversal is concerned, i apologise: it is relevant only for elwidths and extension.

given that this is the case, can i recommend closing this one and starting again from the pseudocode listed in 567:

https://bugs.libre-soc.org/show_bug.cgi?id=567#c2

unfortunately almost all of what you wrote is invalid when viewed without the addition of the (fully OpenPOWER v3.0B Compliant) ld/brx-LE/BE bytereversing that gets everything into NEON-style internal representation and ordering as far as Vectorisation is concerned.

i will go over it thoroughly and make sure nothing was missed but, realistically, we need to close this one as invalid. sorry about that.
Comment 2 Luke Kenneth Casson Leighton 2021-01-07 00:18:07 GMT
(will raise a new one, immediately)