Bug 571 - svp64 vector loads: sub-dword selection before or after byte-reversal
Summary: svp64 vector loads: sub-dword selection before or after byte-reversal
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Alexandre Oliva
URL:
Depends on:
Blocks:
 
Reported: 2021-01-07 00:19 GMT by Luke Kenneth Casson Leighton
Modified: 2021-01-07 15:22 GMT
2 users (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Description Luke Kenneth Casson Leighton 2021-01-07 00:19:31 GMT
a review of the pseudocode is needed at
https://bugs.libre-soc.org/show_bug.cgi?id=567#c2
Comment 1 Luke Kenneth Casson Leighton 2021-01-07 00:26:09 GMT
we had to close bug #570 because it was discussing an invalid version of the spec.

the question is: what is the order of indexing vs bytereversal for getting data into internal arithmetic order?

the answers are deducible from how an SV loop must appear to be exactly the same as if sequential LD operations at the scalar level had been in the instruction stream, instead.

this DEFINES how SV operates.  deviations are not permitted, except where things clearly break or are simply never considered before.

thus:

ld{brx} in both LE *OR* BE when VL=2 is **DEFINED** to be two sequential ld{brx} operations, back-to-back, executed in Program Order.

more in a minute
Comment 2 Luke Kenneth Casson Leighton 2021-01-07 01:06:46 GMT
a copy of that pseudocode from the other bugreport.  bear in mind that this is "unit strided" mode, which increments the (normally fixed, constant) immediate offset by an additional amount (in bytes), src_elwidth/8.

the pseudocode will therefore be as follows (assume src_elwidth=64 to indicate 64-bit reads):

    function op_ld(rd, rs, brev) # LD not VLD! (ldbrx if brev=True)
      for (int i = 0, int j = 0; i < VL && j < VL;):

        # unit stride mode, compute the address
        srcbase = ireg[rs] + i * (src_elwidth / 8);

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs 8-byte swap (because src_elwidth=64)
        if (bytereverse):
            memread = byteswap(memread, src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(rd, dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
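
The loop above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the spec: memory is modelled as a `bytes` object read in little-endian order, and the names `scalar_ld` and `vector_ld` are hypothetical. Predication, REMAP and elwidth overrides are left out, as in the pseudocode.

```python
def byteswap(value: int, width_bits: int) -> int:
    """Reverse the byte order of a width_bits-wide unsigned integer."""
    nbytes = width_bits // 8
    return int.from_bytes(value.to_bytes(nbytes, "little"), "big")


def scalar_ld(mem: bytes, addr: int, width_bits: int,
              brev: bool, msr_le: bool) -> int:
    """One scalar load (ld if brev is False, ldbrx if brev is True)."""
    nbytes = width_bits // 8
    # assumption: the raw memory read is little-endian
    memread = int.from_bytes(mem[addr:addr + nbytes], "little")
    # brev XNOR MSR.LE: merges processor LE/BE with ld/ldbrx
    bytereverse = (brev == msr_le)
    if bytereverse:
        memread = byteswap(memread, width_bits)
    return memread


def vector_ld(mem: bytes, base: int, imm_offs: int, vl: int,
              width_bits: int, brev: bool, msr_le: bool) -> list:
    """Unit-strided load: VL back-to-back scalar loads in program order."""
    step = width_bits // 8  # stride in bytes
    return [scalar_ld(mem, base + i * step + imm_offs,
                      width_bits, brev, msr_le)
            for i in range(vl)]
```

With `mem = bytes(range(16))` and VL=2, each of the two 64-bit elements is byte-reversed (or not) on its own; nothing is ever reversed across the full 16 bytes.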


so the first question was: is sub-dword selection before or after bytereversal?  well, the question as asked does not make sense.

the offset selects the *area* of memory containing the element.  there is absolutely no relation between the element indexing and the order of the bytes *in* the element.

the only possible interpretation of the question which might make sense is illustrated by the ARM NEON LDR (Load-Reverse) instruction, where they perform *total* byte-reversal, bytes 0-15 in memory get placed into register bytes 15-0

the pseudocode as listed above SPECIFICALLY does not do that.  however bear in mind that the pseudocode is drastically simplified: REMAP has been removed for example.

REMAP ***IS*** capable of performing the same duties as NEON LDR (and then some)

but let us get clear first about the basics before moving on to that.

* the standard behaviour of SV ld in unit strided mode goes linearly one-for-one contiguously through memory as it goes contiguously up the register numbers

* bytereversal as defined and required for v3.0B compliance *REQUIRES* the XNOR oddness which removes endianness at the memory level and places data into registers that, internally, become DEFINED as NEON-like in behaviour.  byte 0 contains the LSByte; bit 0 contains the LSB.

* AT NO TIME (when REMAP is inactive) is any other reordering, remapping, or definitions in play.

* AT NO TIME (when REMAP is inactive) will elements be anything other than linear, sequential and contiguous, both for src in computing the unit stride memory offset and for dest in picking the target register

the next complication is elwidth overrides (which was where the old SV appendix came in handy)

the dest elwidth part is easy: the registers are defined via the typedef union, and by the set/get polymorphic pseudocode, and with the ordering of elements clear (linear, byte 0 given index 0) and the internal element definition also being clear (linear, LE) i.e. exactly as NEON, the placement of elements is straightforward.
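
As a rough illustration of that placement, here is a minimal Python sketch. It assumes the register file can be viewed as one flat byte array of 64-bit registers in LE order, which is how the typedef union is usually described; the constant `REG_BYTES` and the exact function signatures are assumptions made for this example.

```python
REG_BYTES = 8  # assumption: 64-bit registers, laid out back-to-back


def set_polymorphed_reg(regfile: bytearray, rd: int, bitwidth: int,
                        j: int, value: int) -> None:
    """Place element j (bitwidth bits wide) starting at register rd."""
    nbytes = bitwidth // 8
    offset = rd * REG_BYTES + j * nbytes       # linear: element 0 at byte 0
    value &= (1 << bitwidth) - 1               # keep only bitwidth bits
    regfile[offset:offset + nbytes] = value.to_bytes(nbytes, "little")


def get_polymorphed_reg(regfile: bytearray, rd: int, bitwidth: int,
                        j: int) -> int:
    """Read element j (bitwidth bits wide) starting at register rd."""
    nbytes = bitwidth // 8
    offset = rd * REG_BYTES + j * nbytes
    return int.from_bytes(regfile[offset:offset + nbytes], "little")


# five 16-bit elements starting at register 1: four fill r1, one lands in r2
regfile = bytearray(4 * REG_BYTES)
for j, v in enumerate([0x1111, 0x2222, 0x3333, 0x4444, 0x5555]):
    set_polymorphed_reg(regfile, 1, 16, j, v)
```

Reading the registers back at full width shows element 0 in the least significant bytes of r1, and the fifth element in the low bytes of r2.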

the src elwidth, due to the fact that it is memory, is where it gets odd.

bear in mind we have **THREE** widths here (!)

* ld/lw/lh/lb i.e. the original operation width
* src elwidth override
* dest elwidth override (covered already)

we therefore take the ACTUAL width and the ACTUAL LD as an ACTUAL fully compliant v3.0B LD operation.

this means including the quirky byte-reversal which, as already discussed above, REMOVES all and any evidence of byte-ordering from the data.

now.

***AFTER*** that data is loaded (which will have been at a nonaligned location), and LE/BE taken care of, we now have a byte, or a hword, or a dword etc, that is in its correct Arithmetic Order, with its bit 0 being in bit 0, and byte 0 being in byte 0.  LSByte is in byte0, LSB is in bit 0.

now - *now* - we have to perform src elwidth adjustment.

* for a lh operation which loaded 2 bytes, if src elwidth=32 then this would involve zero-extending to 32 bits

* for a ld operation which loaded 8 bytes, if src elwidth=32 then this would involve *truncation* to 32 bits.

etc. etc.

whilst this may seem weird and redundant because, oink, there is going to be dest elwidth override too, you have to bear in mind that SATURATION Mode can be applied, and that goes IN BETWEEN src and dest elwidth overrides.

we can therefore have a case where:

* lw loads 32 bit elements
* src override is 16 which truncates
* dest does not have an override so the data (now 16 bits long) is placed in a full 64 bit register and the upper 48 bits set to zero.

and many others that make for spectacularly comprehensive combinations.
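
The chain of three widths can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, memory is read as little-endian with no byte-reversal, and Saturation Mode (which would sit between the src and dest steps) is not modelled.

```python
def load_le(mem: bytes, addr: int, op_bits: int) -> int:
    """The actual load: op_bits wide (8 for lb, 16 lh, 32 lw, 64 ld)."""
    return int.from_bytes(mem[addr:addr + op_bits // 8], "little")


def to_width(value: int, bits: int) -> int:
    """Truncate to bits; a narrower value is implicitly zero-extended."""
    return value & ((1 << bits) - 1)


def load_element(mem: bytes, addr: int, op_bits: int,
                 src_elwidth: int, dest_elwidth: int = 64) -> int:
    loaded = load_le(mem, addr, op_bits)   # 1. operation width
    src = to_width(loaded, src_elwidth)    # 2. src elwidth override
    # (saturation would be applied here, between src and dest)
    return to_width(src, dest_elwidth)     # 3. dest elwidth (default 64)


mem = bytes.fromhex("8877665544332211")
lw_src16 = load_element(mem, 0, 32, 16)   # lw, src override 16: truncates
lh_src32 = load_element(mem, 0, 16, 32)   # lh, src override 32: zero-extends
ld_src32 = load_element(mem, 0, 64, 32)   # ld, src override 32: truncates
```

Here `lw_src16` is `0x7788` in a 64-bit destination with the upper 48 bits zero, matching the lw case listed above.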

i leave it at that for now, i will re-read 570 sections on elwidths to see if those were valid.
Comment 3 Luke Kenneth Casson Leighton 2021-01-07 01:36:10 GMT
did a readthrough: the one that was relevant is, where does the src width come from vs what is the unit stride?

and in the *older* spec which did not have dual elwidth overrides, the src elwidth *was* the stride width *was* the LD operation width (lb, lh, lw, ld).

however since we introduced twin elwidths (for saturation) we have three widths, and therefore it makes sense that:

* the unit stride multiplier comes from the operation (lb, lh, lw, ld)
* the post-load truncation/extension comes from the src elwidth override
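
A small Python sketch of that split, assuming little-endian memory and no byte-reversal (the function name `unit_stride_load` is made up for this example): the address steps by the operation width, and only the loaded value is narrowed by the src elwidth override.

```python
def unit_stride_load(mem: bytes, base: int, vl: int,
                     op_bits: int, src_elwidth: int) -> list:
    """Return (address, value) pairs for a unit-strided load."""
    step = op_bits // 8                  # stride comes from lb/lh/lw/ld
    mask = (1 << src_elwidth) - 1        # truncation comes from src elwidth
    result = []
    for i in range(vl):
        addr = base + i * step
        value = int.from_bytes(mem[addr:addr + step], "little")
        result.append((addr, value & mask))
    return result


# lw (32-bit operation) with a src elwidth override of 16, VL=3
pairs = unit_stride_load(bytes(range(12)), 0, 3, 32, 16)
```

The addresses advance by 4 bytes per element even though only the low 16 bits of each loaded word survive.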
Comment 4 Jacob Lifshay 2021-01-07 01:53:24 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> the only possible interpretation of the question which might make sense is
> illustrated by the ARM NEON LDR (Load-Reverse) instruction, where they
> perform *total* byte-reversal, bytes 0-15 in memory get placed into register
> bytes 15-0

Sorry, that's just incorrect: the LDR instruction is ARM's standard load-register instruction. In LE mode, the bytes are not reversed, in BE mode they are, but only because all memory accesses use reversed endian by default, not because LDR is special.
Comment 5 Alexandre Oliva 2021-01-07 14:15:36 GMT
the pseudocode in comment 2 (and in bug 567's comment 2; I'd somehow missed the email about it, and went straight to the simple_v specs as if that was the only answer) makes this mostly clear, thanks

the only unstated implication that comes to mind is that mem and bytereverse are supposed to be polymorphic themselves, and operate on integer types of the source-vector element width.

this can be guessed, but it's not certain from the notation; I think adding the bit-width as a parameter to both pseudofunctions would make it crystal-clear.
Comment 6 Luke Kenneth Casson Leighton 2021-01-07 14:30:26 GMT
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > the only possible interpretation of the question which might make sense is
> > illustrated by the ARM NEON LDR (Load-Reverse) instruction, where they
> > perform *total* byte-reversal, bytes 0-15 in memory get placed into register
> > bytes 15-0
> 
> Sorry, that's just incorrect: the LDR instruction is ARM's standard
> load-register instruction.

ah, interesting, thank you for the correction.  it was used in that NEON-LLVM write-up.  my take on LDR (on NEON regs), from what i could infer, was that it "effectively" performed byte-reversal [in BE mode]; i assumed that was its sole purpose.

i need to add a section on the Appendix covering this, the new one:
https://libre-soc.org/openpower/sv/svp64/appendix/

not the old one:
https://libre-soc.org/simple_v_extension/appendix/
Comment 7 Luke Kenneth Casson Leighton 2021-01-07 14:42:22 GMT
(In reply to Alexandre Oliva from comment #5)
> the pseudocode in comment 2 (and in bug 567's comment 2; I'd somehow missed
> the email about it, and went straight to the simple_v specs as if that was
> the only answer) makes this mostly clear, thanks
> 
> the only unstated implication that comes to mind is that mem and bytereverse
> are supposed to be polymorphic themselves,

mm.... strictly... let me think it through.... this was never true.  the lb/lh/lw/ld even on the older RISC-V version of SV (ok ok the RV version didn't *have* bytereversal) had to take the width from the operation, not the SV polymorphic/elwidth overrides.

with the mem-load (and now with OpenPOWER the bytereversal) being a property of the memory not the core it was - and still is - on the "other side".

most operations in OpenPOWER simply don't allow specifying a width: LD/ST is one of the very few.


> and operate on integer types of
> the source-vector element width.
> 
> this can be guessed, but it's not certain from the notation; I think adding
> the bit-width as a parameter to both pseudofunctions would make it
> crystal-clear.

hmm if all of the SV context parameters are added it makes the line too long.
what i will do instead is add "svctx" as a parameter, then people
can go "svctx.elwidth, oh ok that's passed in"
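
One possible shape for such a context parameter, sketched in Python. The class name `SVContext`, its fields and the helper `src_adjust` are all hypothetical, chosen only to show the idea of passing one object instead of several widths; the actual spec pseudocode may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVContext:
    """Hypothetical bundle of SV state passed to the pseudofunctions."""
    vl: int
    src_elwidth: int
    dest_elwidth: int


def src_adjust(value: int, svctx: SVContext) -> int:
    """Post-load truncation/zero-extension taken from svctx.src_elwidth."""
    return value & ((1 << svctx.src_elwidth) - 1)


ctx = SVContext(vl=2, src_elwidth=16, dest_elwidth=64)
adjusted = src_adjust(0x12345678, ctx)
```

The call site stays short, and a reader can see from `svctx.src_elwidth` where the width comes from.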
Comment 8 Luke Kenneth Casson Leighton 2021-01-07 15:22:58 GMT
any good?

https://libre-soc.org/openpower/sv/svp64/appendix/#ldst

(moved to https://libre-soc.org/openpower/sv/ldst/)