567 – Allow transparent scalar loads and stores to/from registers allocated as vectors

Bug 567 - Allow transparent scalar loads and stores to/from registers allocated as vectors

Summary: Allow transparent scalar loads and stores to/from registers allocated as vectors

Status:	RESOLVED INVALID

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Cesar Strauss

URL:

Depends on:
Blocks:

Reported:	2021-01-05 14:17 GMT by Cesar Strauss
Modified:	2021-01-07 00:20 GMT (History)
CC List:	4 users (show)

See Also:	560 571
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Cesar Strauss 2021-01-05 14:17:06 GMT

First, note that I'm preemptively marking this as deferred, as I understand that we have a kind of feature-freeze right now.

Also, apologies to Alexander, this is actually his proposal, in the way I see it. Which is orthogonal to Jacob's proposal, that has to do with bitcast.

So, as I see from the NEON article, a 64-bit register can be partitioned into a certain number of "lanes" of varying widths. For instance:

8 x  8-bit lanes: L[0] L[1] L[2] L[3] L[4] L[5] L[6] L[7]
4 x 16-bit lanes: L[0] L[1] L[2] L[3]
2 x 32-bit lanes: L[0] L[1]
1 x 64-bit lane:  L[0]

There must be a mapping, of each lane, to a range of bits within the register.

We can choose arbitrarily, as long as we are consistent. One choice is:
8-bit:
L[0] -> bits 0 to 7
L[1] -> bits 8 to 15
L[2] -> bits 16 to 23
L[3] -> bits 24 to 31
L[4] -> bits 32 to 39
L[5] -> bits 40 to 47
L[6] -> bits 48 to 55
L[7] -> bits 56 to 63
16-bit:
L[0] -> bits 0 to 15
L[1] -> bits 16 to 31
L[2] -> bits 32 to 47
L[3] -> bits 48 to 63
32-bit:
L[0] -> bits 0 to 31
L[1] -> bits 32 to 47
64-bit:
L[0] -> bits 0 to 63

Another choice is:
8-bit:
L[0] -> bits 56 to 63
L[1] -> bits 48 to 55
L[2] -> bits 40 to 47
L[3] -> bits 32 to 39
L[4] -> bits 24 to 31
L[5] -> bits 16 to 23
L[6] -> bits 8 to 15
L[7] -> bits 0 to 7
16-bit:
L[0] -> bits 48 to 63
L[1] -> bits 32 to 47
L[2] -> bits 16 to 31
L[3] -> bits 0 to 15
32-bit:
L[0] -> bits 32 to 47
L[1] -> bits 0 to 31
64-bit:
L[0] -> bits 0 to 63

Notice, that it's just bit allocations. There isn't any endianess involved, up to now. It only affects the labeling of the "lane write enable" wires for writing, and the "lane valid" wires for reading.

What Alex is proposing, I think, is dynamically switching between the mappings, so that when using a 64-bit scalar load instruction, L[0]=V[0], L[1]=V[1], and so on, irrespective of memory endianess. Correct?

I think this relabeling can be done with a single crossbar on each read port of just the 8-bit predicate mask register, no need to shuffle the actual 64-bit register contents. The partitioned ALUs do not care about which lane number is assigned to each partition number, as long as the predicate mask is correct.

Also, this was only for 64-bit load and stores, I wonder how 32-bit, 16-bit and 8-bit (or even 24-bit, 40-bit, 48-bit, 56-bit) scalar load/stores would work.

Comment 1 Alexandre Oliva 2021-01-05 14:55:47 GMT

My recommendation doesn't involve changing loads or stores in any way.

My recommendation is that the svp64 iteration over sub-register vector elements obeys the in-memory array/vector layout.  I.e., if you load a vector with smaller-than-64-bit elements into a register with a dword load, and iterate over them with svp64, you visit them in the same order you would if iterating over array elements in memory, and in the same order you would if the register held a struct with an array, and you iterated over the elements in it.

This means that while in data LE mode the iteration goes as you wrote under "One choice is", whereas in data BE mode the iteration goes as you wrote under "Another choice is".

Not doing so would impose significant delays and risks of subtle errors onto the compiler.

Comment 2 Luke Kenneth Casson Leighton 2021-01-05 18:50:56 GMT

(In reply to Alexandre Oliva from comment #1)
> My recommendation doesn't involve changing loads or stores in any way.
> 
> My recommendation is that the svp64 iteration over sub-register vector
> elements obeys the in-memory array/vector layout.

ok in standard vector terminology this is named "unit strided", and the pseudocode is as follows (i am drastically simplifying, taking out elwidths, predication, LE/BE, everything, stripping it back to fundamental basics):

    function op_ld(rd, rs) # LD not VLD!
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # unit stride mode
        srcbase = ireg[rs] + i * 8; // assume 64 bit elwidth here
        ireg[rd+j] <= mem[srcbase + imm_offs];
        i++;
        j++;

note that that is not:

    function op_ld(rd, rs) # LD not VLD!
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # unit stride mode "LDR" mode
        #                    vvvvv element-inversion done here
        srcbase = ireg[rs] + (VL-1)-(i * 8);
        ireg[rd+j] <= mem[srcbase + imm_offs];
        i++;
        j++;

that would be NEON LDR as described here:
https://llvm.org/docs/BigEndianNEON.html



>  I.e., if you load a
> vector with smaller-than-64-bit elements into a register with a dword load,
> and iterate over them with svp64, you visit them in the same order you would
> if iterating over array elements in memory, and in the same order you would
> if the register held a struct with an array, and you iterated over the
> elements in it.

this example - 64-bit loads followed by placement into smaller-width elements - would result in truncation of the data, picking only the lower numbered bytes, context being that the regfile SRAM is defined as being LE numbered/ordered from 0 upwards, and the elements being indexed and ordered as LE from 0 upwards.

the pseudocode will therefore be as follows (assume src_elwidth=8 to indicate 64-bit reads):

    function op_ld(rd, rs, brev) # LD not VLD! (ldbrx if brev=True)
      for (int i = 0, int j = 0; i < VL && j < VL;):

        # unit stride mode, compute the address
        srcbase = ireg[rsv] + i * src_elwidth;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs 8-byte swap (because src_elwidth=8)
        if (bytereverse):
            memread = byteswap(memread, src-elwid)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(rd, dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;


> This means that while in data LE mode the iteration goes as you wrote under
> "One choice is", whereas in data BE mode the iteration goes as you wrote
> under "Another choice is".

ah.  this is incorrect (stemming likely from a misunderstanding of how unit-striding works).  the description that Cesar gave of "Another choice is" is in fact an 8-byte hard-coded BE numbering-reordering/meaning of the regfile SRAM (which we are not doing), we are doing it as:

* LE-ordered elements
* LE-numbered bytes in the underlying SRAM
* LSB0

taking Cesar's original elements, and giving LE "meaning" to the bytes *in* each element, by adding b0 and upwards to indicate them, this comes out as, if i can put it in tabular form:

  byte: 0      1       2     3       4     5       6     7
  bit:  0-7    8-15    16-23 24-31   32-39 40-47   48-55 56-63

 8-bit  L0     L1      L2    L3      L4    L5      L6    L7      8x 8bit
16-bit  {L0.b0 L0.b1} {L1.b0 L1.b1} {L2.b0 L2.B1} {L3.b0 L3.b1}  4x 16bit
32-bit  {L0.b0 L0.b1   L0.b2 L0.b3} {L1.b0 L1.b1   L1.b2 L1.b3}  2x 32bit
64-bit  {L0.b0 L0.b1   L0.b2 L0.b3   L0.b4 L0.b5   L0.b6 L0.b7}  1x 64bit

where within each element, b0 refers to LSByte numbering to mean LSByte and the meanings b1, b2, b3 should follow self-evidently from there.

this is how NEON is encoded: this is how we are doing it.


what Cesar describes under "the other choice" (the one that we are NOT doing) is that the numbering of each of the 8 bytes is inverted (BE-encoded):

  byte: 7      6       5     4       3     2       1     0
  bit:  56-63  48-55   40-47 32-39   31-24 16-23   8-15  0-7

(and the elements remain in the exact same positions: i.e. merely the top-level bit-numbering changes from LE to BE).

or, another way to view it: the byte/bit columns have stayed the same, but it is the ordering of the L* that has swapped (hard-coded anything in column 0 has moved to column 7 and vice-versa, 1 swapped with 6, 2 with 5 and 3 with 4).

let us call this (the one we are not doing) "MSByte0" or more specifically "MSByte0 ordering of 8x byte values" because in each batch of 8 bytes in the underlying SRAM, under "The other choice", it's the lowest-numbered byte that is the MSByte.


note that that is NOT:

        63-56 ......                              7-0

that is termed "MSB0 numbering" and we are NOT doing MSB0 numbering either ("upto" in VHDL terminology).


now.  let us take the case where memory is LDed.  this is going to be a bit challenging, let me think how to do it... ok i think i got it.  we'll define memory as byte-ordered:

   M0 M1 M2 M3 M4 M5 M6 M7.... Mnn

then, let us do a 64-bit src-width LD, unit strided, VL=2, and set dest-width=32.  let us also use a LD operation, and set MSB.LE=1.  this will DISABLE memory-level byteswapping (as expected).  our assignments then occur (using the way we *are* going to be doing the regfile) as:


* L0.b0 = M0
* L0.b1 = M1
* L0.b2 = M2
* L0.b3 = M3

(that covers element 0, where the 64-bit data M0-M7 was truncated as a LE-converted quantity, and stored into a 32-bit element - L0 - in LE byte order).

next:

* L1.b0 = M8
* L1.b1 = M9
* L1.b2 = M10
* L1.b3 = M11

this is element 1, where, again, the 64-bit memory read M8-M15 was truncated then stored into a 32-bit element - L1 - in LE byte order.


now, if you used "The Other choice", that would end up as:

* L0.b0 = M11
* L0.b1 = M10
* L0.b2 = M9
* L0.b3 = M8

* L1.b0 = M3
* L1.b1 = M2
* L1.b2 = M1
* L1.b3 = M0

which will, if we choose this meaning (MSByte0), become utterly confusing: far worse than IBM choosing MSB0.  Cesar describes this as "wiring" which is technically correct however it is a wiring that we know from experience of dealing with MSB0 is absolutely dreadful.  every access to the SRAM will require an insertion of hard-coded "7-idx" in front of it, or a hard-coded 8-byte reversal Cat(list.reverse()).

there is no need for any of that in the HDL if we simply pick LE array and arithmetic ordering, because that's what nmigen does naturally.

now, we *could* pick this numbering as one of the following:

* just a numbering designation (just some wires) but it's such hell to understand
  that we would be effectively smashing our heads against concrete walls and
  then trying to think (i.e. impossible)

* **ACTUAL** inversion - **ACTUAL** byte-inversion.  this would be, frankly,
  too stupid to even contemplate, as it would literally change the definition
  of the VL for-loop depending on the elwidths.

either way, neither of these things is happening.



leaving that aside, let us do BE on our chosen (NEON-like) regfile arrangement.  let us set:

* ld, LE=0 (BE mode), keep everything else the same: srcwid=64, dest=32, VL=2

in the way that we are doing the regfile - the way that NEON does it - this would be:

* L0.b0 = M7
* L0.b1 = M6
* L0.b2 = M5
* L0.b3 = M4

checking against this image from wikipedia:
https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Big-Endian.svg/250px-Big-Endian.svg.png

(needs adjusting to 64-bit but you get the idea).

because the loads are 64-bit, therefore M0 contained the MSByte of the 64-bit numerical value, and M7 contained the LSByte of that same numerical value.  thus, after truncation, L0.b0 contains M7 and so on.  *NOT* M3-M0 because that is the HIGH word of the underlying numerical 64-bit value.

correspondingly, when we get to element 1:

* L1.b0 = M15
* L1.b1 = M14
* L1.b2 = M13
* L1.b3 = M12


now we can finally see where the source of confusion about "The other choice" (MSByte0) is.  can you see that the ordering of the bytes with "The Other Choice" are absolutely *nothing* like those of the BE-ordered ld operation?

this despite the fact that the LD was a 64-bit LD.

i trust that this helps explain why NEON-like LE numbering for both elements and the bytes in the regfile has been chosen?

if that's clear then this bugreport can be closed as INVALID rather than DEFERRED, because there's nothing to change, no spec changes needed (only clarification).

Comment 3 Luke Kenneth Casson Leighton 2021-01-05 18:56:20 GMT

(In reply to Cesar Strauss from comment #0)
> First, note that I'm preemptively marking this as deferred, as I understand
> that we have a kind of feature-freeze right now.

yes.  only spec clarification, not changes to the spec (not at this fundamental level)
 
> We can choose arbitrarily, as long as we are consistent. One choice is:
> 8-bit:
> L[0] -> bits 0 to 7
> L[1] -> bits 8 to 15

this is, to my understanding, how NEON encodes elements within registers.  it's the "LE-byte-numbering, LE-element-numbering" scheme, and it's realistically the only sane way to do things.


> Another choice is:
> 8-bit:
> L[0] -> bits 56 to 63
> L[1] -> bits 48 to 55

this one is what i termed "MSByte0" after the way that IBM does MSB0 numbering.  it's extremely hard to understand, conceptualise and think about, and we should not do it.

MSB0 caused enough problems, requiring weeks of clarification requests to Paul Mackerras and others on the openpower-hdl-cores list, and resulted in months of confusion and bug-fixing of the order of months.

to select MSByte0 ordering of the regfile would literally destroy the entire project's viability.

so, we go with NEON-like encoding.

Comment 4 Cole Poirier 2021-01-05 19:41:57 GMT

(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Alexandre Oliva from comment #1)
> > My recommendation doesn't involve changing loads or stores in any way.
> > 
> > My recommendation is that the svp64 iteration over sub-register vector
> > elements obeys the in-memory array/vector layout.
> 
> ok in standard vector terminology this is named "unit strided", and the
> pseudocode is as follows (i am drastically simplifying, taking out elwidths,
> predication, LE/BE, everything, stripping it back to fundamental basics):
> 
> [snip]
>
> correspondingly, when we get to element 1:
> 
> * L1.b0 = M15
> * L1.b1 = M14
> * L1.b2 = M13
> * L1.b3 = M12
> 
> 
> now we can finally see where the source of confusion about "The other
> choice" (MSByte0) is.  can you see that the ordering of the bytes with "The
> Other Choice" are absolutely *nothing* like those of the BE-ordered ld
> operation?
> 
> this despite the fact that the LD was a 64-bit LD.
> 
> i trust that this helps explain why NEON-like LE numbering for both elements
> and the bytes in the regfile has been chosen?
> 
> if that's clear then this bugreport can be closed as INVALID rather than
> DEFERRED, because there's nothing to change, no spec changes needed (only
> clarification).

Luke, this is *much* clearer than everything that exists on bug #560. Long story short, if we want to catch up on the significant amount of work we lost because of RV, we *MUST* keep bug reports focused to a single issue, and *MUST* immediately migrate new and unrelated concerns to their own dedicated bug reports without resistance. It is very clear that you cannot cope otherwise, and this is highly detrimental to our productivity.

I think the above worked example explaining OP v3.0B unit striding is a very necessary clarification/example and should be added to the spec to make it clear for implementers, as given that it's taken a week to get to this point of clarity for us, I imagine the vast majority of implementers will get stung by this - potentially catastrophically.

Comment 5 Cesar Strauss 2021-01-06 08:53:58 GMT

(In reply to Luke Kenneth Casson Leighton from comment #2)

> if that's clear then this bugreport can be closed as INVALID rather than
> DEFERRED, because there's nothing to change, no spec changes needed (only
> clarification).

Agreed. Glad to have had the discussion. We can go back to it later, if needed.

Comment 6 Luke Kenneth Casson Leighton 2021-01-06 12:26:55 GMT

(In reply to Cole Poirier from comment #4)

> I think the above worked example explaining OP v3.0B unit striding is a very
> necessary clarification/example and should be added to the spec to make it
> clear for implementers,

already done, a long time ago.  see "polymorphic elwidth on load/store"
and "example tables shwing LOAD elements"

https://libre-soc.org/simple_v_extension/appendix/