Bug 560 - big-endian little-endian SV regfile layout idea
Summary: big-endian little-endian SV regfile layout idea
Status: RESOLVED WONTFIX
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on:
Blocks: 213
Reported: 2020-12-30 18:28 GMT by Luke Kenneth Casson Leighton
Modified: 2023-05-31 20:55 BST (History)
5 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-12-30 18:28:14 GMT
p25 of the v3.0B spec: the idea is to have regfiles swap LE/BE encoding for when VL vector loops run.
Comment 1 Luke Kenneth Casson Leighton 2020-12-30 18:28:57 GMT
note: copy what VSX did.

note: CRs are already encoded in BE format.
Comment 2 Luke Kenneth Casson Leighton 2020-12-30 18:32:17 GMT
lxo: my other issue is about mapping vector operations that operate on sub-register vector elements (say bytes) onto loops over insns, when there are *not* insns that operate on sub-register parts
Comment 3 Alexandre Oliva 2020-12-30 20:00:38 GMT
comment 2 is unrelated and was covered in the call: it was just noise from the earlier compressed-instruction requirement that our decoder had to map things to preexisting ppc insns.  this is not the case for the svp64 loop expander.

as for endianness of vectors, I think an important property to strive for is for the first element of a vector, when the register is used as a vector, to match the scalar version of the same register, when it's used as scalar, regardless of endianness.  things could get very confusing otherwise.

I imagine there may be reasons to do otherwise.  I haven't thought it through.
Comment 4 Jacob Lifshay 2020-12-30 20:26:25 GMT
We should try to make it so storing a vector of one type to memory then loading those same bytes as a vector of a different type can be optimized to just reinterpreting the in-register representation, no byte-swap instructions needed.
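A minimal python sketch of that property (illustrative only; LE byte order and made-up values assumed):

    import struct

    # store a vector of 4 x u16 to memory, then reload the same bytes
    # as a single u64: no byte-swaps needed, the "bitcast" is a pure
    # reinterpretation of the bytes
    raw = struct.pack("<4H", 0x0102, 0x0304, 0x0506, 0x0708)
    as_u64 = struct.unpack("<Q", raw)[0]
    assert as_u64 == 0x0708050603040102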
Comment 5 Alexandre Oliva 2020-12-30 22:08:34 GMT
the property I mentioned in comment 3 may seem a given for 64-bit ELWIDTH; I failed to mention here that the concern was about sub-register element types.

as for the property you wrote about in comment 4, jacob, I'm having some trouble figuring out just what you mean.  to illustrate, say you have an 8-byte string, or a 4x16-bit (x,y,z,w) tuple held in a 64-bit register.  now you store that register in memory, and load it back, as a vector of 8 bytes, or as a vector of 4 halfwords, onto a different register.  then you compare both registers as scalars, and they should compare equal, is that what you mean, or is there more to it, i.e., something about their maintaining the same relative positions regardless of endianness?


GCC seems to regard vector types just like arrays, when it comes to memory layout, so indexing it operates like indexing arrays.  this does mean, however, that loading the vectors above from memory into a scalar 64-bit register will land element [0] at opposite ends depending on endianness.  this is unavoidable as long as you retain the property of in-memory indexing equivalent to that of arrays, which also fits in with the notion of using neighboring memory areas for neighbor vector elements, as the natural expansion of a sequential for-loop vector load or store would use.
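a minimal python illustration of that last point, assuming a plain 64-bit scalar load:

    arr = bytes([1, 2, 3, 4, 5, 6, 7, 8])  # uint8_t arr[8] in memory
    le = int.from_bytes(arr, "little")      # arr[0] lands in the LSByte
    be = int.from_bytes(arr, "big")         # arr[0] lands in the MSByte
    assert le & 0xFF == 1 and (be >> 56) == 1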

this does suggest that, in order to maintain the property I suggested, the position and iteration order of sub-register elements may have to be affected by vl, or even by both vl and maxvl, depending on endianness.

making it concrete:

if you take a zero-initialized vector, and use a byte-load instruction svp64-prefixed with ELWIDTH=h ELWIDTH_SRC=default to load and zero-extend each byte of a string into a half-word, and then store the registers holding the vector in memory M[], should you get the string's bytes in M[odd] or M[even] bytes?  should it not depend on endianness?

now, if your maxvl is 3, which pair of consecutive bytes in memory is guaranteed to have zeros, M[0..1] or M[6..7]?

does the answer change if maxvl is 4, but vl is still 3?

what if they're both 7?

does any of this conflict with any of the desirable properties you and I brought up?
Comment 6 Luke Kenneth Casson Leighton 2020-12-30 22:23:41 GMT
right.

ok.

so.

CRs (which caused merry hell to implement) being the guide here, i am very reluctant to go down this route, not least because of the time pressure that we are under.

what i am inclined to suggest here is that any kind of in-register byteswapping be performed explicitly by using bitmanip operations.  i started adding some of those at the sv/bitmanip page.

the issue is that we didn't think of this 6-12 months ago, it's only just come up, and there's no spare encoding space.

effectively, considering the regfiles as an SRAM and allowing the data within them to be either LE or BE encoded is something that needs its own dedicated MSR bit.

it's just not "normal practice", plus, to be honest, LDST already has LE/BE and byte-reverse.  if you really want the data in the registers to be inverted, call the bitmanip byte-reverse opcode or push the data out through memory and back.

on balance i am very much disinclined to add this in any way, unless there is a seriously compelling use-case.
Comment 7 Jacob Lifshay 2020-12-30 22:28:28 GMT
(In reply to Alexandre Oliva from comment #5)
> the property I mentioned in comment 3 may seem a given for 64-bit ELWIDTH; I
> failed to mention here that the concern was about sub-register element types.
> 
> as for the property you wrote about in comment 4, jacob, I'm having some
> trouble figuring out just what you mean.  to illustrate, say you have an
> 8-byte string, or a 4x16-bit (x,y,z,w) tuple held in a 64-bit register.  now
> you store that register in memory, and load it back, as a vector of 8 bytes,
> or as a vector of 4 halfwords, onto a different register.  then you compare
> both registers as scalars, and they should compare equal, is that what you
> mean, or is there more to it, i.e., something about their maintaining the
> same relative positions regardless of endianness?

a concrete example:
if the memory at *r3 is:
0x01 0x23 0x45 0x67  0x89 0xAB 0xCD 0xEF
0xFE 0xDC 0xBA 0x98  0x76 0x54 0x32 0x10

setvli r0, 16
ldb r16.v, (r3.s)
setvli r0, 2
std r16.v, (r3.s)

should overwrite the memory at *r3 with the same bytes that were already there.

if the processor is in big-endian mode:
r16 == 0x0123456789ABCDEF
r17 == 0xFEDCBA9876543210

if the processor is in little-endian mode:
r16 == 0xEFCDAB8967452301
r17 == 0x1032547698BADCFE
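
(a quick python check of those values, modelling each 64-bit register as 8 consecutive memory bytes interpreted in the processor's data endianness:)

    mem = bytes([0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
                 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10])
    for order in ("big", "little"):
        r16 = int.from_bytes(mem[0:8], order)
        r17 = int.from_bytes(mem[8:16], order)
        print(order, hex(r16), hex(r17))
    # big:    0x123456789abcdef  0xfedcba9876543210
    # little: 0xefcdab8967452301 0x1032547698badcfe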
Comment 8 Jacob Lifshay 2020-12-30 22:29:52 GMT
basically, it will make the in-memory layout of vector types identical to the in-register layout.
Comment 9 Jacob Lifshay 2020-12-30 22:34:52 GMT
I consider CRs weird enough (since they're rarely stored in memory) that they shouldn't be used to decide our in-register layout of vector types, instead, we should strive for consistency between registers and memory, since that makes a bitcast, which is commonly assumed to be zero-cost, actually be zero cost.
Comment 10 Luke Kenneth Casson Leighton 2020-12-30 22:48:21 GMT
(In reply to Alexandre Oliva from comment #5)

> if you take a zero-initialized vector, and use a byte-load instruction
> svp64-prefixed with ELWIDTH=h ELWIDTH_SRC=default to load and zero-extend
> each byte of a string into a half-word, and then store the registers holding
> the vector in memory M[], should you get the string's bytes in M[odd] or
> M[even] bytes?  should it not depend on endianness?

yes... by respecting the MSR LE/BE bit and respecting whether the bytereverse version of LD/ST was called or not.

with the weird ordering-dyslexia that i have i am not really the best person to definitively say "it should be odd or even".

however what i _can_ say is that this is a good point, needs close evaluation, and interaction with the sign-extension, elwidth overrides and also saturation is going to have to be thought through carefully.

i do however expect it to be straightforward and self-evident.

and, i do have to say this: *not* then massively complicating things by adding the extra dimension of the regfile itself being allowed to byteswap will make that evaluation one hell of a lot easier.


> now, if your maxvl is 3, which pair of consecutive bytes in memory is
> guaranteed to have zeros, M[0..1] or M[6..7]?

with VL counting from 0, and with bytes in the regfile SRAM matching that (i.e. NOT MASSIVELY COMPLICATING THINGS by having regfile byteswapping), the answer needs a walkthrough.

reminder: byte-load instruction svp64-prefixed with ELWIDTH=h ELWIDTH_SRC=default 

there is missing information here.  we assume LE processor mode.  also we assume unit-strided Vector LD.

memory is:

0 1 2 3 4 5 6 7
NNMMOOPPQQRRSSTT.. 

this will be:

* load one byte (elwidth src is default)
  at address RA+0
* zero-extend 8 bit to 16 bit
* data is therefore 0x00NN
* vstart=0
* hword containing data goes
  into int_regfile[RT].h[0]

WE ASSUME LE SRAM ON REGFILE BECAUSE OTHERWISE IT IS TOTAL HELL (and needs a usecase, and needs encoding space, and needs full evaluation which we don't realistically have time for)

byte 0 of RT: NN
byte 1 of RT: 00

next, vstart=1

* load next byte (elwidth src is default)
  at address RA+1
* zero-extend to 16 bit
* data is therefore 0x00MM
* vstart=1
* hword containing data goes
  into int_regfile[RT].h[1]

byte 0 of RT: NN
byte 1 of RT: 00
byte 2 of RT: MM
byte 3 of RT: 00

and for vstart=2 it should be clear.


> does the answer change if maxvl is 4, but vl is still 3?

no.

> 
> what if they're both 7?

byte 0 of RT: NN
byte 1 of RT: 00
byte 2 of RT: MM
byte 3 of RT: 00
byte 4 of RT: OO
byte 5 of RT: 00
byte 6 of RT: PP
byte 7 of RT: 00
# now we have crossed over a 64 bit boundary
# into the next register, RT+1
byte 0 of RT+1: QQ
byte 1 of RT+1: 00
byte 2 of RT+1: RR
byte 3 of RT+1: 00
byte 4 of RT+1: SS
byte 5 of RT+1: 00
byte 6 of RT+1: unmodified
byte 7 of RT+1: unmodified

if that is not the desired layout, the simple solution: call a bitmanip byteswapping opcode.  vectorised of course.
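
a python sketch of the above walkthrough, with the regfile SRAM modelled as one flat LE bytearray (the 0xAA.. values are arbitrary stand-ins for NN, MM, OO...):

    regfile = bytearray(32 * 8)  # flat LE regfile SRAM, 8 bytes per reg
    MEM = [0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x11]  # NN MM OO PP QQ RR SS
    RT = 16                      # arbitrary destination register
    for vstart in range(7):      # VL=7
        data = MEM[vstart]       # load one byte at RA+vstart
        hword = data & 0xFF      # zero-extend 8-bit to 16-bit
        offs = RT * 8 + vstart * 2  # int_regfile[RT].h[vstart], LE order
        regfile[offs:offs + 2] = hword.to_bytes(2, "little")
    # byte 0 of RT is NN, byte 1 is 00, byte 2 is MM, ... and elements
    # 4..6 spill over the 64-bit boundary into bytes 0..5 of RT+1
    assert regfile[RT * 8] == 0xAA and regfile[RT * 8 + 1] == 0x00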
Comment 11 Luke Kenneth Casson Leighton 2020-12-30 23:25:15 GMT
(In reply to Jacob Lifshay from comment #8)
> basically, it will make the in-memory layout of vector types identical to
> the in-register layout.

which i already had such insane difficulty getting right that modifying it in the least bit is categorically a bad idea.  i literally had to try all possible permutations of byteswapping until the unit tests happened to pass.  i do not wish to go through that again.
Comment 12 Alexandre Oliva 2020-12-30 23:29:52 GMT
hmm...  jacob wrote:

> basically, it will make the in-memory layout of vector types identical to the in-register layout.

which is at odds with what luke wrote:

> WE ASSUME LE SRAM ON REGFILE BECAUSE OTHERWISE IT IS TOTAL HELL

now, I don't know what LE SRAM really means.  to me, the register file has no endianness whatsoever; each register holds a number of bits, from least to most significant, that get ordered one way or another, depending on hardware configuration, when storing the register in memory, or when loading it from memory.  memory has endianness, because it's external to the cpu, and it can be accessed and addressed in different granularities; registers don't, because they are conceptually just a collection of flip-flops that are operated on as a coherent unit.  arguing about their bit order is like arguing whether the flip-flops have to be ordered left-to-right or right-to-left.  it doesn't matter, as long as the wiring connects each bit to the right place.  the relevant concept is significance (from least to most significant), not lowest or highest address, which applies to memory order.

having expressed this distinction on how I reason, hopefully it will be clear why your response is not an answer to my question.  you don't get down to the memory addresses, you just number the bytes of the registers, which might be ordered by significance, or by corresponding memory address if stored.

now, does assuming "LE processor mode" imply we don't care about BE at all?  that would remove a lot of the potential complications I was running up against.

or are you talking about code vs data endianness? (IIRC PPC can choose them separately)


anyway, *my* suggestion was just that the iteration order over sub-register vector elements made it so that the first element matched the register when used as a scalar, so as to avoid the risk that loading a single byte, using a vector-prefixed scalar load, lands it in the MSByte of the register, unlike the byte-load instruction.  alas, with big endian, it looks like the natural whole-register array layout would map the first element to the most-significant end of the register instead.
Comment 13 Luke Kenneth Casson Leighton 2020-12-30 23:31:44 GMT
(In reply to Jacob Lifshay from comment #9)
> I consider CRs weird enough (since they're rarely stored in memory) that
> they shouldn't be used to decide our in-register layout of vector types,

very much agreed.

> instead, we should strive for consistency between registers and memory,
> since that makes a bitcast, which is commonly assumed to be zero-cost,
> actually be zero cost.

no.  the code works right now.  it gets things right, and it's compliant with OpenPOWER.

making change for changes sake has a cost (that we cannot afford).

you need to provide a clear use-case such as "10% performance increase will result with this change which is used in NN% of code and consequently it has high value"

a bitmanip bytereverse opcode or simply using the appropriate bytereverse ld should be more than enough.  as should swizzle, in some cases.
Comment 14 Jacob Lifshay 2020-12-30 23:40:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #9)
> > I consider CRs weird enough (since they're rarely stored in memory) that
> > they shouldn't be used to decide our in-register layout of vector types,
> 
> very much agreed.
> 
> > instead, we should strive for consistency between registers and memory,
> > since that makes a bitcast, which is commonly assumed to be zero-cost,
> > actually be zero cost.

in SIMD code, bitcasts are actually very common, easily several percent of operations. I can't say how much that will translate to SV, but it is significant for performance.
Comment 15 Jacob Lifshay 2020-12-30 23:51:18 GMT
(In reply to Alexandre Oliva from comment #12)
> hmm...  jacob wrote:
> 
> > basically, it will make the in-memory layout of vector types identical to the in-register layout.
> 
> which is at odds with what luke wrote:
> 
> > WE ASSUME LE SRAM ON REGFILE BECAUSE OTHERWISE IT IS TOTAL HELL
> 
> now, I don't know what LE SRAM really means.  to me, the register file has
> no endianness whatsoever; each register holds a number of bits, from least
> to most significant, that get ordered one way or another, depending on
> hardware configuration, when storing the register in memory, or when loading
> it from memory.  memory has endianness, because it's external to the cpu,
> and it can be accessed and addressed in different granularities; registers
> don't, because they are conceptually just a collection of flip-flops that
> are operated on as a coherent unit.  arguing about their bit order is like
> arguing whether the flip-flops have to be ordered left-to-right or
> right-to-left.  it doesn't matter, as long as the wiring connects each bit
> to the right place.  the relevant concept is significance (from least to
> most significant), not lowest or highest address, which applies to memory
> order.

byte order *is* significant in registers precisely because we can treat them as an indexed vector of bytes by using vector u8 instructions. having that vector of bytes match the vector of bytes in memory is important for performance and consistency, since otherwise we will have to insert tons of byte-swap instructions for memory-order bitcasting that would otherwise be totally unneeded.
> 
> anyway, *my* suggestion was just that the iteration order over sub-register
> vector elements made it so that the first element matched the register when
> used as a scalar, so as to avoid the risk that loading a single byte, using
> a vector-prefixed scalar load, lands it in the MSByte of the register,
> unlike the byte-load instruction.  alas, with big endian, it looks like the
> natural whole-register array layout would map the first element to the
> most-significant end of the register instead.

As far as I know, the plan is for scalar subvl=1 sources and destinations to read and write full registers, not just the first or last byte/u16/u32/u64. This is similar to how divwu writes the whole dest register even though the result is only 32-bits.

for vector registers and/or subvl>1, they use only the part of the registers that correspond to indexes<VL.
Comment 16 Luke Kenneth Casson Leighton 2020-12-30 23:55:52 GMT
(In reply to Alexandre Oliva from comment #12)

> now, I don't know what LE SRAM really means. 

see the union datastructure.
https://libre-soc.org/openpower/sv/overview/#elwidths


> to me, the register file has
> no endianness whatsoever; 

in a scalar system you would be absolutely correct.

with the elwidth overrides effectively being a union of

uint8_t []
uint16_t []
uint32_t []
uint64_t []

now the underlying order *does* matter.

if you set b[1] to 0x55, does that mean that w[0] contains 0xnnnn55nn?

if the underlying SRAM is treated as LE the answer is YES.

if the underlying SRAM of the regfile is treated as "a direct representation of memory" i will be LITERALLY unable to cope with the complexity introduced due to MASSIVE confusion as to what is now in which order.
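
python equivalent of the LE answer, treating the register's 8 SRAM bytes as the union's backing store:

    reg = bytearray(8)                       # one 64-bit register
    reg[1] = 0x55                            # b[1] = 0x55
    w0 = int.from_bytes(reg[0:4], "little")  # w[0] under LE SRAM rules
    assert w0 == 0x00005500                  # 0xnnnn55nn, with nn=00 here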




> each register holds a number of bits, from least
> to most significant, that get ordered one way or another, depending on
> hardware configuration, when storing the register in memory, or when loading
> it from memory. 

yes.  and OpenPOWER has some strict rules that took 5 months to correctly implement.

i am simply not going through that again.


> memory has endianness, because it's external to the cpu,
> and it can be accessed and addressed in different granularities; registers
> don't, because they are conceptually just a collection of flip-flops that
> are operated on as a coherent unit

in a scalar system yes.

in any other vector system, yes.

in our extremely unique system which is effectively a union typecast of different c style arrays with an underlying uint8_t[8] array...

... no.


>  arguing about their bit order is like
> arguing whether the flip-flops have to be ordered left-to-right or
> right-to-left.  it doesn't matter, as long as the wiring connects each bit
> to the right place. 

again: this is correct for scalar ISAs, correct for normal vector ISAs, and completely wrong for SV due to us using the 64 bit scalar regfile to store sequential array overloads.


> having expressed this distinction on how I reason, hopefully it will be
> clear why your response is not an answer to my question. 

it allows me to highlight the error in that reasoning, yes.

> you don't get down
> to the memory addresses, you just number the bytes of the registers, which
> might be ordered by significance, 

no.  a decision has to be made as to which order the uint8[] array elwidth override will place bytes into the underlying 8 bytes of the register.

this categorically and definitively decides in which order the byte-level WRITEENABLE lines will be pulled and numbered: backwards or forwards.

i CANNOT COPE if those numbers are calculated by forcibly subtracting them from 7 in some cases, 3 in others, 1 for halfwords.

it's just an absolute recipe for disaster.

therefore the ordering *IS* LE SRAM order.



> or by corresponding memory address if
> stored.
> 
> now, does assuming "LE processor mode" imply we don't care about BE at all? 
> that would remove a lot of the potential complications I was running up
> against.

the decision has already been made to support Scalar OpenPOWER LE/BE and the code is already in the repository with unit tests passing.

due to it taking 5 months to spot bugs i do not want to touch it again without a damn good reason.
 
> or are you talking about code vs data endianness? (IIRC PPC can choose them
> separately)

it can.  we are not.

> 
> anyway, *my* suggestion was just that the iteration order over sub-register
> vector elements made it so that the first element matched the register when
> used as a scalar,

which is achieved by setting LE SRAM order in the underlying SRAM of the regfile.

> so as to avoid the risk that loading a single byte, using
> a vector-prefixed scalar load, lands it in the MSByte of the register,
> unlike the byte-load instruction.  alas, with big endian, it looks like the
> natural whole-register array layout would map the first element to the
> most-significant end of the register instead.

correct.  hence why i consider it to be a dangerous idea, and had already naturally assumed that the SRAM would be LE, such that the c-based union would be an accurate and definitive representation.
Comment 17 Jacob Lifshay 2020-12-30 23:57:21 GMT
So, assuming r4==0x0102030405060708:
li r3, 5
setvli r0, 4
add r4.v, r3.s, r4.v, subvl=1, elwidth=8-bit

produces on little endian:
r4==0x010203040A0B0C0D

produces on big endian:
r4==0x0607080905060708
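
(a python check of both results, under the registers-act-like-memory model:)

    r4 = 0x0102030405060708
    for order in ("little", "big"):
        b = bytearray(r4.to_bytes(8, order))  # register viewed as memory bytes
        for i in range(4):                    # VL=4, elwidth=8-bit
            b[i] = (b[i] + 5) & 0xFF          # add r3 (r3 == 5) to element i
        print(order, hex(int.from_bytes(b, order)))
    # little: 0x10203040a0b0c0d
    # big:    0x607080905060708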
Comment 18 Jacob Lifshay 2020-12-31 00:03:14 GMT
(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Alexandre Oliva from comment #12)
> 
> > now, I don't know what LE SRAM really means. 
> 
> see the union datastructure.
> https://libre-soc.org/openpower/sv/overview/#elwidths
> 
> 
> > to me, the register file has
> > no endianness whatsoever; 
> 
> in a scalar system you would be absolutely correct.
> 
> with the elwidth overrides effectively being a union of
> 
> uint8_t []
> uint16_t []
> uint32_t []
> uint64_t []
> 
> now the underlying order *does* matter.
> 
> you set b[1] to 0x55 does that mean that w[0] contains 0xnnnn55nn?
> 
> if the underlying SRAM is treated as LE the answer is YES.
> 
> if the underlying SRAM of the regfile is treated as "a direct representation
> of memory" i will be LITERALLY unable to cope with the complexity introduced
> due to MASSIVE confusion as to what is now in which order.

it means the registers act just like memory, so little-endian gives w[0] == 0xnnnn55nn, and big endian gives w[0] == 0xnn55nnnn

> > so as to avoid the risk that loading a single byte, using
> > a vector-prefixed scalar load, lands it in the MSByte of the register,
> > unlike the byte-load instruction.  alas, with big endian, it looks like the
> > natural whole-register array layout would map the first element to the
> > most-significant end of the register instead.
> 
> correct.  hence why i consider it to be a dangerous idea, and had already
> naturally assumed that the SRAM would be LE, such that the c-based union
> would be an accurate and definitive representation.

In my proposal it *is* still the completely accurate and definitive version, just that the C union for registers is compiled with the same endianness as memory.
Comment 19 Jacob Lifshay 2020-12-31 00:12:23 GMT
(In reply to Luke Kenneth Casson Leighton from comment #16)
> if the underlying SRAM of the regfile is treated as "a direct representation
> of memory" i will be LITERALLY unable to cope with the complexity introduced
> due to MASSIVE confusion as to what is now in which order.

How about this:

we just define SV to trap when not in little-endian mode for now; we can then take our time to figure out which way the registers go in big-endian mode and implement it later.

Sound like a good idea?
Comment 20 Luke Kenneth Casson Leighton 2020-12-31 00:14:53 GMT
(In reply to Jacob Lifshay from comment #15)

> byte order *is* significant in registers precisely because we can treat them
> as an indexed vector of bytes by using vector u8 instructions. having that
> vector of bytes match the vector of bytes in memory is important for
> performance and consistency, since otherwise we will have to insert tons of
> byte-swap instructions for memory-order bitcasting that would otherwise be
> totally unneeded.

no, you just use either ld-reverse or not.  the ld and st operation takes care of the bytereversing, that's why it was added to OpenPOWER.

lhbrx and friends. p60 v3.0B spec.

we simply vectorise these, as-is.


> As far as I know, the plan is for scalar subvl=1 sources and destinations to
> read and write full registers, not just the first or last byte/u16/u32/u64.

... i almost thought you were saying the opposite here.

the rule is: elwidth=default means "do what OpenPOWER v3.0B normally does, and do it EXACTLY and to the absolute letter of the spec".

therefore, in the divw example jacob gave: yes, the top 32 bits would be set to zero 

however if you override elwidth=32 then *even when VL=1* the top 32 bits WILL NOT be overwritten.

why?

because elwidth=32 is a SPECIFIC and direct command to the hardware to set the underlying regfile SRAM write-enable lines  to 0b00001111

for elwidth=default those same lines are *categorically* and *always* wired to 0b11111111
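
a sketch of that derivation, assuming LE SRAM with element 0 at byte 0 (illustrative python, not the actual RTL):

    def write_enables(elwidth_bytes, element):
        # byte-level write-enable mask for writing one element
        return ((1 << elwidth_bytes) - 1) << (element * elwidth_bytes)

    assert write_enables(4, 0) == 0b00001111  # elwidth=32, element 0
    assert write_enables(8, 0) == 0b11111111  # elwidth=default (64-bit)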
Comment 21 Luke Kenneth Casson Leighton 2020-12-31 00:19:41 GMT
(In reply to Jacob Lifshay from comment #19)

> How about this:
> 
> we just define SV to trap when not in little-endian mode for now,

that punishes BE and makes it a second-rate citizen.

which in turn completely cuts us off from the Industrial markets in China, Japan and India, where many Industrial Control Systems are still using VME Bus and 68000 family processors.

lhbrx.  vectorised.  data loaded in correct order.  solved.
Comment 22 Luke Kenneth Casson Leighton 2020-12-31 00:23:46 GMT
(In reply to Jacob Lifshay from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #16)
> > (In reply to Alexandre Oliva from comment #12)
> > 
> > > now, I don't know what LE SRAM really means. 
> > 
> > see the union datastructure.
> > https://libre-soc.org/openpower/sv/overview/#elwidths
> > 
> > 
> > > to me, the register file has
> > > no endianness whatsoever; 
> > 
> > in a scalar system you would be absolutely correct.
> > 
> > with the elwidth overrides effectively being a union of
> > 
> > uint8_t []
> > uint16_t []
> > uint32_t []
> > uint64_t []
> > 
> > now the underlying order *does* matter.
> > 
> > you set b[1] to 0x55 does that mean that w[0] contains 0xnnnn55nn?
> > 
> > if the underlying SRAM is treated as LE the answer is YES.
> > 
> > if the underlying SRAM of the regfile is treated as "a direct representation
> > of memory" i will be LITERALLY unable to cope with the complexity introduced
> > due to MASSIVE confusion as to what is now in which order.
> 
> it means the registers act just like memory, so little-endian gives w[0] ==
> 0xnnnn55nn, and big endian gives w[0] == 0xnn55nnnn
>

exactly.  which is precisely what i cannot cope with: it requires quite literally a brute-force search on permutations of parameters when writing code, in order to get it right.

please don't push any more on this jacob.  please accept that if i cannot cope, and get confused, there is no way i will be able to confidently explain it and justify it to the OPF ISA WG.
Comment 23 Jacob Lifshay 2020-12-31 00:27:31 GMT
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Jacob Lifshay from comment #15)
> 
> > byte order *is* significant in registers precisely because we can treat them
> > as an indexed vector of bytes by using vector u8 instructions. having that
> > vector of bytes match the vector of bytes in memory is important for
> > performance and consistency, since otherwise we will have to insert tons of
> > byte-swap instructions for memory-order bitcasting that would otherwise be
> > totally unneeded.
> 
> no, you just use either ld-reverse or not.  the ld and st operation takes
> care of the bytereversing, that's why it was added to OpenPOWER.

no, you don't. bitcasting is a register reinterpret operation (register to register), using load/store operations to implement bitcasting is slow and wasteful (unless you needed to load/store anyway).

on LLVM, bitcasting usually compiles to no instructions, or rarely a register to register move instruction.

> > As far as I know, the plan is for scalar subvl=1 sources and destinations to
> > read and write full registers, not just the first or last byte/u16/u32/u64.
> 
> ... i almost thought you were saying the opposite here.
> 
> the rule is: elwidth=default means "do what OpenPOWER v3.0B normally does,
> and do it EXACTLY and to the absolute letter of the spec".
> 
> therefore, in the divw example jacob gave: yes, the top 32 bits would be set
> to zero 
> 
> however if you override elwidth=32 then *even when VL=1* the top 32 bits
> WILL NOT be overwritten.
> 
> why?
> 
> because elwidth=32 is a SPECIFIC and direct command to the hardware to set
> the underlying regfile SRAM write-enable lines  to 0b00001111

Well, I interpret elwidth=32 as a specific command to mean we operate on 32-bit values, so scalars are truncated/sign/zero-extended to/from 32-bits when reading/writing registers.

scalar registers are a *totally different kind* of argument, they are *not vectors*, so therefore are treated like is usual in the scalar instruction set -- registers are treated as a single value.

vectors (scalar with subvl!=1 counts as a kind of fixed-length vector in my mind) treat registers as an array of elements, not as a single value.
Comment 24 Alexandre Oliva 2020-12-31 01:58:55 GMT
> now the underlying order *does* matter.

exactly!  that's why we're debating the iteration order.

see, you're talking about an array of uint8_t, so let's take it from there.

uint8_t foo[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

; r3 points to foo
  ld r4, 0(r3)
  ld r5, 8(r3)
  setvli r0, 16
  svp64 elwidth_src=8-bit mv r6.s, r4.v  

should r6 be 1, or 8, or should it depend on endianness?

if we're to go by the array layout model in memory, it ought to be 1, which means that, in big-endian mode, the vector iteration order within the register should go from most to least significant, whereas in little-endian mode, it should go the opposite way.  this would maintain the array layout equivalence, and it makes perfect sense when you think of how bytes are laid out in memory in each endianness: little endian means least significant first, big endian means most significant first.  iterating in that order is just natural and expected.

now, if we were to iterate over sub-register types always from least to most significant, then we're effectively reversing the order from the expected memory layout.  IOW, we're visiting first 8, then 7, ... then 1, then 16, then 15, ..., then 9.  are you sure this is what you want?

BTW, should the load sequence into r4..r5 above be equivalent to:

  setvli r0, 2
  svp64 elwidth=64-bit, elwidth_src=64-bit ld r4,0(r3)

and should that really get the reversed byte order in the registers that, in big-endian mode, you say we'd get with:

  setvli r0, 16
  svp64 elwidth=8-bit, elwidth_src=8-bit ld r4,0(r3)

?


the point is, we have to make a choice here.  do we choose

a) compatibility with the memory/data endianness selected for the system, and set the iteration order in sub-register vector elements to match, or 

b) a preferential endianness for such vectors that breaks the expected identity between the two vector loads above, and that forces either

b.1) memory layout of vector types to be shuffled for big endian to match the reversed layout of register-wise stores and loads (which would make the order *wrong* for element-wise stores and loads), or 

b.2) the use of only sub-vector element width even for loads and stores, in big-endian mode, because full-register loads and stores would get them in the reverse order within each register?
Comment 25 Jacob Lifshay 2020-12-31 02:15:45 GMT
(In reply to Alexandre Oliva from comment #24)
> the point is, we have to make a choice here.  do we choose
> 
> a) compatibility with the memory/data endianness selected for the system,
> and set the iteration order in sub-register vector elements to match, or 

I think option a makes waay more sense. it also *does* give a performance advantage since bitcasts compile to 0 instructions.

the code Alexandre wrote should *always* put 1 in r6, even if the high bytes of r6 weren't zero before (combining both scalar-means-full-register with memory-endian for registers).
Comment 26 Alexandre Oliva 2020-12-31 02:28:10 GMT
jacob, now add 0x80 to each of the vector elements and tell me whether r6 should end as 0x81 or -1 ;-)

(i.e., zero or sign extension :-)
Comment 27 Jacob Lifshay 2020-12-31 02:39:35 GMT
(In reply to Alexandre Oliva from comment #26)
> jacob, now add 0x80 to each of the vector elements and tell me whether r6
> should end as 0x81 or -1 ;-)
> 
> (i.e., zero or sign extension :-)

that depends on the instruction (i can't recall if we added a signed/unsigned bit to svp64; we should). For mv encoded using add, it would be sign extension (i guess), for mv encoded using or, it would be zero extension.
Comment 28 Luke Kenneth Casson Leighton 2020-12-31 05:14:49 GMT
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> > (In reply to Jacob Lifshay from comment #15)
> > 
> > > byte order *is* significant in registers precisely because we can treat them
> > > as an indexed vector of bytes by using vector u8 instructions. having that
> > > vector of bytes match the vector of bytes in memory is important for
> > > performance and consistency, since otherwise we will have to insert tons of
> > > byte-swap instructions for memory-order bitcasting that would otherwise be
> > > totally unneeded.
> > 
> > no, you just use either ld-reverse or not.  the ld and st operation takes
> > care of the bytereversing, that's why it was added to OpenPOWER.
> 
> no, you don't. bitcasting is a register reinterpret operation (register to
> register), using load/store operations to implement bitcasting is slow and
> wasteful (unless you needed to load/store anyway).

ah i was confused by the mention of "memory", i thought you were exclusively referring to memory-to-register transfers.

> on LLVM, bitcasting usually compiles to no instructions, or rarely a
> register to register move instruction.

here, is there any reason why bitmanip would be insufficient?

also: we need use-cases to justify the time drain spent doing a comprehensive evaluation.


> > however if you override elwidth=32 then *even when VL=1* the top 32 bits
> > WILL NOT be overwritten.
> > 
> > why?
> > 
> > because elwidth=32 is a SPECIFIC and direct command to the hardware to set
> > the underlying regfile SRAM write-enable lines  to 0b00001111
> 
> Well, I interpret elwidth=32 as a specific command to mean we operate on
> 32-bit values, so scalars are truncated/sign/zero-extended to/from 32-bits
> when reading/writing registers.
> 
> scalar registers are a *totally different kind* of argument, they are *not
> vectors*, 

they are: look at the pseudocode.  they're "degenerate vectors of length equal to one".

i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1 then SV is disabled, and a different codepath followed that goes exclusively to a scalar-only OpenPOWER v3.0B compliant codepath"

this categorically and fundamentally is NOT the case.

i comprehensively analysed and rejected this as far back as... 2018? during the first few weeks/months of developing SV, and documented it, implemented it in spike-sv and added unit tests that implemented the behaviour described in the previous comment.

the behaviour is:

    if VL==1 &&
       SUBVL==1 &&
       ELWIDTH==default && 
       predication==all1s && 
       all_other_features()==default
    then
       behaviour is identical to that
       of scalar but only because
       all for-loops are length 1 and
       all augmentations are "off".

there *is* no separate scalar code-path.  not in the hardware, and not in the simulator.

there *is* only the for-loops, exactly as outlined in the pseudocode.

this is really fundamental and important to grasp


> so therefore are treated like is usual in the scalar instruction
> set -- registers are treated as a single value.

again: look at the pseudocode, and the riscv-isa-tests (sv branch)

over 18 months ago i implemented SV as exactly as in the pseudocode.

it's always been this way: that scalars are "degenerate VL=1".

i also mention it in the notes, and the spec: SV is never really "off", it's just that the default settings *look* like the original unaugmented scalar ISA.

that's what the "scalar identity behaviour" is.  it's "the settings which make SV one-for-one equivalent to scalar behaviour".

one change - just one - and that no longer holds (and the result is pretty disastrous, see below)

to implement what you believe to be the case (which hasn't ever been the case, not in the entire development of SV) actually requires special violation of that behaviour! it will need a special exemption that will actually increase gate count to implement!

more than that, it actually prevents and prohibits desired and desirable behaviour.

what you are effectively expecting is that if VL=1, elwidth is ignored (treated as default, regardless of its actual value) because, well, VL=1 and that's scalar, right?

what if during some algorithm VL happens to be set to 1?   a loop counter happens to become set to 1?

this will result in elwidth being ignored, and catastrophic data corruption will occur: what should have been an 8 bit operation (had VL been 2) now becomes a 64 bit operation just because VL=1.

to avoid that case, it would *specifically* require a test, inside the inner loop (the worst possible place), looking for the case where VL=1 and deploying SIMD-style cleanup, as described below.


what if it is specifically desired to modify only the first byte of a 64 bit register? (including for the loop when RA happens to be 1 on a setvl call)

with the current behaviour, this is dead simple: set VL=1, set elwidth=8

done.

however with the change that you propose / expect, the following (expensive, intrusive) tricks must be played:

* read mask-modify write
 - copy the entire 64 bit register
 - use rlwimi to insert the byte
 - write the entire 64 bit register

 compared to "set elwidth=8" this is staggeringly expensive

* use predication
  - push any current predication source
    on the stack or into a temp reg
  - set the predicate to 0b01
  - set VL=2, elwidth=8
  - perform the operation
  - restore the old predicate

 this is even more wasteful, because it is 100% guaranteed that the 2nd Function Unit will be allocated even for a short duration whilst the predicate register is being read.  a few cycles later the 2nd bit is discovered to be zero and Shadow Cancellation kicks in, but for those few cycles, that FU is out of commission.

* use swizzle

  .... i won't go into this one because again it should be clear by now the reasons why i rejected the idea of considering "VL!=1" to be the exclusive, sole guiding factor in enabling SV.

bottom line: elwidth overrides have merit and purpose on their own, regardless of what the value of VL or SUBVL is.

> vectors (scalar with subvl!=1 counts as a kind of fixed-length vector in my
> mind) treat registers as an array of elements, not as a single value.

yes: with SUBVL acting as a sub-sub-PC, apart from the difference for predication, setting VL=1,SUBVL=3 is near-as-dammit the same as VL=3,SUBVL=1.  it's not, but you know what i mean.

consequently yes, VL=1,SUBVL>1 "makes sense" as "being a vector".

however this is very misleading.

fundamentally you need to make an adaptation to "SV is *nevvvverrrr* switched off".  no feature of SV - not VL, not SUBVL, not ELWIDTH, not predication, is truly "off".

they are all independent, and they all have default values.

* predication: all 1s
* elwidth: default-for-instruction
* VL: 1
* SUBVL: 1
Comment 29 Luke Kenneth Casson Leighton 2020-12-31 05:24:56 GMT
(In reply to Jacob Lifshay from comment #27)

> that depends on the instruction (icr if we added a signed/unsigned bit to
> svp64, we should).

tired.  brief.  considered it.  rejected it.  OpenPOWER has extsw.  therefore compilers spit out code that groks extsw.  vectorisation of that: dead straightforward.

change to add s/u mode needs big RTL change, entire ALUs complete rewrite. intrusive compiler behavioural change.  not happy.  don't have time.  too big a change, too intrusive, throws away months of work. therefore, reject.

saturation on the other hand: s/u added.  why? because not included in OpenPOWER scalar, therefore *we* define rules, and define them as a post-processing phase just before CR testing.  not intrusive, incremental change, works well.
Comment 30 Jacob Lifshay 2020-12-31 06:27:00 GMT
(In reply to Luke Kenneth Casson Leighton from comment #28)
> (In reply to Jacob Lifshay from comment #23)
> > (In reply to Luke Kenneth Casson Leighton from comment #20)
> > > (In reply to Jacob Lifshay from comment #15)
> > > 
> > > > byte order *is* significant in registers precisely because we can treat them
> > > > as an indexed vector of bytes by using vector u8 instructions. having that
> > > > vector of bytes match the vector of bytes in memory is important for
> > > > performance and consistency, since otherwise we will have to insert tons of
> > > > byte-swap instructions for memory-order bitcasting that would otherwise be
> > > > totally unneeded.
> > > 
> > > no, you just use either ld-reverse or not.  the ld and st operation takes
> > > care of the bytereversing, that's why it was added to OpenPOWER.
> > 
> > no, you don't. bitcasting is a register reinterpret operation (register to
> > register), using load/store operations to implement bitcasting is slow and
> > wasteful (unless you needed to load/store anyway).
> 
> ah i was confused by the mention of "memory", i thought you were exclusively
> referring to memory-to-register trabsfers.
> 
> > on LLVM, bitcasting usually compiles to no instructions, or rarely a
> > register to register move instruction.
> 
> here, is there any reason why bitmanip would be insufficient?

bitmanip can be done, it's just more instructions that wouldn't be needed if endian was consistent between registers/memory.
> 
> also: we need use-cases to justify the time drain spent doing a
> comprehensive evaluation.

yup, I know there are some, I'll look for concrete examples later when I'm not braindead
> 
> 
> > > however if you override elwidth=32 then *even when VL=1* the top 32 bits
> > > WILL NOT be overwritten.
> > > 
> > > why?
> > > 
> > > because elwidth=32 is a SPECIFIC and direct command to the hardware to set
> > > the underlying regfile SRAM write-enable lines  to 0b00001111
> > 
> > Well, I interpret elwidth=32 as a specific command to mean we operate on
> > 32-bit values, so scalars are truncated/sign/zero-extended to/from 32-bits
> > when reading/writing registers.
> > 
> > scalar registers are a *totally different kind* of argument, they are *not
> > vectors*, 
> 
> they are: look at the pseudocode.  they're "degenerate vectors of length
> equal to one".
> 
> i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1
> then SV is disabled, and a different codepath followed that goes exclusively
> to a scalar-only OpenPOWER v3.0B compliant codepath"
> 
> this categorically and fundamentally is NOT the case.

yup, totally agreed.

what I meant is that if you have a SVP64 instruction with scalar arguments:

add r10.v, r3.s, r20.v, subvl=1, elwidth=32, mask=r30

for r3 (but not r10 or r20) it reads the full register, independent of whatever values VL and r30 have, and then truncates the read value to 32-bits then does the adds.

add r3.s, r10.v, r20.v, subvl=1, elwidth=32, mask=r30

for r3 it writes the full register, independent of whatever values VL and r30 have (unless r30==0, then r3 is unmodified), sign/zero-extending the 32-bit sum into the full 64-bit value that is written to r3.

this full register read/write is particularly important for f32 operations, where the scalar representation is in full f64 format (because OpenPower's weird):

fadds f10.v, f3.s, f20.v, subvl=1, elwidth=32, mask=r30

if f20 holds 0x3F800000 0x40000000 (1.0f32 2.0f32)
and f3 holds 0x3FF0000000000000 (1.0f64)
and VL == 2 and r30 == 0b11

then f10 will hold 0x40000000 0x40400000 (2.0f32 3.0f32)
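
(a python sketch of that fadds, holding the hypothetical packed-f32 vector as raw bit patterns:)

    import struct

    f20 = [0x3F800000, 0x40000000]  # packed f32 elements: 1.0f32, 2.0f32
    f3 = 1.0                        # scalar src, held in full f64 format
    f10 = []
    for bits in f20:                # VL == 2, mask r30 == 0b11
        elem = struct.unpack("<f", struct.pack("<I", bits))[0]
        total = struct.pack("<f", elem + f3)   # round result to f32
        f10.append(struct.unpack("<I", total)[0])
    assert f10 == [0x40000000, 0x40400000]     # 2.0f32, 3.0f32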


basic summary: VL=1 is not special, mask with only 1 bit set is not special.
SUBVL=1 *and* reg set to scalar is special. SUBVL>1 and/or reg set to vector is *not* special.
Comment 31 Jacob Lifshay 2020-12-31 06:31:34 GMT
(In reply to Jacob Lifshay from comment #30)
> (In reply to Luke Kenneth Casson Leighton from comment #28)
> > (In reply to Jacob Lifshay from comment #23)
> > > scalar registers are a *totally different kind* of argument, they are *not
> > > vectors*, 
> > 
> > they are: look at the pseudocode.  they're "degenerate vectors of length
> > equal to one".

clarification: don't agree with this
> > 
> > i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1
> > then SV is disabled, and a different codepath followed that goes exclusively
> > to a scalar-only OpenPOWER v3.0B compliant codepath"
> > 
> > this categorically and fundamentally is NOT the case.

agree that VL==1 && SUBVL ==1 being special is not the case.
> 
> yup, totally agreed.
Comment 32 Luke Kenneth Casson Leighton 2020-12-31 16:35:39 GMT
alexandre, i haven't forgotten you: i want to follow this thread first, i'll come back to what you wrote after.

(In reply to Jacob Lifshay from comment #30)

> bitmanip can be done, it's just more instructions that wouldn't be needed if
> endian was consistent between registers/memory.

i get it: i just don't like it (due to how intrusive a change it is on the codebase), and my intuition is also ringing some alarm bells when it comes to the width-extension.

i have a sneaking suspicion that the width-extension would end up jamming the bytes into completely the wrong end (as if they had been shifted up).

that the "simple looking" perfectly reasonable assumption that memorder equals regSRAMorder is far from simple: it holds only when the bitwidth is 64, and screws 32-bit and 16.  or if you declare it as 32 bit, it screws 64 and 16.

a case could be made that "ok ok you analyse that case and get it right in hardware" however even just having to consider that and go through it comprehensively *we do not have time*.

my estimates are that this exercise will add somewhere around 6 to 8 weeks onto our already pressurised timescales.

i would far rather that this be declared "a future problem solvable with a new MSR bit" and leave it at that for now.



> > 
> > also: we need use-cases to justify the time drain spent doing a
> > comprehensive evaluation.
> 
> yup, I know there are some, I'll look for concrete examples later when I'm
> not braindead

:)

they will be important to give us an indication of how much a priority this really is.
> > i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1
> > then SV is disabled, and a different codepath followed that goes exclusively
> > to a scalar-only OpenPOWER v3.0B compliant codepath"
> > 
> > this categorically and fundamentally is NOT the case.
> 
> yup, totally agreed.

wheww, because it would be a bit of a disaster :)
 
> what I meant is that if you have a SVP64 instruction with scalar arguments:
> 
> add r10.v, r3.s, r20.v, subvl=1, elwidth=32, mask=r30

ah.  right.  ok so i had to do an update to the overview page about this, because we made the change that both src and dest can have different elwidths.

so ahh unfortunately the example you give is ambiguous because i do not know if you meant that src *and* dest are 32 bit, or just dest (because you forgot to add elwidth_src=something)

let me see if i can work it out / deduce it...

> for r3 (but not r10 or r20) it reads the full register,

ah.  "reads r3 at full width" this means "elwidth_src=default" was missing from the prefix:

   add r10.v, r3.s, r20.v, subvl=1, elwidth=32, mask=r30, elwidth_src=default


r20 on the other hand being 32 bit... deep breath... ***No***.  that is precisely and exactly what i described in the previous comment, taking about an hour to do so, explaining why it is dangerous.

please understand: it is ABSOLUTELY FUNDAMENTALLY CRITICAL to treat scalars as not being scalars at all but as "vectors of length 1"

i get why it would seem to make sense, because it would mean that a 64 to 32 bit register copy is needed, with the top bits being zero in the destination.  or that the top bits of r3.s would need to be zero'd out.

this is just how it has to be, not least because that copy-of-length-1 (into r3.s) serves a hugely significant purpose of avoiding lane-crossing when the vector-add is performed.

the "proper" solution is in fact to add src1 elwidth  src2 elwidth src3 elwidth etc etc etc. all of which was part of SV-Orig and we simply don't have the space for it in svp64.

elwidth overrides are already a sub-par performance route.  strictly speaking we shouldn't be allowing them at all because the lane-crossing that results has a huge impact.

it's CISC basically.

but, Lauri made a good case for allowing src-dest elwidth overrides, to support saturation properly.  so... it's in.



> independent of
> whatever values VL and r30 have, and then truncates the read value to
> 32-bits then does the adds.
> 
> add r3.s, r10.v, r20.v, subvl=1, elwidth=32, mask=r30

this is a declaration that the destination is 32 bit.  that means DO NOT touch the top word of r3.  end of story.  (otherwise it has to go via the lane-crossing path)
 
> for r3 it writes the full register, independent of whatever values VL and
> r30 have (unless r30==0, then r3 is unmodified), sign/zero-extending the
> 32-bit sum into the full 64-bit value that is written to r3.

NO.  i will say it again: this is not a good idea. i will say it again: it results in lane-crossing and that kills performance.


> this full register read/write is particularly important for f32 operations,
> where the scalar representation is in full f64 format (because OpenPower's
> weird):

right.  here we need to overload fmv and/or add an fcvt-for-mv operation, specifically to deal with this.  and/or use a mode bit (somehow) to indicate that the fmv is to perform fcvt.

in RISCV the fcvt operation already exists because of the difference in the formats.

overloading the elwidths on RV was easy.

VSX *does* actually have such an operation precisely because the FP32 values are indeed packed.

with both RV and VSX having fcvt operations it is not unreasonable to add them to SV.

yes it is a royal nuisance.

yes keeping the behaviour of "scalar is just a degenerate vector of length 1" is that critical.

we *do not* want to be special-casing instructions based on what the width happens to be.

things are already complicated enough and borderline CISC.


> basic summary: VL=1 is not special, mask with only 1 bit set is not special.
> SUBVL=1 *and* reg set to scalar is special. 

categorically, fundamentally and absolutely NO.

the problems this will cause are too great.

scalars are degenerate vectors of length 1, period.

no exceptions.  no special cases.  not ever.

if there are problems caused by that, such as OpenPOWER annoyingly storing scalar FP32 sprinkled throughout the full 64 bits then it is dealt with by adding an fcvt instruction.

*not* by violating the rules of SV by adding special cases.

the reason is down to the fact that elwidths is already complex enough, causing lane-crossing that will dramatically slow down operations.

far from being a "disadvantage" that fcvt operation will actually speed up execution by aligning the elwidths of all operands.

no special cases.  aside from anything it will be weeks to go through all the documentation and update them all.

and you know the answer on that one: we don't have time.

we need to move to implementing in the next few days, maximum of 10-14 days further delay and even that is too long.
Comment 33 Jacob Lifshay 2020-12-31 19:02:37 GMT
Some very approximate statistics on how common bitcasting is in vector code:
jacob@jacob-desktop:~/llvm-project$ find | grep '.*\.ll$' | xargs -d$'\n' grep '<[0-9]\+ x ' | sed 's/^\([^:]*\):.*/\1/' | uniq | xargs -d$'\n' cat | grep bitcast | wc
  31146  353435 1776041
jacob@jacob-desktop:~/llvm-project$ find | grep '.*\.ll$' | xargs -d$'\n' grep '<[0-9]\+ x ' | sed 's/^\([^:]*\):.*/\1/' | uniq | xargs -d$'\n' cat | wc
2090019 14424313 95369527
jacob@jacob-desktop:~/llvm-project$ git status
HEAD detached at llvmorg-11.0.0
nothing to commit, working tree clean

So 31146 bitcast instructions out of 2090019 lines of LLVM IR that mention vector types somewhere.
31146*100%/2090019=1.49%
Note that this is mostly LLVM's testsuite so likely not as representative of actual code. Also, this statistic includes comments and blank lines and isn't representative of the part we care about -- how often bitcasts are executed in performance-sensitive code.
Comment 34 Luke Kenneth Casson Leighton 2020-12-31 19:44:21 GMT
(In reply to Jacob Lifshay from comment #33)
> Some very approximate statistics on how common bitcasting is in vector code:

appreciated.


> 31146*100%/2090019=1.49%
> Note that this is mostly LLVM's testsuite so likely not as representative of
> actual code.

understood.  it's a good start.

https://libre-soc.org/openpower/sv/fcvt/

that's going to take a while and need a few pages to go through.  however my feeling is that it can be done as part of implementation: i.e. actually trying to implement it and do the unit tests will require the coverage that in turn will reveal patterns, from which a spec can be written.

the takeaway insight is that single precision opcodes are "half accuracy" ops that allow intermingling of full accuracy ops without fcvts.  thus single precision add on elwidth=32 means "do the op at FP16 accuracy then distribute the bits across FP32 dest register in FP32 format".
Comment 35 Luke Kenneth Casson Leighton 2020-12-31 21:44:22 GMT
(In reply to Alexandre Oliva from comment #24)
> > now the underlying order *does* matter.
> 
> exactly!  that's why we're debating the iteration order.
> 
> see, you're talking about an array of uint8_t, so let's take it from there.
> 
> uint8_t foo[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
> 
> ; r3 points to foo
>   ld r4, 0(r3)
>   ld r5, 8(r3)

load double.  let us assume LE mode.  this should, i believe (but may get this wrong) set r4 equal to 0x0807060504030201 and r5 to 0x100f0e0d0c0b0a09

this basically places values as follows:

int_regs[4].actual_bytes[0] = 0x01
int_regs[4].actual_bytes[1] = 0x02
etc.

where the actual_bytes is in LE order and all of the union c-array members are in LE order.

>   setvli r0, 16
>   svp64 elwidth_src=8-bit mv r6.s, r4.v  

let me work through this.

* elwidthsrc=8 but dest has not been specified, therefore it defaults to 64 bit.
* r6 is the dest, set as a scalar, and it is a 64 bit scalar.

the union of types for reg_t has already had the underlying SRAM store the data in int_regs[4].actual_bytes as 0x0807060504030201

when the for-loop of the mv reads the src, the operation is:

   result = get_polymorphed_reg(4, 8, 0)
   set_polymorphed_reg(6, 64, 0, result)

the fetch from the regfile of r4, element 0 @ bitwidth 8 goes to this line:

    if bitwidth == 8:
        return int_regfile[4].b[0]

therefore the value 0x01 is returned because by accessing b[0] this is getting byte 0 of actual_bytes, and everything is in LE order therefore 0x01 is returned.

that is then zeroextended to 64 bit and stored in

      int_regfile[6].d[0]

this storage, again, is LE SRAM, LE actual_bytes, LE union struct members therefore the value 0x01 goes into byte zero of regfile 6's actual_bytes.
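
a sketch of those two helpers over a flat LE regfile SRAM (illustrative python, not the actual simulator code):

    regfile = bytearray(128 * 8)  # flat LE-ordered regfile SRAM

    def get_polymorphed_reg(reg, bitwidth, element):
        nbytes = bitwidth // 8
        offs = reg * 8 + element * nbytes
        return int.from_bytes(regfile[offs:offs + nbytes], "little")

    def set_polymorphed_reg(reg, bitwidth, element, value):
        nbytes = bitwidth // 8
        offs = reg * 8 + element * nbytes
        regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")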


> should r6 be 1

correct.  as long as i got the original LD correct.  which i cannot guarantee because i get confused about what LE and BE mean.

>, or 8,

not a chance.  same caveat above though.

> or should it depend on endianness?

if we wish to make things so insanely complex that even the inventor of SV cannot understand or cope with it... yes.

(answer: no)

> if we're to go by the array layout model in memory, it ought to be 1, which
> means that, in big-endian mode, the vector iteration order within the
> register should go from most to least significant, whereas in little-endian
> mode, it should go the opposite way.

this is so confusing to me i can't even interpret it. here is why:

1)  do you mean that the 0..VL-1 loop should go in the reverse order?
2) do you mean that the actual_bytes of each reg_t should be in reverse order?
3) do you mean that the bytes of each of the c arrays in the union should be in the reverse order?

or do you mean some combination of those three? (that makes 8 possible combinations)

all eight possibilities are perfectly valid when it comes to considering a "meaning" for BE.

adding this confusion, on top of something that literally took 5 months to get right, is not a good idea.


>  this would maintain the array layout
> equivalence, and it makes perfect sense when you think of how bytes are laid
> out in memory in each endianness: 

but unfortunately, due to a strange form of dyslexia, i don't know what that is.  it takes me several minutes to hours to go through something that, to you, looks "obvious".

> little endian means least significant
> first, big endian means most significant first.  iterating in that order is
> just natural and expected.

yes... but on what? the vector array, the 8 bytes of the SRAM per reg, or the bytes in the individual elements?


 
> now, if we were to iterate over sub-register types always from least to most
> significant, then we're effectively reversing the order from the expected
> memory layout.  IOW, we're visiting first 8, then 7, ... then 1, then 16,
> then 15, ..., then 9.  are you sure this is what you want?

the walkthrough of the pseudocode should make it clear that when considering the underlying SRAM as LE and literally implemented in c, it is not as you've listed.


 
> BTW, should the load sequence into r4..r5 above be equivalent to:
> 
>   setvli r0, 2
>   svp64 elwidth=64-bit, elwidth_src=64-bit ld r4,0(r3)

it is not possible to set elwidth=64.  the options are 8, 16, 32 and "default"

this operation will action 2 LDs.  due to elwidths=default it is functionally directly equivalent to two scalar LDs (with unit strided offsets)

     ld r4, 0(r3)
     ld r5, 8(r3)

i.e. exactly as in the example. 

> and should that really get the reversed byte order in the registers that, in
> big-endian mode, 

if the processor mode is BE, the LD above gets a BE LD.  due to the underlying SRAM being LE this will indeed result in a bytereversal of the data before being put into actual_bytes as 0x0102030405060708

otherwise, if the processor mode is LE it will be exactly as the walkthrough i did above.

this is just how OpenPOWER works.  it's confusing as hell, took me 5 *months* to get right, and trying to change it will require some SERIOUS justification to explain to the OpenPOWER Foundation ISA WG.
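
a quick python illustration of the two cases (int.from_bytes standing in for the LD; either way the data ends up in actual_bytes in LE order):

    mem = bytes([1, 2, 3, 4, 5, 6, 7, 8])
    le_value = int.from_bytes(mem, 'little')  # LE mode ld: straight copy
    be_value = int.from_bytes(mem, 'big')     # BE mode ld: bytereversed on the way in
    assert le_value == 0x0807060504030201
    assert be_value == 0x0102030405060708
    # the BE value sits in the LE regfile with the 0x01 at the MS end:
    assert be_value.to_bytes(8, 'little') == bytes([8, 7, 6, 5, 4, 3, 2, 1])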

now, *when we have time* this can be revisited and added via a MSR bit.


> you say we'd get with:
> 
>   setvli r0, 16
>   svp64 elwidth=8-bit, elwidth_src=8-bit ld r4,0(r3)
> 
> ?

this one is a fascinating degenerate case because you cannot bytereverse a byte.  therefore regardless of LE or BE mode they both do the same thing.

the use of elwidths=8 has effectively "overridden" the "ld" to make it a "lb" (load byte) operation.

the loop becomes:

    for i in range(16):
         res = MEM(r3+i) # one byte
         int_regs[4].b[i] = res

which will end up, in both LE and BE mode, storing the data in actual_bytes as 0x0807060504030201 in r4 and 0x000f0e0d0c0b0a09 in r5.

consider each case to be like this:

    r4 = 0
    r5 = 0
    for i in range(8):
         res = MEM(r3+i)
          res <<= (8*i)
         r4 |= res
         res = MEM(r3+i+8)
          res <<= (8*i)
         r5 |= res

NOT repeat NOT

    for i in range(8):
         res = MEM(r3+15-i)
          res <<= (8*i)
         r4 |= res
         res = MEM(r3+7-i)
          res <<= (8*i)
         r5 |= res

or any other type of loop which is hard to justify and explain.
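
as a runnable python sanity check of the first loop (MEM modelled as a list; foo[15] is zero because only 15 initialisers were given):

    MEM = list(range(1, 16)) + [0]
    r4 = r5 = 0
    for i in range(8):
        r4 |= MEM[i] << (8*i)
        r5 |= MEM[i+8] << (8*i)
    assert r4 == 0x0807060504030201
    assert r5 == 0x000f0e0d0c0b0a09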


> 
> the point is, we have to make a choice here. 

already made, 18+ months ago.  captured by the SRAM of the regfile, as specified by the c data structure, being defined categorically as LE ordered.

now, admittedly it had never occurred to me that anyone would consider inverting the meanings of the SRAM, hence why i didn't document it (which i will do)

however over the 2+ years of development of SV my thinking has always been in LE order as far as that c-based union is concerned.

changing that now when we are so far behind is, again, just not a good idea.

> do we choose
> 
> a) compatibility with the memory/data endianness selected for the system,
> and set the iteration order in sub-register vector elements to match, or 

no.  because the decision was made 18-24 months ago that the SRAM for the regfile and associated access is in LE.  all documentation has been written with that in mind.  all code has been written with that in mind.

it's not changing (and can be changed with a separate revision AFTER we have completed the implementation, and have 3-4 months free to discuss it, document it and implement it)

the endianness is "removed" by the HDL code and everything goes into data AS QUICKLY AS POSSIBLE in LE order.

memory is read: the bytes are reversed AS SOON AS POSSIBLE to get the f*** away from BE as quickly as possible.

the entire HDL is in LE order.

the documentation for the regfile SRAM is LE order

the regfile implementation is in LE order

all simulator source code deals with BE by providing a class that presents BE bit accesses as... LE order.

trying to mess with this will literally set us back months.

the decision has already been made, and is not going to change if we want to succeed.
Comment 36 Alexandre Oliva 2021-01-04 02:31:05 GMT
> let us assume LE mode

doh, you can't show it works the same way in both endiannesses by assuming it's the one that gets you the result you want.  you have to work it out in both.  and if you had, you'd have realized that the answer that you say we *must* get is not the one we get with the current specifications.

Namely, with BE, what you load into r4 is 0x0102030405060708, and then, if you iterate over the bytes as a vector in the order you've decided, that starts at the LSB, you visit 0x08 first, then 0x07, till 0x01, then you go to 0x00, 0x0f, 0x0e, ..., 0x09.

Which maybe fits in with the notion of "let's pretend it's all LE", but...  it's not how vectors work in BE mode on ppc.  and, what's more, it's not how you stated you wanted vectors to work in svp64 in BE mode.
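
worked through as a small python sketch (plain ld in BE mode, LE regfile bytes, LSB-first element iteration as currently specified):

    mem = list(range(1, 16)) + [0]                # foo, with foo[15] == 0
    r4 = int.from_bytes(bytes(mem[0:8]), 'big')   # BE ld: 0x0102030405060708
    visit = [(r4 >> (8*i)) & 0xff for i in range(8)]  # LSB-first iteration
    assert visit == [8, 7, 6, 5, 4, 3, 2, 1]      # the 8 is visited first, not the 1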

if your approach to looking into endianness issues is "let's pretend it's LE", I strongly recommend stating that we'll only support LE period, because these things don't get right just by not looking into them.  that's better than getting dysfunctional hardware out there, and then having to either break compatibility or, worse, maintain it indefinitely, with stuff that hasn't been thought through.
Comment 37 Luke Kenneth Casson Leighton 2021-01-04 03:31:42 GMT
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst_multi.py;h=8e9f4ec35718541824a78e3ab857dfbf8f307378;hb=1c274daab0955fa0e1cb98c4fe43709b7f795c99#l500

byte reversal for store is slightly further down.  the byte mode is a property of memory, not of the ALUs, regfiles or data paths between them.

alexandre: there is no pretending (or there is, but in a clear definitive sense).  it is much stronger than that.  there is a choice (a declaration), "data in BE will not be tolerated in ALUs or regfiles, it will be converted to LE, stored as LE, processed as LE and only on moving back to memory converted to BE".

both LibreSOC and microwatt make this decision.

the conversions are done IMMEDIATELY the data is read from memory, where the width is known, and the conversion on store is done at the last minute as well.

no data is EVER permitted to enter even one single ALU or regfile without first being converted to LE.

processing only occurs in LE ALUs.

if this had not been done it would be necessary to have duplicate ALUs: one for LE, one for BE. or, byteswapping would need to be a property of the regfile. clearly that is absurd so the decision has been made to convert and work internally as LE.

this works because it is only the memory interface that has the ordering.

to reiterate: it is *Load and Store* that have the LE/BE property.

does this provide you with the missing information to clarify?

if not then i need to understand what it is that you are missing, so that it can be made clear.
Comment 38 Alexandre Oliva 2021-01-04 03:50:19 GMT
luke, please do work out the example in BE as you did in LE.
it's not me the one who needs the clarification.
your answer to the question directly contradicts what you claim we do and can't change.  one of them has to give.

uint8_t foo[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };

; r3 points to foo
  ld r4, 0(r3)
  ld r5, 8(r3)
  setvli r0, 16
  svp64 elwidth_src=8-bit mv r6.s, r4.v  

should r6 be 1, or 8, or should it depend on endianness?

remember, you answered it had to be 1, but the byte-order reversal from BE to LE at the loads will place the 1 at the wrong end of the vector register for it to be the first 8-bit element per your own description of the expected and intended behavior.
Comment 39 Luke Kenneth Casson Leighton 2021-01-04 04:33:10 GMT
it is currently 4am so please do take that into consideration

to get the data into the correct order in the ALUs if it is stored in memory in the "wrong" order then the bytereverse LD/ST operations are (or are not) used.

if the processor is in LE mode and the data in memory is BE ordered then the bytereverse LDST operations are used. result: the data is in LE when stored in the regfile

if the processor is in BE mode and the data in memory is BE ordered then the standard non-reversing LDST opcodes are used ***AND THESE ARE ALL PROGRAMMED TO PERFORM A SWAP BECAUSE THE PROCESSOR IS IN BE MODE***.  result: data is in LE when stored in the regfile.

in each case: data is stored in LE order in the regfile.

actually what is done is that the MSR.BE flag is XORed with the LDST bytereverse bit from the opcode.  it should be in the source code somewhere.

this is how it is done in microwatt, and it's how it's done in libresoc.

i already did the walkthrough but you missed the significance of the clarifying statements about using bytereversed LD/ST.

again: the LE/BE ordering is a property of memory not of the ALUs or regfiles and it is the responsibility of the programmer to use the right one.

the BE memory is inverted at the point of LD because that is where the size (1,2,4,8 bytes) is known.

the use of the correct LD/ST instructions by the programmer take care of removal of the reverse ordering.

thus, beyond that load point the treatment of the data is exactly the same because in both cases it has been converted to the convention of always internally operating in LE.

this is probably why you believe that i did not cover BE in the walkthrough, when if you look at it again, in light of the explanation of LDST bytereverse and how that is XORed with MSR.BE you will see what i did.

again: i reiterate: microwatt does this exact same trick, it's where i got it from
Comment 40 Alexandre Oliva 2021-01-04 05:42:09 GMT
the ld instructions in the example don't request byte-reversal.

in big-endian mode, the lowest-address byte of a dword (the one holding the 1) is the MSByte, and the highest-address byte of that dword (the one holding the 8) is the LSByte: the in-memory-order array {1,2,3,4,5,6,7,8}, if read as a dword, comes out as 0x0102030405060708

without an explicit request for byte-reversal, the ld instruction will keep the MSByte in the MSByte, and the LSByte in the LSByte, even though it seems to reorder the bytes because of the way we represent the register.

our vector insns iterate from LS to MS elements.

therefore, in BE, our vector insn will visit the 8 as the first byte element, and the 1 as the 8th byte element.

can you point at any error in this proof?


yes, if you use an alternate instruction to load the vector (e.g., load with byte swapping, or byte vector load), you get different results, but if you're answering a question other than the one I asked when seemingly answering the question I asked, you're not clarifying, you're making for confusion, because what I expect is an answer to the question I posed, not to something else that you'd rather answer.


now, why is that so freaking important, you might be asking yourself.

it's because there are general expectations of how data is laid out.

when we state the vector can be indexed like an array, it's taken as meaning that neighboring vector elements are at adjacent addresses in memory, and that lower-indexed ones are at lower addresses.

if I tell the compiler that I have an array of 8 bytes and initialize it in memory as {1,2,3,4,5,6,7,8}, it will lay them out this way.

the expectations about loading vectors from memory, and storing them in memory, are also supposed to abide by these normal layout assumptions.

when you index arrays in vector registers, there's still an expectation that the indexing gets you the same elements you'd get in memory order.

but what you're saying is that, in our system, if we do what is generally expected to do the job, you'll get BE vector indexing backwards on a per-load-unit basis.

I say per-load-unit because if you load the vector one byte element at a time, you preserve the indexing in the vector register.  if you load a half-word at a time, you get odd and even bytes swapped.  if you load a word at a time, you get reversed indexing within each word.  and so on.
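
illustrated as a small python sketch (BE mode, LE regfile bytes, LSB-first indexing, with the load-unit size as the parameter):

    mem = bytes([1, 2, 3, 4, 5, 6, 7, 8])

    def load_units(size):
        # BE load of each unit, landing in LE regfile byte order
        out = b''
        for i in range(0, len(mem), size):
            out += int.from_bytes(mem[i:i+size], 'big').to_bytes(size, 'little')
        return list(out)

    assert load_units(1) == [1, 2, 3, 4, 5, 6, 7, 8]  # bytes: order preserved
    assert load_units(2) == [2, 1, 4, 3, 6, 5, 8, 7]  # halfwords: pairs swapped
    assert load_units(8) == [8, 7, 6, 5, 4, 3, 2, 1]  # dword: fully reversed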

can that be worked around?  quite likely.  the compiler will have to figure out how the units get reordered depending on the way the register was set up, and adjust the indexing to match.  the problem with that is that there are plenty of operations that are supposed to be no-ops, but that in this scheme won't be, so the compiler might quietly optimize out that which would fix the mess.  that's what one gets for breaking generally-held assumptions.

now, the other way to go about this is to never ever allow mode changes in vectors: you load the vector from memory for use in a certain way, you use it that way only, and if you wish to use it a different way, you store it and load it back the other way.  and cross your fingers that the compiler won't optimize that out too, because the store and load back also seem like a nop from the general layout model in the compiler.

I see a lot of risk there, risk that could be averted by using an endianness-compatible iteration order within vector registers, that would match (corresponding) memory order.  that would enable us to bundle memory ops into dwords, even in BE mode, and have the vector registers hold data in the expected significance order.

without that, departing from the normal conventions, you may think you're avoiding trouble of working out endianness, but instead you're pushing that trouble onto every other upper layer, by turning what is a well-understood model (for those who've worked it out) into something that's different enough to make trouble but not enough to disable all learned intuitions, so it will lead to errors.
Comment 41 Luke Kenneth Casson Leighton 2021-01-04 05:52:07 GMT
the reason for using the (hard) chosen convention of internal LE is down to the way that nmigen refers to bytes and bits.

although it is a wiring decision which order to store in, it requires explicit calling of a reversing function to get data into BE wire order with nmigen.

if that is to be optional then it is no longer hard wiring: it requires a MUX array.  if it is to be 1/2/4/8-byte-reverse selectable that is a considerable number of MUXes.

the numbering of bytes in arrays in nmigen is LE ordered.

the arithmetic of the __add__ etc operators all assume LSB0 order

combining these two factors is why all data is stored in LE and processed in LE.

it is why the data is converted out from memory as fast as possible from BE and written to memory as late as possible to BE.

to add BE storage capability (the ability to perform byteswapping) to the regfile would actually be an absolute bloody nuisance, plus it isn't in the least bit clear how that could even be practical given how many possible points such bytereversing could be targeted at.

to be "useful" it would require src and dest regs to have a bit each in svp64 to indicate that the operand was to be bytereversed, at the specified byte size.

given that simply following the convention of bytereversing the data at the load point gets you into the natural arithmetic order of the HDL tool that we are using, i am really failing to see any benefit here to spending 3 to 4 bits on something that can be dealt with at LD/ST time or, if it is absolutely necessary arithmetically, by using bitmanip opcodes which perform bytereversal if pushing through memory is undesirable.

things that have 10 to 30% usage are our highest priority.  5% is also worth considering.

1% is only worth considering if it can be combined with 5 to 10 other 1% savings.

it needs to be shown that, after the correct LDST is performed, there is still a minimum 5% instructions that can be saved by adding the required 3-4 bits to allow operands to be bytereversed.

plus, taking into consideration what would be lost (removed from svp64) by doing so, the penalty for that is likely far greater than a 5% saving.
Comment 42 Luke Kenneth Casson Leighton 2021-01-04 06:05:42 GMT
(In reply to Alexandre Oliva from comment #40)

> without an explicit request for byte-reversal, the ld instruction will keep
> the MSByte in the MSByte, and the LSByte in the LSByte, 

remember that it took me 5 months to get the code passing the unit tests due to a form of dyslexia.

even so i believe that what you say is incorrect.

my understanding - and bear in mind that the code passes unit tests, for all permutations of MSR.LE/BE, and for all permutations of whether bytereverse ldst or straight ldst is used - is that the MSByte does NOT go into the MSByte but into the LSByte under the circumstances that you describe.

this for both microwatt and libresoc.

this because it is a chosen convention.

that convention being: ALU and regfile are LE.

therefore it is required - *required* - that BE data be bytereversed when loaded from memory.

this misunderstanding may be the source of confusion.
Comment 43 Alexandre Oliva 2021-01-04 06:09:30 GMT
luke, *please* stick to what I'm writing and suggesting, not to your imaginations and fears.

I have *not* objected to keeping registers in LE.  not at all.

I have *not* suggested adding bits to reverse byte order to any insn or prefix or anything.

My suggestion was to iterate on sub-register vector elements in the (corresponding-to-)memory endianness order: from LS to MS in LE, and from MS to LS in BE.  that's all.

it's just about keeping the indexing in memory order, so that the vector does operate as the same array, whether in memory or in registers, regardless of whether you use the most specific or the most efficient memory operations between them.

not doing that will set back the compiler development time for quite a while, because we'll have to teach the compiler that the registers, when used in vector mode, have endianness messed up in just the right way.  you know the drill, it's a nightmare that we don't want to go through.
Comment 44 Cole Poirier 2021-01-04 06:16:20 GMT
(In reply to Alexandre Oliva from comment #43)
> luke, *please* stick to what I'm writing and suggesting, not to your
> imaginations and fears,

I think this discussion will be much more productive when it's not the middle of the night. Why don't we come back to this in about half a day once we all have had a good night's sleep?
Comment 45 Cole Poirier 2021-01-04 06:16:34 GMT
Removed weird duplicate of comment #44
Comment 46 Luke Kenneth Casson Leighton 2021-01-04 06:26:57 GMT
here we go:

https://github.com/antonblanchard/microwatt/blob/39c826aa46a9dd80a12b572373c55d6156c4df07/execute1.vhdl#L1298

note the XNOR with MSR.LE.  MSR.LE=0 is processor BE mode.

table:

* LDST op is not brev variant LE clear i.e. BE - XNOR sets to 1
* LDST op is brev variant LE clear i.e. BE - XNOR sets to 0
* LDST op is not brev variant LE set - XNOR sets to 0
* LDST op is brev variant LE set - XNOR sets to 1
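
as a quick python check of that table (swap=1 means the bytereverse is performed inside LD/ST):

    for brev in (0, 1):
        for msr_le in (0, 1):
            swap = 1 - (brev ^ msr_le)  # XNOR of the opcode brev bit with MSR.LE
            print(f"brev={brev} MSR.LE={msr_le} swap={swap}")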

so yes i was correct.  the **NON BREV** variant when BE is set causes a bytereversal inside LD/ST.

it was an XNOR operator not an XOR operator.

microwatt stores data in regfile in LE order and its ALUs are all processing in LE order.

here's where the actual reversal is done, in LD/ST

https://github.com/antonblanchard/microwatt/blob/39c826aa46a9dd80a12b572373c55d6156c4df07/loadstore1.vhdl#L349

note that the opcode brev bit was XNORed with MSR.LE as input to that function.

confirmed, then:

it is definitely the case that MSByte of memory goes into LSByte of regfile (and ALU) for a ld opcode when the processor is in BE mode.
Comment 47 Alexandre Oliva 2021-01-04 06:29:08 GMT
> the MSByte does NOT go into the MSByte but into the LSByte under the circumstances that you describe

sorry, but that's nonsense.  get some sleep and then think about what you just said.  if it were true, then when you declared:

int64_t i = 1;

the moment i got loaded into a register, the 1, that is in the least significant byte, would be moved to the most significant byte of the register, and so on, with the most significant byte, that is a zero, landing in the least-significant byte in the register.  i.e., the register would hold 72057594037927936 instead of 1.  that's not what you want from dword loads.
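
(quick python check of that arithmetic:)

    assert int.from_bytes((1).to_bytes(8, 'little'), 'big') == 72057594037927936 == 2**56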

reversing bytes is something you do to move the MS bits in one representation to the MS bits in another representation, and the LS bits in one representation to the LS bits in the other representation.  You generally don't want to change their significance, only their order.  You seem to have got them mixed up, which may be caused by this dyslexia you mention, or even be the root cause thereof.
Comment 48 Luke Kenneth Casson Leighton 2021-01-04 06:37:27 GMT
(In reply to Alexandre Oliva from comment #43)
> luke, *please* stick to what I'm writing and suggesting, not to your
> imaginations and fears,

as long as it is not clear to me that you understand how the HDL handles BE/LE right now (stating erroneously that it places MSByte of memory into MSByte of regfile) we cannot proceed to the vector case.

let us stick to the scalar case for now.

can you clarify: were you stating that that is how you would *like* the scalar execution to behave, or were you stating how you *believe* the HDL of both microwatt and libresoc work?
Comment 49 Luke Kenneth Casson Leighton 2021-01-04 06:44:28 GMT
(In reply to Alexandre Oliva from comment #47)

> generally don't want to change their significance, only their order.  You
> seem to have got them mixed up, which may be caused by this dyslexia you
> mention, or even be the root cause thereof.

that's why i pointed you at the microwatt source code, instead.

the XNOR of MSR.LE and the ld op where byterev=0 in the CSV file (or microwatt equivalent) gives a 1

which means: perform the bytereverse.

therefore MSByte of memory goes into LSByte of regfile.

check the BR column

https://github.com/antonblanchard/microwatt/blob/39c826aa46a9dd80a12b572373c55d6156c4df07/decode1.vhdl#L283

search for the ld operation.  you will see that the BR column for ld is zero but for ldbrx it is 1.

0 XNOR 0 is 1.  therefore bytereversing is performed.
Comment 50 Alexandre Oliva 2021-01-04 08:04:55 GMT
ok, I can tell you've got the concept of significance all confused with position.

Significance has to do with the quantity the symbol stands for, not with its position.

Consider our numbering system: 123 stands for one hundred, twenty-three.  Positional systems assign different significance according to position.  The systems we're most used to represent numbers with the highest significance at the leftmost positions, i.e., we assign a higher significance to the digits on the left.  So it's 1 hundred, 2 tens and 3 units.  Other numbering systems make the opposite convention: rightmost symbols take on highest significance.  E.g., one could read out this number as three-and-twenty-and-a-hundred.  That would reverse the order, but not change the significance.

Another example that could help clear up the difference between position and significance is that of date conventions.  Month is more significant than day, when counting time, but if you take Dec 25, whether it's represented as 12/25 or 25/12, though the order changes to fit each convention, the month remains the most significant quantity.

Byte swapping for endianness adjustment is like changing the representation of dates: the date remains the same, each component symbol remains the same, but it's moved to the corresponding position of the same significance: the month is moved to the position that signifies the month, and the day is moved to the position that signifies the day.

Loading a big-endian word from memory to a register, reversing bytes, does the same: the clear or set bit that stands for 1 is placed in the corresponding bit in the other representation that stands for 1; the bit that stands for 2, or 4, or 8, 16, etc, is placed in the corresponding bit in the other representation OF THE SAME SIGNIFICANCE.  the most significant bit, that is often used to represent the sign, goes in the most significant bit in the other representation too.  that reverses the order, but the significands retain their significance, being positioned in the corresponding-significance positions, just like day and month in a date.

do you see now why it doesn't make sense that the conversion from BE to LE (or vice versa) places the MSByte in the LSByte?  it doesn't!  that would change the represented value.  what changes is the order of the bits, so that they *retain* their significance in the alternate representation, i.e., so that they still represent (= mean, signify) the same number.

now, we represent numbers in conventions that tie position with significance.  though we normally read them from left-to-right, most arithmetic algorithms that we learn at school go right-to-left.  that doesn't change the significance of the digits, does it?

if you take an unmarked soroban abacus (those that have some beads conventionally standing for 1, and others in a separate part of the same row standing for 5), move the beads so that they represent a certain large number, and then look at the abacus in a mirror, you may be a little confused because then the digits, from most to least significant, appear from right to left rather than left to right.

but if you're given a picture of an unmarked soroban, and you're told it might or might not be a picture of a mirror reflection, you won't know how to assign the significance to each row of beads.  the unmarked soroban is like a register: without a predefined order (e.g. memory addresses) and convention (e.g., increasing or decreasing significance), you know the bits, but not the significance.  now, once you establish a convention of significance, e.g. by training the performance of arithmetic operations from right to left, you have assigned meaning to the rows, but you can still read them right-to-left, least-significant first (three-and-twenty-and-a-hundred) or left-to-right, most-significant first (one hundred, twenty-three).  the convention of how significance relates with order of representation in memory is called endianness.  if lower-address implies lower-significance, and vice-versa, that's  LE; if lower-address implies higher-significance, and vice-versa, that's BE.  if we were to take address order as left to right (a frequent but not mandatory convention), then our positional numbering systems would be BE, as would MM/DD, whereas DD/MM would be LE.

I hope this helps.
Comment 51 Cesar Strauss 2021-01-04 10:33:30 GMT
(In reply to Alexandre Oliva from comment #38)
> uint8_t foo[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
> 
> ; r3 points to foo
>   ld r4, 0(r3)
>   ld r5, 8(r3)
>   setvli r0, 16
>   svp64 elwidth_src=8-bit mv r6.s, r4.v  

It seems to me that to manipulate vectors from memory, you must use vector loads and stores as well, not mix scalar and vector as above. It should be something like:

uint8_t foo[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
; r3 points to foo
  setvli r0, 16
  svp64 elwidth_src=8-bit ld r4.v, 0(r3)
  svp64 elwidth_src=8-bit ld r5.v, 8(r3)
  svp64 elwidth_src=8-bit mv r6.s, r4.v  

There is no byte-swapping in a vector of bytes, since a byte-swapped byte is just itself. As for the byte order in a vector, note that even in big-endian mode, a vector of bytes is stored sequentially (not byte-reversed every 8 bytes); see the bottom of p25 of the v3.0B spec.
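
a quick python check of that (a byte-swapped byte is itself, so element-wise byte loads are endian-invariant):

    mem = bytes(range(1, 16)) + b'\x00'   # foo, 16 bytes
    for endian in ('little', 'big'):
        elems = [int.from_bytes(mem[i:i+1], endian) for i in range(16)]
        assert elems == list(range(1, 16)) + [0]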
Comment 52 Cesar Strauss 2021-01-04 12:13:54 GMT
It seems to me that, as long as:

1) we rigorously stick to vector (SVP64, SUBVL) load, stores and operations on vector registers,
2) stick to predication to access its sub-elements,
3) do not use non-SVP64 instructions on register previously used as vectors and vice-versa,
4) do not change SUBVL on the same vector register

Then, the "endianess of the register file", and "VL indexing direction" should become totally transparent (architecturally invisible). We can choose one mode (say LE) and stick to it.

Just my two cents.

I do admit that, as I reread the thread, I'm still thoroughly confused.
Comment 53 Luke Kenneth Casson Leighton 2021-01-04 15:23:11 GMT
(In reply to Alexandre Oliva from comment #50)
> ok, I can tell you've got the concept of significance all confused with
> position.
> 
> Significance has to do with the quantity the symbol stands for, not with its
> position.

probably.  or... *deep breath*... i did not mention that the v3.0B and v3.1B specs use MSB0 bit-numbering.
https://en.wikipedia.org/wiki/Bit_numbering#MSB_0_bit_numbering

whereas both microwatt and libresoc internally use LSB0 numbering
https://en.wikipedia.org/wiki/Bit_numbering#LSB_0_bit_numbering

this causes code to have to do things like 31-i, 7-i, 3-i, 63-i:

https://github.com/antonblanchard/microwatt/blob/39c826aa46a9dd80a12b572373c55d6156c4df07/execute1.vhdl#L841

in the case of CRs this resulted in four months to find and fix a long-standing bug.


> do you see now why it doesn't make sense that the conversion from BE to LE
> (or vice versa) places the MSByte in the LSByte?

deep breath: it doesn't matter if it "makes sense", it's what the actual code - the simulator, the HDL of microwatt and the HDL of Libre-SOC - actually do.

and they all pass all unit tests in both LE/BE mode.

therefore, the thinking - against "common sense" - has to be adjusted to
IBM's way (PowerISA's way) of thinking.

which is known to be spectacularly weird, including being the only modern ISA spec to continue to insist on using MSB0 numbering.

sigh.

now, it *could* be as simple as, "despite it not making sense you used the wrong opcode in the example".  it could be as simple as: you used ld in the example where you should have used ldbrx, because of the IBM/POWER-weirdness.


> memory is called endianness.  if lower-address implies lower-significance,
> and vice-versa, that's  LE; if lower-address implies higher-significance,
> and vice-versa, that's BE.

leaving aside IBM's use of MSB0: unfortunately - we have it as a matter of straight fact, from the evidence that i've presented, and from the unit tests passing 100% in both microwatt and libresoc, that an inversion is occurring at the byte level where you are clearly not expecting one, for the opcode named "ld".

now, given that no problems occur in gcc, we can put forward a hypothesis that this bizarreness has been "solved" (including in gcc) by simply using the opposite LD/ST opcode: ldbrx used rather than ld and vice-versa.

whether that's the case, i don't know.  however i am telling you - fact - and the source code is *not* going to change - because the unit tests pass - fact - in both microwatt and libresoc - that ld *does* do byte-reversal in BE mode and ldbrx does *not* do byte-reversal in BE mode.

correspondingly (because of the XNOR): ld *does not* do byte-reversal in LE mode and ldbrx *does* do byte-reversal in LE mode.

again:

* ld     LE: straight
* ldbrx  LE: byte-reversed
* ld     BE: byte-reversed
* ldbrx  BE: straight

these are the facts, from both codebases, both passing unit tests 100%.

it is LD and LDBRX *in LE mode* that behaves "as expected" (byte order is left alone, according to expectations of what the opcodes "should" do) and it is LD and LDBRX *in BE mode* that has the byte-ordering "reversed" against "expectations" of behaviour for these opcodes.

once the significance of this has sunk in i believe it may start to make sense.  ultimately i think it is that XNOR that is confusing you.

at the moment i cannot yet tell if you are still at the "disbelief of how things work in Power ISA" stage.

this is very common :)


(In reply to Cesar Strauss from comment #52)
> It seems to me that, as long as:
> 
> 1) we rigorously stick to vector (SVP64, SUBVL) load, stores and operations
> on vector registers,
> 2) stick to predication to access its sub-elements,
> 3) do not use non-SVP64 instructions on register previously used as vectors
> and vice-versa,
> 4) do not change SUBVL on the same vector register

deep breath: these are things that were envisaged, from the very beginning (2 years ago) to be allowed.

otherwise we might as well have a completely separate Vector regfile, and we lose the advantage of not having (not needing) inter-regfile conversion / mv opcodes.

actually even if we did have a Vector regfile the problem still exists because of the way that the union typedef works.
 
> Then, the "endianess of the register file", and "VL indexing direction"
> should become totally transparent (architecturally invisible). We can choose
> one mode (say LE) and stick to it.

this was initially the decision made by RISC-V RVV: that if the "parameters" change, the contents of the Vector regfile are actually wiped out.  however within a year multiple people explained to them that the extra overhead involved in transferring between scalar and vector regfiles, as well as the extra cost of the vector regfile SRAM, was too great for some implementors.

consequently they provided a "fit on top of FP" mode and had to define - just as we are needing to do - a precise and exact ordering of the entire SRAM of the regfile(s).

of course, they don't have to deal with IBM numbering *sigh*

> Just my two cents.
> 
> I do admit that, as I reread the thread, I'm still thoroughly confused.

it's why i'm very very reluctant to go messing with it.   five months to get LD/ST right, four months to get CR operations, mtcr and mfocr right.
Comment 54 Luke Kenneth Casson Leighton 2021-01-04 16:38:08 GMT
alexandre i have another way to help you think this through.  what behaviour do you expect the HDL - in the scalar implementation - to perform when transferring data from memory into the register file:

* ld     LE: ordering of the MEANING in regfile is in {insert order}
* ldbrx  LE: ordering of the MEANING in regfile is in {insert order}
* ld     BE: ordering of the MEANING in regfile is in {insert order}
* ldbrx  BE: ordering of the MEANING in regfile is in {insert order}

following on from that, what behaviour do you expect in the ALUs in each case?
specifically: how should the ALUs read and write the data to and from the regfile?

there should be (potentially) two orders at that point (two "meanings" as you term them).

* following {insert order above} into regfile, ALUs read/write in {insert order}
* following {insert order above} into regfile, ALUs read/write in {insert order}
Comment 55 Jacob Lifshay 2021-01-04 20:00:45 GMT
(In reply to Luke Kenneth Casson Leighton from comment #53)
> (In reply to Alexandre Oliva from comment #50)
> > do you see now why it doesn't make sense that the conversion from BE to LE
> > (or vice versa) places the MSByte in the LSByte?
> 
> deep breath: it doesn't matter if it "makes sense", it's what the actual
> code - the simulator, the HDL of microwatt and the HDL of Libre-SOC -
> actually do.

Yes, and Alexandre and I are saying that the CPU should be changed for the software reasons explained previously:

registers should keep their contents conceptually in the data endian mode the cpu is currently in; all arithmetic operations should byteswap from the registers' endian mode to LE (or whatever endian the ALU is implemented in), byteswapping at the element size of the operation, then swap back to the data endian mode to store the results in registers.

In BE mode, all of those byteswaps swap between BE in the registers and LE for the ALUs.

In LE mode, all of those byteswaps swap between LE in the registers and LE for the ALUs -- here the byteswaps are actually no-ops.

Switching between BE and LE mode by flipping the mode bit in the appropriate SPR will byteswap all registers at 64-bit width in order to keep their values for OpenPower v3.x compatibility.

All the above is how it looks to the ISA-level programmer. The hardware can and should implement it differently but it has to end up looking like the above.

Possible implementation: Byte-swapping networks are added to ALUs that can swap at all sizes the ALUs support (so a 16-bit op tells the byte-swapper to swap or not at 16-bit size, a 32-bit op tells the byte-swapper to swap or not at 32-bit size, and so on for other sizes); the data busses and argument/result latches store results in the CPU's current endian mode.
the registers are always stored in LE mode for 64-bit values; the register R/W ports byteswap at 64-bit size if the data-endian mode SPR bit is BE, otherwise they are passthroughs.

changing the data endian mode SPR bit causes a total pipeline flush, flips the bit (enabling/disabling the 64-bit byteswapper at the registers), then resumes the CPU.

the PC and SPR registers are always kept in LE form, since they are not byte-addressable, making the CPU simpler. instructions that copy ISA-level register values from/to SPRs (or PC) will byteswap at the appropriate size as part of getting the actual value of the register from the internal busses/latches.
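
a rough python model of the scheme above (a sketch under the stated assumptions; the function names are illustrative, not from any codebase):

    def port_rw(storage_le: bytes, msr_be: bool) -> bytes:
        # register R/W port: storage is always LE at 64-bit; swap at
        # 64-bit width when the data-endian mode is BE
        return storage_le[::-1] if msr_be else storage_le

    def alu_side_swap(operand: bytes, elsize: int, msr_be: bool) -> bytes:
        # ALU input/output: byteswap per element at the operation's
        # element size in BE mode; a no-op in LE mode
        if not msr_be:
            return operand
        return b''.join(operand[i:i+elsize][::-1]
                        for i in range(0, len(operand), elsize))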
Comment 56 Luke Kenneth Casson Leighton 2021-01-05 00:48:23 GMT
(In reply to Jacob Lifshay from comment #55)
> (In reply to Luke Kenneth Casson Leighton from comment #53)
> > (In reply to Alexandre Oliva from comment #50)
> > > do you see now why it doesn't make sense that the conversion from BE to LE
> > > (or vice versa) places the MSByte in the LSByte?
> > 
> > deep breath: it doesn't matter if it "makes sense", it's what the actual
> > code - the simulator, the HDL of microwatt and the HDL of Libre-SOC -
> > actually do.
> 
> Yes, and Alexandre and I are saying that the CPU should be changed for the
> software reasons explained previously:

the hardware is already so complex, and this introduces yet another dimension of complexity, that it is simply not a good idea to continue discussing this for implementation at this time.

changes involving the register files are, as you have noted many times already, where i put my foot down and say "no" due to the inherent complexity involved in even beginning to assess, starting from the discussion and escalating from there.

i have also made it clear a number of times how far behind we are and how urgent it is that fundamental architectural design changes stop being added.  "leaf node" ideas not a problem: massive impacting core design changes like this one, they should have been brought up 18 months ago.

we are unfortunately here hindered further in the discussion by not having nailed down a common frame of reference.

i did a quick check: both x86 and ARM do not do this.  ARM NEON LD/ST can perform byteswapping based on endianness: they do NOT allow the endianness optionally to propagate to the ALUs.  x86 SIMD chose LE and that's that.

 
> registers should keep their contents conceptually in the data endian mode
> the cpu is currently in,

the cost to an architecture in doing that is just insane.  registers effectively need to be "tagged" (context propagated and saved or inferred somehow) and/or byteswapped prior to use in the ALU.

this is insane.  to cover all SIMD permutations every register port needs an 8-to-8 8-bit crossbar in front of it, and we have a huge number of regfile ports.

that is not happening (as in: i am saying: it's not going to happen)



> all arithmetic operations should byteswap from the
> registers' endian mode to LE (or whatever endian the ALU is implemented in)
> for operations byteswapping at the element-size for the operation, then back
> to the data endian mode to store the results in registers.

no.

absolutely no way.  due to the ALUs being SIMD you're asking for full 64 bit 8x8 crossbars on either every FU or on every register file port, because it's not just 64 bit that needs 8-byte swapping, it's *all possible permutations* of 8, 16, 32 and 64 bit data that need bytereversing.

it is 100% the case that that is not going into the design or the silicon.


> In BE mode, all of those byteswaps swap between BE in the registers and LE
> for the ALUs.

no.  far too costly both in gate count, assessment time, evaluation time, specification writing time: everything about this screams "no".


> In LE mode, all of those byteswaps swap between LE in the registers and LE
> for the ALUs -- here the byteswaps are actually no-ops.

wires.  that's how it's going to be.

> Switching between BE and LE mode by flipping the mode bit in the appropriate
> SPR will byteswap all registers at 64-bit width in order to keep their
> values for OpenPower v3.x compatibility.

if this was remotely workable (not an insanely costly gate count) i would say "yes, and we defer it"

however due to us doing SIMD ALUs it is not even the case that a straight 64 bit mux swapping order of 8 bytes or not can do the job: it can't.

what you're asking for is a full 8-in 8-out crossbar and the gate count on such is so enormous (10 gates per 2x2 crossbar, 3 layers, 64 bits, a total of 1920 gates *per regfile port* and we will easily end up with well north of 30-40 regfile ports) that the answer has to be no.

the reason why it is acceptable in LD/ST is because the byteswapping is isolated to the LDST units.

*not* on the front of every single regfile port.

answer: no.  this is not going to happen.

i'll wait for alexandre, so that you have opportunity to understand here how bytereversing works in the HDL and removes the ordering at the memory layer.

after that i would like this bugreport firmly closed as "WONTFIX" or "INVALID".

if there exist any architecture that does this type of register-exposed byteswapping in the SIMD units then it can be reopened.

preliminary indications are that absolutely nobody does this.  which, in terms of gcc support, makes it far more work to support such an architectural augmentation rather than less.

ARM NEON does the exact same trick that has been described in the SV spec:

https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Application-Level-Memory-Model/Endian-support/Endianness-in-Advanced-SIMD?lang=en

note how the diagram works.

* BE memory load performs byteswapping.  data is stored in the regfile in LE

* LE memory load is straight.  data is stored in the regfile in LE.

in both cases data is stored in LE, and processed in LE.  this is the sane way to do it.  preserving the memory order and expecting the ALUs to cope is insane and costly (and without precedent in the industry and in gcc)

with NEON doing SIMD regs only in LE, and with x86 doing SIMD regs only in LE, these are precedents that allow exploration of how their support in gcc is implemented.
Comment 57 Luke Kenneth Casson Leighton 2021-01-05 01:01:53 GMT
(In reply to Cesar Strauss from comment #52)
> It seems to me that, as long as:
> 
> 1) we rigorously stick to vector (SVP64, SUBVL) load, stores and operations
> on vector registers,
> 2) stick to predication to access its sub-elements,
> 3) do not use non-SVP64 instructions on register previously used as vectors
> and vice-versa,
> 4) do not change SUBVL on the same vector register
> 
> Then, the "endianess of the register file", and "VL indexing direction"
> should become totally transparent (architecturally invisible).

right.  it occurred to me to point out that it is perfectly reasonable for the compiler to make these restrictions/assumptions in order to simplify the implementation.

then at a later date optimisations may be applied once the (suboptimal) core is understood.

however it does not necessarily follow that the hardware should be limited such that it is incapable of supporting the above flexibility.

here is a really useful discussion about x86 SIMD:

https://stackoverflow.com/questions/24045102/how-does-endianness-work-with-simd-registers

the contributors conclude that x86 SIMD registers have an "implicit endianness" and that endianness is LE.

by copying the industry standard conventions it is easy to follow what gcc does for those architectures and do the same thing.
Comment 58 Jacob Lifshay 2021-01-05 01:29:01 GMT
I found a pretty good explanation of a similar issue on Arm NEON on big-endian:
https://llvm.org/docs/BigEndianNEON.html
Comment 59 Luke Kenneth Casson Leighton 2021-01-05 01:49:34 GMT
(In reply to Jacob Lifshay from comment #58)
> I found a pretty good explanation of a similar issue on Arm NEON on
> big-endian:
> https://llvm.org/docs/BigEndianNEON.html

ahh, that's a really good article.  describes the problem well, illustrates what bitcasting is (which i was too distracted by the confusion over ldbrx/ld to understand), allows me to understand that NEON SIMD registers are exactly the same as SV regfile layout (LE ordering), and provides a solution.

they don't *like* the solution (use LD1) but it is a solution.

trying to solve this one in hardware is just too much.  following the exact same path as described there for NEON in LLVM will do the job without introducing horrendous hardware cost and complexity.
Comment 60 Alexandre Oliva 2021-01-05 03:12:50 GMT
Ok, here we go.  According to v3.0B p25, a doubleword 0x2122_2324_2526_2728 is represented in memory like this:

endianness: { ([offset] = value,)* };

      from most to least significant
BE: { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25, [5] = 0x26, [6] = 0x27, [7] = 0x28, };

      from most to least significant
LE: { [7] = 0x21, [6] = 0x22, [5] = 0x23, [4] = 0x24, [3] = 0x25, [2] = 0x26, [1] = 0x27, [0] = 0x28, };

which could also be equivalently depicted like this:

      from least to most significant
LE: { [0] = 0x28, [1] = 0x27, [2] = 0x26, [3] = 0x25, [4] = 0x24, [5] = 0x23, [6] = 0x22, [7] = 0x21, };

just like the earlier one could be equivalently depicted like this:

      from least to most significant:
BE: { [7] = 0x28, [6] = 0x27, [5] = 0x26, [4] = 0x25, [3] = 0x24, [2] = 0x23, [1] = 0x22, [0] = 0x21, };


> * ld     LE: ordering of the MEANING in regfile is in {insert order}
> * ld     BE: ordering of the MEANING in regfile is in {insert order}

if you load from memory, with the CPU configured to assume memory is in the same endianness as the representation you're reading from, what you get into the register can be most trivially represented as 0x2122_2324_2526_2728.  it could also be represented as 0x21 * 2^{56} + 0x22 * 2^{48} + 0x23 * 2^{40} + 0x24 * 2^{32} + 0x25 * 2^{24} + 0x26 * 2^{16} + 0x27 * 2^8 + 0x28 * 2^0.  or, if you're looking for a bitwise representation, here's one:

0 * 2^{0} + 0 * 2^{1} + 0 * 2^{2} + 1 * 2^{3} + 0 * 2^{4} + 1 * 2^{5} + 0 * 2^{6} + 0 * 2^{7} + 1 * 2^{8} + 1 * 2^{9} + 1 * 2^{10} + 0 * 2^{11} + 0 * 2^{12} + 1 * 2^{13} + 0 * 2^{14} + 0 * 2^{15} + 0 * 2^{16} + 1 * 2^{17} + 1 * 2^{18} + 0 * 2^{19} + 0 * 2^{20} + 1 * 2^{21} + 0 * 2^{22} + 0 * 2^{23} + 1 * 2^{24} + 0 * 2^{25} + 1 * 2^{26} + 0 * 2^{27} + 0 * 2^{28} + 1 * 2^{29} + 0 * 2^{30} + 0 * 2^{31} + 0 * 2^{32} + 0 * 2^{33} + 1 * 2^{34} + 0 * 2^{35} + 0 * 2^{36} + 1 * 2^{37} + 0 * 2^{38} + 0 * 2^{39} + 1 * 2^{40} + 1 * 2^{41} + 0 * 2^{42} + 0 * 2^{43} + 0 * 2^{44} + 1 * 2^{45} + 0 * 2^{46} + 0 * 2^{47} + 0 * 2^{48} + 1 * 2^{49} + 0 * 2^{50} + 0 * 2^{51} + 0 * 2^{52} + 1 * 2^{53} + 0 * 2^{54} + 0 * 2^{55} + 1 * 2^{56} + 0 * 2^{57} + 0 * 2^{58} + 0 * 2^{59} + 0 * 2^{60} + 1 * 2^{61} + 0 * 2^{62} + 0 * 2^{63}

here's another:

0 * 2^{63} + 0 * 2^{62} + 1 * 2^{61} + 0 * 2^{60} + 0 * 2^{59} + 0 * 2^{58} + 0 * 2^{57} + 1 * 2^{56} + 0 * 2^{55} + 0 * 2^{54} + 1 * 2^{53} + 0 * 2^{52} + 0 * 2^{51} + 0 * 2^{50} + 1 * 2^{49} + 0 * 2^{48} + 0 * 2^{47} + 0 * 2^{46} + 1 * 2^{45} + 0 * 2^{44} + 0 * 2^{43} + 0 * 2^{42} + 1 * 2^{41} + 1 * 2^{40} + 0 * 2^{39} + 0 * 2^{38} + 1 * 2^{37} + 0 * 2^{36} + 0 * 2^{35} + 1 * 2^{34} + 0 * 2^{33} + 0 * 2^{32} + 0 * 2^{31} + 0 * 2^{30} + 1 * 2^{29} + 0 * 2^{28} + 0 * 2^{27} + 1 * 2^{26} + 0 * 2^{25} + 1 * 2^{24} + 0 * 2^{23} + 0 * 2^{22} + 1 * 2^{21} + 0 * 2^{20} + 0 * 2^{19} + 1 * 2^{18} + 1 * 2^{17} + 0 * 2^{16} + 0 * 2^{15} + 0 * 2^{14} + 1 * 2^{13} + 0 * 2^{12} + 0 * 2^{11} + 1 * 2^{10} + 1 * 2^{9} + 1 * 2^{8} + 0 * 2^{7} + 0 * 2^{6} + 1 * 2^{5} + 0 * 2^{4} + 1 * 2^{3} + 0 * 2^{2} + 0 * 2^{1} + 0 * 2^{0}

and here's yet another:

0 * 2^{7} + 0 * 2^{32} + 1 * 2^{24} + 0 * 2^{2} + 0 * 2^{33} + 0 * 2^{6} + 0 * 2^{25} + 0 * 2^{57} + 0 * 2^{62} + 0 * 2^{22} + 1 * 2^{53} + 0 * 2^{42} + 0 * 2^{60} + 0 * 2^{47} + 0 * 2^{1} + 1 * 2^{45} + 1 * 2^{18} + 1 * 2^{26} + 0 * 2^{44} + 0 * 2^{52} + 0 * 2^{58} + 0 * 2^{4} + 0 * 2^{16} + 0 * 2^{0} + 0 * 2^{23} + 1 * 2^{49} + 1 * 2^{3} + 0 * 2^{38} + 0 * 2^{35} + 0 * 2^{11} + 0 * 2^{46} + 0 * 2^{28} + 1 * 2^{13} + 0 * 2^{19} + 1 * 2^{8} + 0 * 2^{50} + 0 * 2^{30} + 0 * 2^{51} + 0 * 2^{43} + 0 * 2^{12} + 1 * 2^{21} + 1 * 2^{17} + 1 * 2^{10} + 0 * 2^{36} + 0 * 2^{39} + 1 * 2^{61} + 0 * 2^{48} + 1 * 2^{9} + 1 * 2^{41} + 1 * 2^{56} + 0 * 2^{59} + 0 * 2^{14} + 0 * 2^{55} + 0 * 2^{27} + 1 * 2^{34} + 1 * 2^{37} + 0 * 2^{20} + 0 * 2^{63} + 0 * 2^{31} + 1 * 2^{29} + 1 * 2^{5} + 0 * 2^{54} + 0 * 2^{15} + 1 * 2^{40}

they're all equivalent, of course.  note how the order of presentation of the bits doesn't change the significance of the bits (the exponent associated with the bit), nor the aggregate value the bits represent together.

if loading a value from memory changed the aggregate value, it would be about as bad as if loading a vector from memory changed the order of its elements.


How am I doing so far?
Comment 61 Jacob Lifshay 2021-01-05 03:25:52 GMT
(In reply to Alexandre Oliva from comment #60)
> How am I doing so far?

Looks correct to me :)
Comment 62 Jacob Lifshay 2021-01-05 03:40:40 GMT
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Jacob Lifshay from comment #55)
> > (In reply to Luke Kenneth Casson Leighton from comment #53)
> > > (In reply to Alexandre Oliva from comment #50)
> > > > do you see now why it doesn't make sense that the conversion from BE to LE
> > > > (or vice versa) places the MSByte in the LSByte?
> > > 
> > > deep breath: it doesn't matter if it "makes sense", it's what the actual
> > > code - the simulator, the HDL of microwatt and the HDL of Libre-SOC -
> > > actually do.
> > 
> > Yes, and Alexandre and I are saying that the CPU should be changed for the
> > software reasons explained previously:
> 
> the hardware is already so complex and this introduces another dimension of
> complexity that it is going to be one of those things that is simply not a
> good idea to continue discussing for implementation at this time.
> 
> changes involving the register files are, as you have noted many times
> already, where i put my foot down and say "no" due to the inherent
> complexity involved in even beginning to assess, starting from the
> discussion and escalating from there.

Note that the HW implementation I proposed in comment #55 would require a 5-input mux on ALU pipelines inputs/outputs and a 2-input mux on register R/W ports.

Using a slight variation (busses and result latches always in LE, ALUs byteswap differently to compensate) of that design that I haven't completely thought through, it can be reduced to a 4-input mux on ALU pipelines inputs/outputs and no mux on the registers.
Comment 63 Jacob Lifshay 2021-01-05 03:49:09 GMT
(In reply to Jacob Lifshay from comment #62)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > (In reply to Jacob Lifshay from comment #55)
> > > (In reply to Luke Kenneth Casson Leighton from comment #53)
> > > > (In reply to Alexandre Oliva from comment #50)
> > > > > do you see now why it doesn't make sense that the conversion from BE to LE
> > > > > (or vice versa) places the MSByte in the LSByte?
> > > > 
> > > > deep breath: it doesn't matter if it "makes sense", it's what the actual
> > > > code - the simulator, the HDL of microwatt and the HDL of Libre-SOC -
> > > > actually do.
> > > 
> > > Yes, and Alexandre and I are saying that the CPU should be changed for the
> > > software reasons explained previously:
> > 
> > the hardware is already so complex and this introduces another dimension of
> > complexity that it is going to be one of those things that is simply not a
> > good idea to continue discussing for implementation at this time.
> > 
> > changes involving the register files are, as you have noted many times
> > already, where i put my foot down and say "no" due to the inherent
> > complexity involved in even beginning to assess, starting from the
> > discussion and escalating from there.
> 
> Note that the HW implementation I proposed in comment #55 would require a
> 5-input mux on ALU pipelines inputs/outputs and a 2-input mux on register
> R/W ports.

I think the 5-input mux is a far cry from the "full 8-in 8-out crossbar" that you were afraid of needing. it would be 6*64 gates (5 2-in NAND and 1 5-in NAND) per 64-bit input/output with a 2 gate delay. I'd expect that to be small enough to be doable.

> Using a slight variation (busses and result latches always in LE, ALUs
> byteswap differently to compensate) of that design that I haven't completely
> thought through, it can be reduced to a 4-input mux on ALU pipelines
> inputs/outputs and no mux on the registers.

Note that, in the slight variation mentioned above, anything that does operations at something other than 64-bit element size (mostly just partitionable ALUs) would need the ALU byteswapping, pipelines that only operate at 64-bit element size are like the register ports and wouldn't require any byteswapping.

it would be 5*64 gates (4 2-in NAND and 1 4-in NAND) per 64-bit input/output with a 2 gate delay. I'd expect that to be small enough to be easily doable.
Comment 64 Alexandre Oliva 2021-01-05 03:51:49 GMT
* ldbrx  LE: ordering of the MEANING in regfile is in {insert order}
* ldbrx  BE: ordering of the MEANING in regfile is in {insert order}

if data in memory is laid out in a way opposite to the selected CPU endianness, with ldbrx you get the register set up as depicted in comment 60.


if it's laid out in a way that matches the selected CPU endianness,
you get these bits into the register:

1 * 2^{0} + 0 * 2^{1} + 0 * 2^{2} + 0 * 2^{3} + 0 * 2^{4} + 1 * 2^{5} + 0 * 2^{6} + 0 * 2^{7} + 0 * 2^{8} + 1 * 2^{9} + 0 * 2^{10} + 0 * 2^{11} + 0 * 2^{12} + 1 * 2^{13} + 0 * 2^{14} + 0 * 2^{15} + 1 * 2^{16} + 1 * 2^{17} + 0 * 2^{18} + 0 * 2^{19} + 0 * 2^{20} + 1 * 2^{21} + 0 * 2^{22} + 0 * 2^{23} + 0 * 2^{24} + 0 * 2^{25} + 1 * 2^{26} + 0 * 2^{27} + 0 * 2^{28} + 1 * 2^{29} + 0 * 2^{30} + 0 * 2^{31} + 1 * 2^{32} + 0 * 2^{33} + 1 * 2^{34} + 0 * 2^{35} + 0 * 2^{36} + 1 * 2^{37} + 0 * 2^{38} + 0 * 2^{39} + 0 * 2^{40} + 1 * 2^{41} + 1 * 2^{42} + 0 * 2^{43} + 0 * 2^{44} + 1 * 2^{45} + 0 * 2^{46} + 0 * 2^{47} + 1 * 2^{48} + 1 * 2^{49} + 1 * 2^{50} + 0 * 2^{51} + 0 * 2^{52} + 1 * 2^{53} + 0 * 2^{54} + 0 * 2^{55} + 0 * 2^{56} + 0 * 2^{57} + 0 * 2^{58} + 1 * 2^{59} + 0 * 2^{60} + 1 * 2^{61} + 0 * 2^{62} + 0 * 2^{63}

AKA:

0 * 2^{63} + 0 * 2^{62} + 1 * 2^{61} + 0 * 2^{60} + 1 * 2^{59} + 0 * 2^{58} + 0 * 2^{57} + 0 * 2^{56} + 0 * 2^{55} + 0 * 2^{54} + 1 * 2^{53} + 0 * 2^{52} + 0 * 2^{51} + 1 * 2^{50} + 1 * 2^{49} + 1 * 2^{48} + 0 * 2^{47} + 0 * 2^{46} + 1 * 2^{45} + 0 * 2^{44} + 0 * 2^{43} + 1 * 2^{42} + 1 * 2^{41} + 0 * 2^{40} + 0 * 2^{39} + 0 * 2^{38} + 1 * 2^{37} + 0 * 2^{36} + 0 * 2^{35} + 1 * 2^{34} + 0 * 2^{33} + 1 * 2^{32} + 0 * 2^{31} + 0 * 2^{30} + 1 * 2^{29} + 0 * 2^{28} + 0 * 2^{27} + 1 * 2^{26} + 0 * 2^{25} + 0 * 2^{24} + 0 * 2^{23} + 0 * 2^{22} + 1 * 2^{21} + 0 * 2^{20} + 0 * 2^{19} + 0 * 2^{18} + 1 * 2^{17} + 1 * 2^{16} + 0 * 2^{15} + 0 * 2^{14} + 1 * 2^{13} + 0 * 2^{12} + 0 * 2^{11} + 0 * 2^{10} + 1 * 2^{9} + 0 * 2^{8} + 0 * 2^{7} + 0 * 2^{6} + 1 * 2^{5} + 0 * 2^{4} + 0 * 2^{3} + 0 * 2^{2} + 0 * 2^{1} + 1 * 2^{0}

AKA:

0x2827262524232221 = 2893323226570760737

these are also the values you get into a register if you use ld to transfer from memory data that's laid out with the endianness opposite to the one selected in the CPU

it's worth pointing out that CPU endianness and ldbrx reverse bytes, but not bits, as one might otherwise expect.  that's not entirely surprising, considering that bytes are the minimum addressable unit in memory, and opcodes that select, shift and otherwise identify bits operate on a basis of significance rather than endianness.  e.g., shifting left moves bits to more significant positions, whether the CPU is in big or little endian mode.
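
a small C illustration of that byte-level (not bit-level) reversal, assuming a little-endian host (the hex values match comments 60 and 64):

#include <stdint.h>
#include <string.h>

/* the bytes laid out in memory in ascending address order */
static const uint8_t mem[8] = { 0x21,0x22,0x23,0x24,0x25,0x26,0x27,0x28 };

uint64_t load_native(void)          /* plain ld on an LE CPU */
{
    uint64_t r;
    memcpy(&r, mem, 8);             /* r == 0x2827262524232221 */
    return r;
}

uint64_t load_byterev(void)         /* ldbrx-style: bytes reversed */
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 8) | mem[i];      /* r == 0x2122232425262728 */
    return r;
}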
Comment 65 Alexandre Oliva 2021-01-05 04:21:45 GMT
now let's look at the implications of byte endianness for vectors of bytes, shall we?

v8qi x = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25, [5] = 0x26, [6] = 0x27, [7] = 0x28 };

regardless of endianness, ((char*)&x)[0] == 0x21, and ((char*)&x)[7] == 0x28
same as if we had a char[8]:

char y[8] = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25, [5] = 0x26, [6] = 0x27, [7] = 0x28 };

the expectation is that if you memcpy between an array and a vector, and vice-versa, elements remain in the same order.  a different memory representation for vectors would break this.
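
to make the memcpy property concrete (GCC vector extension, spelled out here for self-containment):

#include <string.h>

typedef unsigned char v8qi __attribute__((vector_size(8)));

/* element order is preserved across memcpy in both directions,
   regardless of endianness, because vectors use array layout */
void roundtrip(void)
{
    v8qi x = { 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28 };
    unsigned char y[8];
    memcpy(y, &x, 8);               /* y[0] == 0x21, y[7] == 0x28 */
    v8qi z;
    memcpy(&z, y, 8);               /* z[0] == 0x21, z[7] == 0x28 */
    (void)z;
}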


the normal expectation is that, if data is laid out in memory in accordance with the selected endianness, it is loaded from memory into registers using regular loads, rather than with byte-reversing loads, right?


if we load the byte vector element-wise, we get regX[0] = 0x21 and regX[7] = 0x28, *regardless* of how we lay out the elements within the register.

this is good, and is the best you can do if data is not more aligned than strictly needed, or if vector sizes might be clamped at less than 8 byte-sized elements.


but from a performance standpoint, loading the bytes separately is quite inefficient.  it would be far more desirable to be able to load dwords rather than bytes, since that's 8 bytes per effective instruction, instead of just 1.

so, what does it take to get the iteration within the register to enable wide memory loads in the CPU/system-selected endianness?

in LE, the x above gets loaded by ld as 0x2827262524232221, as in comment 64; so, in order for the iteration order to match the declaration, element 0 is at bits 2^{0}..2^{7}, element 1 is at bits 2^{8}..2^{15}, and so on.

in BE, however, the x above gets loaded by ld as 0x2122232425262728, as in comment 60, so, in order for the iteration order to match the declaration, element 0 has to be at bits 2^{56}..2^{63}, element 1 at bits 2^{48}..2^{55}, and so on.

this is not reversing anything, not introducing any computations anywhere, it's just making the svp64 loop iterate over multiple elements in the same register in the same order you'd go over them if they were in memory.
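
a minimal C sketch of that iteration order, with cpu_is_le as a hypothetical stand-in for the MSR LE bit:

#include <stdint.h>

/* element i of a byte vector held in a 64-bit register that was
   filled by a plain (non-byte-reversing) ld */
uint8_t get_element(uint64_t reg, int i, int cpu_is_le)
{
    int shift = cpu_is_le ? i * 8        /* LE: elem 0 at bits 0..7   */
                          : (7 - i) * 8; /* BE: elem 0 at bits 56..63 */
    return (uint8_t)(reg >> shift);
}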

that's all I am suggesting.  not any of the other modifications you're alluding to and apparently freaking out about.  can you please describe the change I'm suggesting, in your own words, so that we can make sure we're not talking way past each other any more?
Comment 66 Luke Kenneth Casson Leighton 2021-01-05 04:49:52 GMT
(In reply to Alexandre Oliva from comment #60)
> Ok, here we go.  According to v3.0B p25, a doubleword 0x2122_2324_2526_2728
> is represented in memory like this:
> 
> endianness: { ([offset] = value,)* };
> 
>       from most to least significant
> BE: { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25, [5] =
> 0x26, [6] = 0x27, [7] = 0x28, };
> 
>       from most to least significant
> LE: { [7] = 0x21, [6] = 0x22, [5] = 0x23, [4] = 0x24, [3] = 0x25, [2] =
> 0x26, [1] = 0x27, [0] = 0x28, };

ok, so that describes memory-to-number, where number is in a well-understood convention from computer science (a hexadecimal representation).

now.

how would you expect, in a scalar v3.0B OpenPOWER processor, that number to be stored in the regfile, in each of the 4 cases?
Comment 67 Alexandre Oliva 2021-01-05 05:00:38 GMT
and just in case you're unconvinced because for 8-bit vector elements there's byte-reversing load, consider 16-bit and 32-bit vector elements, including floating-point ones; there aren't hword- and word-reversing loads, are there?

yes, loading one element at a time would still work, but we're talking about optimizing the code for better performance, abiding by the documented layouts, and avoiding making things gratuitously harder for either endianness.
Comment 68 Luke Kenneth Casson Leighton 2021-01-05 05:05:54 GMT
(In reply to Jacob Lifshay from comment #62)

> Note that the HW implementation I proposed in comment #55 would require a
> 5-input mux on ALU pipelines inputs/outputs and a 2-input mux on register
> R/W ports.

no, this is plain wrong, and is misleading alexandre who is not familiar with gate level design and assessment.

you neglected to mention that those muxes are 64 bits wide.  consequently he believes that the gate count is only 5.

alexandre: 2 way muxes for one single bit are around 5 gates.  however a mux is not required here, a crossbar is.  a 2-2 crossbar is 10 gates (basically 2 muxes).  that's per pair of bits.  look up "butterfly network"

also, it's nowhere near as simple as you believe it to be.  plus, muxes are inadequate.

we are doing 64 bit SIMD backend ALUs.  if the data in the regfile is not converted to a sane format, then the conversion of every element must be carried out at the regfile port.

we are designing Dynamic Partitioned SIMD and consequently the 64 bit SIMD must now  have all possible permutations of byteswapping at the regfile port.

to cover all possibilities we must first enumerate those possibilities.  they are:

* 8 8 8 8 8 8 8 8
* 16 8 8 8 8 8
* 8 16 8 8 8
* ....
* 16 16 ... 
* 24 8 8 8 ..
* 8 8 ... 16 8..

finally at long last you get to 1x 64 bit. 
total: 128 combinations
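
(as a sanity check on that count: the ordered ways to partition 8 bytes into byte-multiple lanes are the compositions of 8, i.e. 2^7 = 128.  a quick C verification:)

/* count_partitions(8) returns 128: one choice of first-lane size
   (1..8 bytes) times the partitions of whatever remains */
static int count_partitions(int bytes_left)
{
    if (bytes_left == 0)
        return 1;
    int n = 0;
    for (int lane = 1; lane <= bytes_left; lane++)
        n += count_partitions(bytes_left - lane);
    return n;
}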

to cover all of these REQUIRES a full 8x8 crossbar.

the most efficient method for this is a butterfly network.

and that takes around the 2,000 gates mark for 64 bit 8x8

please understand and accept that the decision has been made: there is just no way this is practical.

it's not going in.  i can keep on repeating this as many times as it takes to be accepted: bear in mind that every time i have to repeat it we are wasting more and more valuable time.
Comment 69 Alexandre Oliva 2021-01-05 05:10:45 GMT
there's only one bit pattern that represents that number in binary, and that's:

0 * 2^{0} + 0 * 2^{1} + 0 * 2^{2} + 1 * 2^{3} + 0 * 2^{4} + 1 * 2^{5} + 0 * 2^{6} + 0 * 2^{7} + 1 * 2^{8} + 1 * 2^{9} + 1 * 2^{10} + 0 * 2^{11} + 0 * 2^{12} + 1 * 2^{13} + 0 * 2^{14} + 0 * 2^{15} + 0 * 2^{16} + 1 * 2^{17} + 1 * 2^{18} + 0 * 2^{19} + 0 * 2^{20} + 1 * 2^{21} + 0 * 2^{22} + 0 * 2^{23} + 1 * 2^{24} + 0 * 2^{25} + 1 * 2^{26} + 0 * 2^{27} + 0 * 2^{28} + 1 * 2^{29} + 0 * 2^{30} + 0 * 2^{31} + 0 * 2^{32} + 0 * 2^{33} + 1 * 2^{34} + 0 * 2^{35} + 0 * 2^{36} + 1 * 2^{37} + 0 * 2^{38} + 0 * 2^{39} + 1 * 2^{40} + 1 * 2^{41} + 0 * 2^{42} + 0 * 2^{43} + 0 * 2^{44} + 1 * 2^{45} + 0 * 2^{46} + 0 * 2^{47} + 0 * 2^{48} + 1 * 2^{49} + 0 * 2^{50} + 0 * 2^{51} + 0 * 2^{52} + 1 * 2^{53} + 0 * 2^{54} + 0 * 2^{55} + 1 * 2^{56} + 0 * 2^{57} + 0 * 2^{58} + 0 * 2^{59} + 0 * 2^{60} + 1 * 2^{61} + 0 * 2^{62} + 0 * 2^{63}

I can imagine that, by confusing presentation order with significance, you may come to a notion that the in-register representation of a number could vary depending on endianness.  it doesn't, because endianness is thrown out the window in a register.  the wires and flip-flops holding the bits may be shuffled along with their significances however you like, and the represented number won't change, any more than it would if you held a soroban in front of a mirror, or wrote a date as MM/DD or DD/MM.

now, you say there are 4 cases, but even in memory there are only two endiannesses; what other binary choice are you adding to the in-memory array/vector representations to make it 4 cases rather than two?

even with byte-reversing loads, for which you enumerated 4 cases, it's reduced to two cases, because two reversals cancel out; and if you came up with a notion of endianness that made sense for a register, rather than memory, that would make for 8 cases, not 4; so would you please spell out the 4 cases you had in mind in your question?
Comment 70 Alexandre Oliva 2021-01-05 05:13:13 GMT
I'm not at all concerned with muxes, my suggestion has nothing to do with them, but I am familiar with hardware gates and muxes and that sort of stuff, thank you very much.

my suggestion is about iteration order only.  since in a vector you'll normally iterate over all elements, you need the gates to select each of the sub-units.  the only thing that changes is which sub-units you visit first.
Comment 71 Cole Poirier 2021-01-05 05:13:52 GMT
(In reply to Alexandre Oliva from comment #65)
> now let's look at the implications of byte endianness for vectors of bytes,
> shall we?
> 
> v8qi x = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25, [5] =
> 0x26, [6] = 0x27, [7] = 0x28 };
> 
> regardless of endianness, ((char*)&x)[0] == 0x21, and ((char*)&x)[7] == 0x28
> same as if we had a char[8]:
> 
> char y[8] = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24, [4] = 0x25,
> [5] = 0x26, [6] = 0x27, [7] = 0x28 };
> 
> the expectation is that if you memcpy between an array and a vector, and
> vice-versa, elements remain in the same order.  a different memory
> representation for vectors would break this.
> 
> 
> the normal expectation is that, if data is laid out in memory in accordance
> with the selected endianness, it is loaded from memory into registers using
> regular loads, rather than with byte-reversing loads, right?
> 
> 
> if we load the byte vector element-wise, we get regX[0] = 0x21 and regX[7] =
> 0x28, *regardless* of how we lay out the elements within the register.
> 
> this is good, and is the best you can do if data is not more aligned than
> strictly needed, or if vector sizes might be clamped at less than 8
> byte-sized elements.
> 
> 
> but from a performance standpoint, loading the bytes separately is quite
> inefficient.  it would be far more desirable to be able to load dwords
> rather than bytes, since that's 8 bytes per effective instruction, instead
> of just 1.
> 
> so, what does it take to get the iteration within the register to enable
> wide memory loads in the CPU/system-selected endianness?
> 
> in LE, the x above gets loaded by ld as 0x2827262524232221, as in comment 64
> so, in order for the iteration order to match the declaration, element 0 is
> at bits 2^0..2^7, element 1 is at bits 2^8..2^{15}, and so on.
> 
> in BE, however, the x above gets loaded by ld as 0x2122232425262728, as in
> comment 60, so, in order for the iteration order to match the declaration,
> element 0 has to be at bits 2^{56}..2^{63}, element 1 at bits
> 2^{48}..2^{55}, and so on.
> 
> this is not reversing anything, not introducing any computations anywhere,
> it's just making the svp64 loop iterate over multiple elements in the same
> register in the same order you'd go over them if they were in memory.
> 
> that's all am I suggesting.  not any of the other modifications you're
> alluding to and apparently freaking out about.  can you please describe the
> change I'm suggesting, in your own words, so that we can make sure we're not
> talking way past each other any more?

Luke, can you please respond exclusively to Alexandre’s above comment? You are addressing something that is not what he has written here and until this comment is addressed in isolation and point by point you will continue to misunderstand Alexandre.
Comment 72 Alexandre Oliva 2021-01-05 05:15:44 GMT
> it's not going in

please spell out what it is that's not going in.

there are several unrelated conversations going on in this bug, and you're not making clear what it is that you're rejecting.  it even suggests that you haven't been able to tell the suggestions apart.
Comment 73 Luke Kenneth Casson Leighton 2021-01-05 05:30:35 GMT
(In reply to Alexandre Oliva from comment #67)
> and just in case you're unconvinced because for 8-bit vector elements
> there's byte-reversing load, consider 16-bit and 32-bit vector elements,
> including floating-point ones; there aren't hword- and word-reversing loads,
> are there?

alexandre: this is so complex that it is critically important for you to understand the basics - the scalar case - before moving on to the vector one.

also it is important to accept that the decision has been made: we are doing what ARM has done for NEON which is exactly what intel did for MMX.

the only thing left for discussion is "why has that decision been made"

i will repeat it again to you both: there is absolutely no way in hell that we are adding 2000 gates worth of dynamic switchable bytereversing in front of 40 regfile ports.  this is a final and irrevocable decision for which further discussion along the lines of "are you sure, can i convince you otherwise" is 100% categorically going to meet with an emphatic and unchanging answer of "never".

the LLVM NEON link that jacob found provides the guidelines on the approach needed.

further fruitful discussion on this topic is best directed along the lines of, "so the decision has been made, the regfile is LE Ordered, just like in NEON and MMX, ok how do we fit things to that decision.  how do we clarify the spec to fit that decision.  how do we implement that decision in HDL and the SV simulators.  how do we best implement that decision in the compilers"

or, "i don't quite fully understand the decision, can you please explain it again"

so please: can i ask you, alexandre, to follow along step by step through the questions i am asking: do not add in vectors or any reference to vectors at all in any way until the base frame of reference - OpenPOWER v3.0B - has been understood.

there is an aspect of OpenPOWER that is critically important to understand and accept when it comes to implementing the HDL, without which, as i have said appx twice now, further discussion without a firm frame of reference is utterly pointless, because it will only lead to "i cannot understand because of assumption / miscommunication X".

please then: do not mention or discuss vectors further until scalar has been understood and accepted.
Comment 74 Luke Kenneth Casson Leighton 2021-01-05 05:37:00 GMT
(In reply to Alexandre Oliva from comment #72)
> > it's not going in
> 
> please spell out what it is that's not going in.

the topic of this thread (a dynamic LE-BE regfile).  the bugreport is to be closed as either INVALID or WONTFIX.

the decision, which is categorically final, is that the regfiles remain LE-ordered in meaning, just as is done in ARM NEON and Intel MMX.

the words "it's not going in" refer to Jacob's ideas of adding a full 8x8 bytelevel crossbar in front of every 64 bit regfile port in order to perform simultaneous multiple element byteswapping up to the SIMD ALU width of 64 bit.
Comment 75 Alexandre Oliva 2021-01-05 05:45:11 GMT
I thought I'd already shown that I understand how ld and ldbrx work.  What else remains so we can finish this exchange and get to my suggestion, that is not related with this?

> a dynamic LE-BE regfile

My suggestion of iteration order in sub-register vector elements is unrelated to this.

Should I file a separate bug for it so that you'll read it?
Comment 76 Luke Kenneth Casson Leighton 2021-01-05 06:06:32 GMT
(In reply to Cole Poirier from comment #71)

> Luke, can you please respond exclusively to Alexandre’s above comment?

no, and i will explain why (for the third time).

Alexandre you are jumping the gun by proceeding to the vector case before you have confirmed to me that you understand how OpenPOWER v3.0B scalar works at the HDL level.

without establishing that common frame of reference there are zero answers, there are only 100% disambiguating questions.



> You
> are addressing something that is not what he has written here and until this
> comment is addressed in isolation and point by point you will continue to
> misunderstand Alexandre.

no: i will continue to spend my time, just as i did in early comments, asking questions that point out ambiguities.  because there is no established common frame of reference i am forced to ask them, and until that common frame of reference is established, any efforts at communication are 100% ineffective.

we have three if not four or five separate discussions going on here

1) how OpenPOWER v3.0B Scalar works.

Alexandre has not established to me that he understands this and how it is implemented in libresoc and microwatt

2) how SV's regfile arrangement works

this is LE and the decision here is categorical and final.  no amount of discussion is going to change that decision.

3) how both Jacob and Alexandre would *like* SV's regfile arrangement to work

here the ideas are legitimate concerns but the complexity and cost is so high it is just not going to happen.  not now, not ever.

continued efforts are ongoing to convince me of a decision that has already been made, and ultimately it was the NEON LLVM link that showed me that a software solution exists, and that no insertion of regfile byteswapping is required.

it's not a "nice" software solution but it's infinitely better than ******g up the hardware.

4) discussion of LE/BE vector ordering

these are just horribly confusing, having a minimum of *FOUR* separate and distinct ways in which LE/BE could be applied.

(absolutely none of which are worth discussing because the decision has already been made)


now, with so many separate discussions, none of which are clear (or being accepted) the opportunities for cross-purposes are just nuts.

the discussion needs to be drastically cut back.  if that cannot be done very quickly i am going to simply close - and lock - this bugreport.  i am not happy about that; however, i have made it pretty clear how far behind we are.

it has been made clear about eight or nine times in the past 4 weeks that we are an entire year behind and need to get to implementation as fast as possible. that means being extremely disciplined and cutting unproductive discussions like this one stone dead.

when we are this far behind i do not mind "how is this going to be implemented" decisions.

major, major redesign suggestions, these are OUT.
Comment 77 Luke Kenneth Casson Leighton 2021-01-05 06:30:20 GMT
(In reply to Alexandre Oliva from comment #75)
> I thought I'd already shown that I understand how ld and ldbrx work. 

i am barely keeping up and i did make it clear to keep the discussion to scalar only.  i saw something in amongst vectors and consequently did not spend time reading it because until OpenPOWER v3.0B scalar is understood there is literally no point.

> What
> else remains so we can finish this exchange and get to my suggestion, that
> is not related with this?

i forgot to ask: can you read VHDL? or is the explanation of why XNOR is used sufficient?

the purpose of asking you the questions that i did, limited to v3.0B scalar only, was to help you understand the relationship, in the HDL source, between

* Memory
* Number (hexadecimal)
* regfile
* ALU

the answers in each of the cases were - are - LE. it was a trick question


* ld     LE: ordering of the MEANING in regfile LE
* ldbrx  LE: ordering of the MEANING in regfile LE
* ld     BE: ordering of the MEANING in regfile LE
* ldbrx  BE: ordering of the MEANING in regfile LE

> > a dynamic LE-BE regfile
> 
> My suggestion of iteration order in sub-register vector elements is
> unrelated to this.

the time for suggestions is, unfortunately, over.  we were supposed to have implemented SV in RISC-V about one year ago.  the RVF ******d that for us.

when you joined we were already behind by at least 12 months, not having had the time to even do the conversion of SV Prefix to OpenPOWER.

the only reason that the discussion of development of svp64 is taking place is because we're forced to, and forced to under some serious timecrunch.

the actual design of SV (the concept) was 18+ months ago, at which time suggestions would have been extremely welcome, and there would have been plenty of time to discuss and evaluate them.

now? right now? i'm really sorry to have to say this, but no.


 
> Should I file a separate bug for it so that you'll read it?

the best thing is to make sure that it is written down: if not fully detailed, then at least as some notes.

the reason is that whilst we move to implementation, it may turn out that we made an error or there was something we overlooked.

having a clear record of ideas gives us a reminder that we can go "ah! we discussed this before.  the solution was X".

rather than, "arse, we're screwed, we have to stop for 2 months and discuss redesigns"

so yes, please document it, i will look at it but unless it is in line with the decisions already made it will go into the "to look at again if we encounter a problem" pile.

this is how it has to be given the massive schedule delay and timecrunch we are under.

we are now moving from "spec wording review" mode into "draft spec freeze" mode so that we can get to "implement" mode within the next few days.
Comment 78 Alexandre Oliva 2021-01-05 06:32:15 GMT
> 3) how both Jacob and Alexandre would *like* SV's regfile arrangement to work

please state for the record how you imagine *I* would like it to work, since jumbling together what Jacob suggests and what I suggest seems to be leading to confusion.

references to the comments in which you based your imagination would be welcome.

hint: I have not suggested changes to the regfile arrangement.  not any.  not one.

if you somehow got the idea that I did, we've miscommunicated, and you've been opposing a strawman.



that you even think my understanding of ld and ldbrx insns is relevant WRT my suggestion goes to show how deep the miscommunication goes.


that you insist the register file is somehow little-endian is as nonsensical as insisting PPC bytes are big-endian regardless of the CPU endianness, or that the bits in it are yellow rather than purple.  it's the sort of statement that's "not even wrong": it starts from a fundamentally incorrect premise and arrives at nonsensical conclusions because of it, but one that's not disprovable, because it has no grounds on which to be assessed for truth.  that's so because there's no addressing order of the bits or of the bytes within a register to match or mismatch the significance order; all there is is the significance order, which doesn't imply any preferential iteration order for sub-units.

so, forget register endianness, even if just for the purposes of this conversation, because it's not relevant for it.  convince me that it's reasonable to deviate from the in-memory array iteration order that you've used as the reference in the documentation, *or* that we're not actually deviating from it.
Comment 79 Alexandre Oliva 2021-01-05 06:37:52 GMT
you may *think* the iteration order is already decided, but it really isn't.  the specification is incomplete.  it's ambiguous.  it works both ways.  but one has to be chosen.  and if you say this was chosen a year ago, that's just not true, because it's evident that the issue wasn't even considered, let alone decided on.  the possibilities are equally compatible with the existing specifications.

now, one of them will make for reasonable compiler implementation.

the other will make for huge compiler delays and risks and reworkings.

good luck getting funding and engineering to deal with the latter because you refused to think about it now, pretending you'd done it before.
Comment 80 Alexandre Oliva 2021-01-05 07:21:48 GMT
> can you read VHDL?

I had never tried before this interaction.  I've had no trouble whatsoever figuring it out.  It's very readable.  I can probably even write it sensibly, given a body of code to refer to, to pick up the constructs from.

What I couldn't find, unsurprisingly, was a specification of the endianness of integers.

I couldn't find definitions for directions such as left and right, up and down, forward and backward, or colors either.  Just as expected.
Comment 81 Jacob Lifshay 2021-01-05 07:28:56 GMT
(In reply to Luke Kenneth Casson Leighton from comment #68)
> (In reply to Jacob Lifshay from comment #62)
> 
> > Note that the HW implementation I proposed in comment #55 would require a
> > 5-input mux on ALU pipelines inputs/outputs and a 2-input mux on register
> > R/W ports.
> 
> no, this is plain wrong, and is misleading alexandre who is not familiar
> with gate level design and assessment.
> 
> you neglected to mention that those muxes are 64 bits wide.  consequently he
> believes that the gate count is only 5.

I explicitly mentioned that the muxes are 64 bits wide, along with the gate counts, in comment #63:
> I think the 5-input mux is a far cry from the "full 8-in 8-out crossbar" that
> you were afraid of needing. it would be 6*64 gates (4 2-in NAND and 1 5-in
> NAND) per 64-bit input/output with a 2 gate delay. I'd expect that to be
> small enough to be doable.


> we are designing Dynamic Partitioned SIMD and consequently the 64 bit SIMD
> must now  have all possible permutations of byteswapping at the regfile port.
> 
> to cover all possibilities we must first enumerate those possibilities. 
> they are:
> 
> * 8 8 8 8 8 8 8 8
> * 16 8 8 8 8 8
> * 8 16 8 8 8
> * ....
> * 16 16 ... 
> * 24 8 8 8 ..
> * 8 8 ... 16 8..
> 
> finally at long last you get to 1x 64 bit. 
> total: 128 combinations
> 
> to cover all of these REQUIRES a full 8x8 crossbar.

Yes, however, what happens if you only issue the same-sized elements to any one ALU each cycle:
8 8 8 8 8 8 8 8
16 16 16 16
32 32
64
but not:
8 8 16 32
or other non-uniform combinations.

At that point, the 5-input mux is sufficient (actually, only 4 inputs are needed), since there's one input for un-swapped data, 1 for 16-bit byte-swapped, 1 for 32-bit byte-swapped, and 1 for 64-bit byte-swapped.

Since the most common vectors are 64-bits or longer, forcing the ALUs to only process same-sized elements per cycle is a reasonable tradeoff.

Note that that doesn't mean we have to wait for the 32-bit ops to finish making it through the pipeline before we can issue 8-bit ops; they can be changed every cycle.
Comment 82 Jacob Lifshay 2021-01-05 09:47:51 GMT
(In reply to Jacob Lifshay from comment #81)
> Yes, however, what happens if you only issue the same-sized elements to any
> one ALU each cycle:
> 8 8 8 8 8 8 8 8
> 16 16 16 16
> 32 32
> 64
> but not:
> 8 8 16 32
> or other non-uniform combinations.
> 
> At that point, the 5-input mux is sufficient (actually, only 4-inputs
> needed), since there's one input for not byte swapped, 1 for 16-bit
> byte-swapped, 1 for 32-bit byte-swapped, and 1 for 64-bit byte-swapped.

Actually, if the 64-bit mux is instead partitioned into 8 8-bit muxes, we might be able to still support arbitrary combinations of 8/16/32/64-bit together, as long as elements are naturally aligned (16-bit is aligned to 16-bits, 32-bit is aligned to 32-bits, and so on):
Assuming registers/data buses are dynamically swapped between BE/LE depending on CPU data endian:
vector elements are still in the same order no matter their size; all that happens with byteswapping is that the bytes within an element are swapped, and no bytes swap past the element's boundaries. Therefore, the 4-input mux should still suffice, since each individual byte can only be unswapped, 16-bit swapped, 32-bit swapped, or 64-bit swapped. no other combinations are supported/needed since they are ruled out by the natural alignment requirements.
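
A hedged C sketch of that per-byte selection, for a hypothetical naturally-aligned mixed partition 16|16|32 (bytes 0-1, 2-3, 4-7): each byte i fetches from i XOR (esz-1), so no byte ever crosses its element boundary:

#include <stdint.h>

/* per-byte element sizes for the hypothetical 16|16|32 partition */
static const int esz_of_byte[8] = { 2, 2, 2, 2, 4, 4, 4, 4 };

static uint64_t mixed_swap(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        int src = i ^ (esz_of_byte[i] - 1);  /* stays inside element */
        r |= ((x >> (src * 8)) & 0xffULL) << (i * 8);
    }
    return r;
}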
Comment 83 Luke Kenneth Casson Leighton 2021-01-05 15:40:38 GMT
(In reply to Alexandre Oliva from comment #80)
> > can you read VHDL?
> 
> I had never tried before this interaction.   I've had no trouble whatsoever
> figuring it out.  It's very readable.  I can probably even write it
> sensibly, given a body of code to refer to to pick up the constructs from.

yes, it's very readable and clear; it took me 6 months to go through this same derivation process.  the microwatt team's expertise also shows in that they provide useful comments.
 

> What I couldn't find, unsurprisingly, was specification of endianness of
> integers.

ah, this will be in the VHDL spec.  readers are assumed to know it.  from what i can gather/infer, most HDLs perform arithmetic in LSB0 order, i.e. if performing addition the carry out from bit N goes into the input of bit N+1

however, not having read the spec, this LSB0 arithmetic meaning *may* be precisely because they use "downto".  however given that they do not use "upto" anywhere it is a nonissue.

(aside: this "meaning" of arithmetic in the HDL as it propagates through layer after layer is fundamentally where the decision to make the regfile LE stems from)
 
> I couldn't find definitions for directions such as left and right, up and
> down, forward and backward either, or colors, either.  Just as expected.

it's part of the language, expected to be implicitly understood.

i did a quick search "VHDL endianness" and came up with this:

https://insights.sigasi.com/tech/to-downto-ranges-vhdl/

that explains clearly what "downto" means. took me a while to absorb.

it is unfortunately the precise inverted opposite of how nmigen Cat() works. sigh.  but Cat() follows python argument ordering (*args), so hey what can you do.

fortunately the microwatt developers chose not to include "upto" which means we don't need to think of the underlying arithmetic as being BE or MSB0, only LE and LSB0.

which is why i picked it for an illustration.

so.

does that help, with those definitions at the VHDL language level, to understand the 4 combinations?  those are at comment #46 and comment #49.

i'll paste the relevant sections here:

  lv.byte_reverse := e_in.byte_reverse xnor ctrl.msr(MSR_LE);

this is the setup of the LDST module.  those are single bits so no discussion of ordering needed.

e_in.byte_reverse came from the Power Decoder, i'll paraphrase it from decode1.vhdl:
 
    Byte Reverse     Opcode
    0                ld
    1                ldbrx

you can check that in decode1.vhdl

this then is where the table comes from:

* ld LE=1 XNOR BR=0    lv.byte_reverse=0
* ld LE=0 XNOR BR=0    lv.byte_reverse=1
  (LE=0 means BE mode)

and for ldbrx:

* ldbrx LE=1 XNOR BR=1 lv.byte_reverse=1
* ldbrx LE=0 XNOR BR=1 lv.byte_reverse=0
  (LE=0 means BE mode)
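
(a minimal C model of that combination, with hypothetical names: lv.byte_reverse is set exactly when the decoder's byte-reverse flag equals the MSR LE bit)

int lv_byte_reverse(int opcode_byte_reverse, int msr_le)
{
    return !(opcode_byte_reverse ^ msr_le);   /* XNOR */
}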

argh hit send by mistake, have to edit from here.

so from there, let's move to the loadstore1.vhdl file.  and... urrr argh, line 342 they've merged the shift-and-byteswap into one function which i can't clearly interpret.

let's trust that it does what it says it does, and note simply that because everything is using "downto" this is a declaration that all arithmetic and all storage of data is performed in LSB0 order.

throughout the rest of the code we see no evidence of any other form of byteswapping *other* than here, in LDST.

there are a few other stages here, including checking the regfile layout (i am familiar with the microwatt code) so i know that it does not contain VHDL-language-implicit byteswapping (BE vs LE) or bitswapping (MSB0 rather than LSB0)

the *only* reordering for incoming memory is that XNOR line which performs explicit transformation

and that means - sorry to have to say this so plainly - that you are wrong about ld and ldbrx.

BE ld *does* perform a byteswap, because the XNOR combines in a way that merges the twin roles of LE/BE endianness "fixing" with the v3.0B spec meaning of "ldbrx" and "ld".

this is to get the data out of memory order into internal order and meaning within the VHDL source code of microwatt.

in other words the choice of LE and LSB0 in the microwatt regfile and ALUs is implicitly understood as part of the use of "downto" and part of the definition of arithmetic in the VHDL language.

with that decision having been made they then combined the dual roles to get an XNOR operator.

the reason i am asking you to walk through this is because i can see evidence that you are rejecting the above by stating "MSByte going into LSByte is nonsense, LD BE *must* go straight MSByte-to-MSByte", which unfortunately contradicts the reality of how microwatt (and LibreSOC) work for the 4 combinations ld/ldbrx LE/BE.

until i have confirmation that you understand and accept this, all further discussion is, sadly, invalid.

the same thing occurred with the 16bit Compressed discussion: ideas were presented before common ground was established, and flaws in understanding could only be worked out by stopping the discussion of ideas and focussing on understanding of the basics, first.

the exact same thing is happening here.

ideas are being put forward involving vectors and bitpacking before even the fundamentals of OpenPOWER v3.0B ld/ldbrx LE/BE interaction have been confirmed as common ground.

does that make sense? is it reasonable to establish common ground of the basics before moving on to ideas?
Comment 84 Luke Kenneth Casson Leighton 2021-01-05 16:27:12 GMT
(In reply to Jacob Lifshay from comment #82)

> still suffice since each individual byte can only be unswapped, 16-bit
> swapped, 32-bit swapped, or 64-bit swapped. no other combinations are
> supported/needed since they are ruled out by the natural alignment
> requirements.

this really needs diagrams to go over it.  i did a video, it is still uploading (horribly slowly)

i think what you are suggesting is one of the following:

* to impose a restriction on the permutations permitted by the Dynamic Partitioned SIMD ALUs (to 888888 16161616 3232 and 64) 

   this would, effectively at a specification level (conflating it with a limitation at the hardware level) eliminate the possibility for Dynamic SIMD ALUs to do 8888 in one half and use the other half for 32 bit, for example.  or, 8-8-32-8-8

* assumption that the elements will start at an aligned boundary (every 64 bits) and therefore that the permutations need only cover 88888888 16161616 3232 and 64.

  which sounds reasonable until the merging of different operations occurs, making use of empty lanes within the SIMD ALUs

now that i have written them out, those are the same thing.  i thought that one of them might involve moving the dynamic byteswapping to be part of MultiCompUnit, where the quantity of gates gets multiplied even more than it already is.

so, sorry, answer is no.  every way you look at it this is a total nightmare of insanity.  it simply can't go in.
Comment 85 Luke Kenneth Casson Leighton 2021-01-05 16:57:26 GMT
(In reply to Alexandre Oliva from comment #79)
> you may *think* the iteration order is already decided, but it really isn't.

we need to establish precisely where that lack of clarity exists, and word it accordingly and provide diagrams.

> the specification is incomplete.  it's ambiguous.  it works both ways.  but
> one has to be chosen. 

that choice is - was - LE.  it was implicitly made because RISC-V, on which this was all originally designed, removed BE entirely from early drafts of the spec, well before 2017 (RISC-II or RISC-III, i don't know which)

consequently it never came up.

> and if you say this was chosen a year ago, 

2 years ago.  2018?

> that's
> just not true, because it's evident that the issue wasn't even considered,
> let alone decided on. 

it was... but unfortunately it's become apparent that the typedef union, even when it's mentioned that the underlying bytes are LE ordered, is insufficient to be able to help people infer that.

what hasn't happened until now: because this is now OpenPOWER, the idea to allow element-level bytereversal had never been proposed before, so the lack of clarity never came up.


> now, one of them will make for reasonable compiler implementation.

the one that has been picked is identical to the one that was described by the LLVM NEON article that Jacob found.  thus the expectation is that there will be no complications.  it might not be nice (the article explains that there are no nice choices); they had to pick the least-worst one.

the stackoverflow link i found shows that both NEON internal element order and overall register byte numbering order are both fixed and hardcoded to LE.

now, it may not be clear (or accepted) that this is the case: it is however the case.

this is not helped by the fact that it's obviously not clearly stated by the ARM spec on NEON: outsiders from stackexchange had to *infer* this LE ordering of both the registers and the elements by analysing the instructions and their behaviour!

which is nuts, but it is what it is.

useful discussions from this point onwards involve clarifying exactly and precisely any ambiguities in the spec that prevent people from understanding, clearly and unequivocably, that we are doing precisely and exactly to the letter what NEON does as far as "meaning" is concerned.

as best i can tell, a way to state that clearly and unambiguously is:

* the element ordering of NEON registers is linear and its meaning is LE
* the ordering *in* elements within NEON registers has an LE meaning.
* the numbering of bytes comprising NEON registers is: byte 0 refers to the LSByte
* the numbering of bytes *in* elements is: byte 0 refers to the LSByte.

this is how it is in NEON. this is how it is in SV (with the overlapping).
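
a small C sketch of that stated convention (element i of width esz bytes occupies bits 8*esz*i up to 8*esz*(i+1)-1, with byte 0 the LSByte):

#include <stdint.h>

uint64_t extract_elem(uint64_t reg, int i, int esz_bytes)
{
    /* valid for i in 0 .. (8/esz_bytes)-1 */
    uint64_t mask = (esz_bytes == 8) ? ~0ULL
                                     : ((1ULL << (esz_bytes * 8)) - 1);
    return (reg >> (i * esz_bytes * 8)) & mask;
}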

is this clearer?
Comment 86 Jacob Lifshay 2021-01-05 17:02:08 GMT
(In reply to Luke Kenneth Casson Leighton from comment #84)
> (In reply to Jacob Lifshay from comment #82)
> 
> > still suffice since each individual byte can only be unswapped, 16-bit
> > swapped, 32-bit swapped, or 64-bit swapped. no other combinations are
> > supported/needed since they are ruled out by the natural alignment
> > requirements.
> 
> this really needs diagrams to go over it.  i did a video, it is still
> uploading (horribly slowly)
> 
> i think what you are suggesting is one of the following:
> 
> * to impose a restriction on the permutations permitted by the Dynamic
> Partitioned SIMD ALUs (to 888888 16161616 3232 and 64) 

That's what I proposed in comment #81; however, in comment #82 I realized that we don't need to restrict it that much.  We only need to restrict it to have elements be naturally aligned (not what you appear to think it is), which we need anyway since the SIMD mul unit already expects that.

I'll create some illustrations today.

> now i have written them out those are the same thing.  i thought that one of
> them might involve moing the dynamic byteswapping to be part of
> MultiCompUnit, where the quantity of gates gets multiplied even more than it
> already is.

The byte-swapping would be a pipeline stage right after the mux for selecting which FUs to execute, as well as another stage right before the results are sent back to those FUs. It most likely won't need to make any pipelines take any more clock cycles.

If we make changing the SPR bit for the CPU's data endian mode trap, so we can run 64-bit byteswap instructions on all registers to keep the OpenPower expected values, then we don't need the byteswap HW at the register file; it's only needed in the ALU/ld-st/etc. pipelines.
Comment 87 Jacob Lifshay 2021-01-05 18:30:56 GMT
Reopening since this needs more evaluation:

If we copy what VSX does, then we'll have to implement the byteswapping in the ALUs, since that's what VSX does:

Notice how the big-endian and little-endian Power code is identical -- implying that the registers switch between big-endian and little-endian. It changes on AArch64 (64-bit Arm), however, since they define the registers to be little-endian only and need to insert explicit byteswapping instructions.

https://godbolt.org/z/nbM5qq

C Source:
#include <stdint.h>

typedef uint8_t u8x16 __attribute__((vector_size(16)));
typedef uint16_t u16x8 __attribute__((vector_size(16)));
typedef uint32_t u32x4 __attribute__((vector_size(16)));
typedef uint64_t u64x2 __attribute__((vector_size(16)));

void ld_st(u8x16 *a, u8x16 *b, u16x8 *c, u16x8 *r) {
    u16x8 temp = (u16x8)(*a + *b);
    *r = temp + *c;
}

u16x8 by_value(u8x16 a, u8x16 b, u16x8 c) {
    u16x8 temp = (u16x8)(a + b);
    return temp + c;
}

u16x8 load_array(uint16_t *a) {
    a = (uint16_t *)__builtin_assume_aligned(a, 16);
    u16x8 retval;
    for(int i = 0; i < 8; i++)
        retval[i] = a[i];
    return retval;
}

Generated big-endian powerpc64:
ld_st:                                  # @ld_st
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        lxv 34, 0(3)
        lxv 35, 0(4)
        vaddubm 2, 3, 2
        lxv 35, 0(5)
        vadduhm 2, 3, 2
        stxv 34, 0(6)
        blr
        .long   0
        .quad   0
by_value:                               # @by_value
        .quad   .Lfunc_begin1
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin1:
        vaddubm 2, 3, 2
        vadduhm 2, 2, 4
        blr
        .long   0
        .quad   0
load_array:                             # @load_array
        .quad   .Lfunc_begin2
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin2:
        lxv 34, 0(3)
        blr
        .long   0
        .quad   0

Generated little-endian powerpc64le:
ld_st:                                  # @ld_st
        lxv 34, 0(3)
        lxv 35, 0(4)
        vaddubm 2, 3, 2
        lxv 35, 0(5)
        vadduhm 2, 3, 2
        stxv 34, 0(6)
        blr
        .long   0
        .quad   0
by_value:                               # @by_value
        vaddubm 2, 3, 2
        vadduhm 2, 2, 4
        blr
        .long   0
        .quad   0
load_array:                             # @load_array
        lxv 34, 0(3)
        blr
        .long   0
        .quad   0

Generated big-endian AArch64:
ld_st:                                  // @ld_st
        ld1     { v0.16b }, [x0]
        ld1     { v1.16b }, [x1]
        ld1     { v2.8h }, [x2]
        add     v0.16b, v1.16b, v0.16b
        rev16   v0.16b, v0.16b
        add     v0.8h, v2.8h, v0.8h
        st1     { v0.8h }, [x3]
        ret
by_value:                               // @by_value
        rev64   v0.16b, v0.16b
        rev64   v1.16b, v1.16b
        ext     v0.16b, v0.16b, v0.16b, #8
        ext     v1.16b, v1.16b, v1.16b, #8
        rev64   v2.8h, v2.8h
        add     v0.16b, v1.16b, v0.16b
        ext     v2.16b, v2.16b, v2.16b, #8
        rev16   v0.16b, v0.16b
        add     v0.8h, v0.8h, v2.8h
        rev64   v0.8h, v0.8h
        ext     v0.16b, v0.16b, v0.16b, #8
        ret
load_array:                             // @load_array
        ldr     q0, [x0]
        ret

Generated little-endian AArch64:
ld_st:                                  // @ld_st
        ldr     q0, [x0]
        ldr     q1, [x1]
        ldr     q2, [x2]
        add     v0.16b, v1.16b, v0.16b
        add     v0.8h, v2.8h, v0.8h
        str     q0, [x3]
        ret
by_value:                               // @by_value
        add     v0.16b, v1.16b, v0.16b
        add     v0.8h, v0.8h, v2.8h
        ret
load_array:                             // @load_array
        ldr     q0, [x0]
        ret

Generated x86_64:
ld_st:
        movdqa  xmm0, XMMWORD PTR [rdi]
        paddb   xmm0, XMMWORD PTR [rsi]
        paddw   xmm0, XMMWORD PTR [rdx]
        movaps  XMMWORD PTR [rcx], xmm0
        ret
by_value:
        paddb   xmm0, xmm1
        paddw   xmm0, xmm2
        ret
load_array:
        movdqa  xmm0, XMMWORD PTR [rdi]
        ret
Comment 88 Luke Kenneth Casson Leighton 2021-01-05 19:11:25 GMT
(In reply to Jacob Lifshay from comment #86)

> I'll create some illustrations today.

this will help.
 
> > now i have written them out those are the same thing.  i thought that one of
> > them might involve moing the dynamic byteswapping to be part of
> > MultiCompUnit, where the quantity of gates gets multiplied even more than it
> > already is.
> 
> The byte-swapping would be a pipeline stage right after the mux for
> selecting which FUs to execute,

there are going to be at least QTY 50 (fifty) 64 bit regfile ports; with each crossbar being around 2k gates, that's 100,000 gates if placed at the regfile ports.

there are going to be around... 60 to 80 FUs (most of those "laned") which means around 4x60 64 bit src operands plus another 60 dest ports.  5x60 = 300 operand ports.

this cost in gates, being now not at every regfile port but at every operand port, is almost an order of magnitude larger.

300 x 2000 is... 600,000 gates in byte-reversal crossbars.

it's now of the order of 600,000 gates if placed at the source and dest operands in each pipeline.

we are *not* doing this.
Comment 89 Luke Kenneth Casson Leighton 2021-01-05 19:26:17 GMT
(In reply to Jacob Lifshay from comment #87)
> Reopening since this needs more evaluation:

mmm no, it doesn't.  the decision's been made.  what would help would be to provide additional clarification and illustration of why we are not going to be doing dynamic bytereversal, and to move as quickly as possible to clarifying the spec.
 
> If we copy what VSX does

which we're not.  some time maybe in 2+ years we might add a microcoding layer on top which provides bare minimum compliance with zero priority on performance, but that is far, far into the future.

> then we'll have to implement the byteswapping in
> the ALUs, since that's what VSX does:

given the cost and complexity, 18 billion transistors, this is the strongest possible case *not* to add dynamic bytelevel byteswapping in front of either the regfile or in the ALUs.

even doing it as microcoding would cause such a catastrophic bottleneck on the bytereversing that we would be in danger of penalising the entire BE mode.

 
> Notice how the big-endian and little-endian Power code is identical --
> implying that the registers switch between big-endian and little-endian,
> however it changes on AArch64 (64-bit Arm) since they define the registers
> to be little endian only and need to insert explicit byteswapping
> instructions.

great.  that's a solution in software.  one that doesn't add over half a million gates in byteswapping (or penalise BE mode due to bottlenecking through a reduced-bandwidth microcoded resource)

this is very useful to illustrate how VSX does things, and it also allows the opportunity to underscore how deeply expensive this proposal really is.

we are not following IBM's path here.

re-closing this as invalid.

can we move on to clarification of the spec, covering the issues and ambiguities that Alexandre has raised?
Comment 90 Jacob Lifshay 2021-01-05 20:38:20 GMT
(In reply to Luke Kenneth Casson Leighton from comment #88)
> (In reply to Jacob Lifshay from comment #86)
> 
> > I'll create some illustrations today.

Done, with lots of pretty colors!

https://libre-soc.org/openpower/sv/byteswap/

> this will help.
>  
> > > now i have written them out those are the same thing.  i thought that one of
> > > them might involve moing the dynamic byteswapping to be part of
> > > MultiCompUnit, where the quantity of gates gets multiplied even more than it
> > > already is.
> > 
> > The byte-swapping would be a pipeline stage right after the mux for
> > selecting which FUs to execute,
> 
> there are going to be at least QTY 50 (fifty) 64 bit regfile ports, each
> crossbar being around 2k gates that's 100,000 gates if placed at the regfile
> ports.

As shown in the illustration linked above, it works just fine with 5*64=320 gates per 64-bit byte-swapper, waay less than 2k gates.

The latest proposal doesn't have anything added to reg-file ports; we'll just trap and have SW handle 64-bit byte-swapping of all the int/fp registers when changing the CPU between BE/LE modes.

Byte-swapping only occurs in ALUs/load/store -- anywhere an instruction operates on int/fp registers. All other registers (CRs and SPRs mostly) don't need byte swapping since they are not byte-addressable and are just kept in LE mode permanently.

> there are going to be around... 60 to 80 FUs (most of those "laned") which
> means around 4x60 64 bit src operands plus another 60 dest ports.  5x60 =
> 300 operand ports.

The mux goes *not at the FUs* but at the ALU after the operands are muxed in from the different FUs. So, just counting the 128-bit SIMD mul-add ALU (since I'm not 100% sure about the full list of ALUs/load/store/etc. we will have), it will be 3 inputs 1 output at 128-bit width, so that's 4 io * 5 gates * 128 bits = 2560 more gates for the whole ALU. Waay less than you expected. I expect the number of additional gates required for the whole core to be on the order of 10-20k.
Comment 91 Luke Kenneth Casson Leighton 2021-01-05 23:50:30 GMT
rright.  ok.  i took a look at the byteswap diagram: the PartitionedMul not doing arbitrary permutations throws a royal spanner in the works, and we can't spend 2 months fixing that.  so we go with the flow.

with arbitrary permutations out, the "aligned" byteswapping is ok.  the gate count of 4-in MUXes looks to be around 8 (possibly less) which times 64 is around 512 gates.

multiplied up to 4 operands (src123 and dest) and 50 FUs this is 100,000 gates.

this is just about borderline tolerable.

i am therefore de-marking this as invalid and instead putting it as deferred.  if we have time (or if it proves to be so heavy a penalty elsewhere not to have it), it goes in.
Comment 92 Jacob Lifshay 2021-01-05 23:57:35 GMT
(In reply to Luke Kenneth Casson Leighton from comment #91)
> rright.  ok.  i took a look at the byteswap diagram: the PartitionedMul not
> doing arbitrary permutations throws a royal spanner in the works, and we
> can't spend 2 months fixing that.  so we go with the flow.
> 
> with arbitrary permutations out, the "aligned" byteswapping is ok.  the gate
> count of 4-in MUXes looks to be around 8 (possibly less) which times 64 is
> around 512 gates.
> 
> multiplied up to 4 operands (src123 and dest) and 50 FUs this is 100,000
> gates.

Are you calculating based on 4 muxes per FU or 4 muxes per pipeline? I've been saying we should have the muxes be per-pipeline rather than per-FU, since that will reduce the gate count by a factor of 5 or so, getting it to the 10-20k per-cpu mentioned earlier.

I'll make another illustration.

> this is just about borderline tolerable.
> 
> i am therefore de-marking this as invalid and instead putting it as
> deferred.  if we have time (or if it proves to be so heavy a penalty
> elsewhere not to have it), it goes in.
Comment 93 Luke Kenneth Casson Leighton 2021-01-06 01:08:59 GMT
(In reply to Jacob Lifshay from comment #92)

> > multiplied up to 4 operands (src123 and dest) and 50 FUs this is 100,000
> > gates.
> 
> Are you calculating based on 4 muxes per FU or 4 muxes per pipeline?

tired.  supposed to be per pipeline.  however, the numbers on those, when doing 4-lane vectors (regs banked modulo 4) for anything above regnum r31, get pretty mental.

basically 4 lanes plus scalar, so all pipeline QTYs are multiplied by 5.  which means 5 lots of FMACs, probably 10 lots of Logical, 10 of ALU and so on.

nuts, eh? bit of overkill involved here given we were only aiming for 6 GFLOPs initially.  this will be more along the lines of... mmm.... 120.  1.5 ghz times 2 (FP32 SIMD) times 2 (MAC) times 5 (5 FPUs) times 4 (SMP cores).

whoops.  might run a bit hot, there

the other option: 12R8W regfiles.  we ain't doing that.
Comment 94 Luke Kenneth Casson Leighton 2021-01-06 12:58:18 GMT
thoughts on whether this is practical (at the ISA level)

* LD 32 bit (lw) brings in data, scalar
* scalar is supposed to be identical to vector of len VL=1
* LD behaviour CHANGES to "do not do byteswap, this is now ALU/regfiles job"
* unfortunately all v3.0B is 64 bit not 8, 16 or 32.

which means that to "work" there would need to be a LD width "tag" propagated to each regfile entry, just like in the Mill Architecture.

clearly that is unworkable (too radical a departure from OpenPOWER)

therefore the ALU/regfile bytereverse needs to be under EXPLICIT control of the ISA.

we are NOT chucking in 4 bits into svp64 to cover this (dest-rev, src1-rev, src2-rev)

however there is the Remap system and there is JUST about enough space there
https://libre-soc.org/openpower/sv/propagation/

the penalty (overhead) for setup of Remap however is a minimum of two 64-bit instructions. immediate thoughts on that: tough.

however the really nice thing about Remap is the individual control over which src and which dest registers get byte-reversed.

so the problem of LD/LDBRX bringing in data in one endian order then having to swap it back, all that goes away because Remap can say "ok the LD operation did the bytereverse on the src already, we just need to bytereverse the dest".

also Remap covers the order-inversion case of going from VL-1 down to 0... and *again*: that may be selectively applied to src and dest, arbitrarily.

the penalty for doing so is that it must go through the massive cyclic shifter, because the ordering does not match up with the backend ALU lanes.  again: tough.

all doable, then, with caveats.
Comment 95 Jacob Lifshay 2021-01-06 17:13:42 GMT
(In reply to Luke Kenneth Casson Leighton from comment #94)
> thoughts on whether this is practical (at the ISA level)
> 
> * LD 32 bit (lw) brings in data, scalar
> * scalar is supposed to be identical to vector of len VL=1
> * LD behaviour CHANGES to "do not do byteswap, this is now ALU/regfiles job"
> * unfortunately all v3.0B is 64 bit not 8, 16 or 32.
> 
> which means that to "work" there would need to be a LD width "tag"
> propagated to each regfile entry, just like in the Mill Architecture.

a tag is not actually needed: lw with elwidth=default does normal load byteswapping and sign/zero extension to 64-bits then byteswaps the 64-bit result to the cpu's current endian, making the effective register value identical to the OpenPower expected value.

lw with elwidth=32 does normal load byteswapping, then byteswaps the 32-bit value to the cpu's current endian, writing the result to the first/current (depending on scalar/vector on dest reg) vector element. This all is equivalent to copying 32-bits to the correct vector element in the dest reg in the cpu's current endian.

I do not see how either of those instructions would require a register tag.
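
(a rough C sketch, purely to pin down the sequencing being described above; all names are hypothetical, the bswap builtins are GCC/clang, and mem32_raw stands for the four memory bytes read in ascending address order into a little-endian host integer)

#include <stdint.h>

/* the elwidth=default case: normal v3.0B load byteswap, zero-extend
   to 64 bits, then byteswap the 64-bit result to the CPU's current
   endian, per the proposal above */
uint64_t lw_default(uint32_t mem32_raw, int cpu_is_le)
{
    uint32_t v = cpu_is_le ? mem32_raw : __builtin_bswap32(mem32_raw);
    uint64_t r = (uint64_t)v;           /* zero-extend to 64 bits */
    return cpu_is_le ? r : __builtin_bswap64(r);
}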

> clearly that is unworkable (too radical a departure from OpenPOWER)

I disagree; it matches how VSX works: a scalar float load single will convert to 64-bits and store that in the register in the cpu's current endian. a vsx scalar float load 32-bit will load 32-bits and write to the first vector element in the cpu's current endian.
a SIMD vector float load 32-bit will load 4 32-bit values in successive memory locations, writing them to the first through 4th vector element in the cpu's current endian.
Comment 96 Luke Kenneth Casson Leighton 2021-01-06 17:52:48 GMT
(In reply to Jacob Lifshay from comment #95)
> (In reply to Luke Kenneth Casson Leighton from comment #94)
> > thoughts on whether this is practical (at the ISA level)
> > 
> > * LD 32 bit (lw) brings in data, scalar
> > * scalar is supposed to be identical to vector of len VL=1
> > * LD behaviour CHANGES to "do not do byteswap, this is now ALU/regfiles job"
> > * unfortunately all v3.0B is 64 bit not 8, 16 or 32.
> > 
> > which means that to "work" there would need to be a LD width "tag"
> > propagated to each regfile entry, just like in the Mill Architecture.
> 
> a tag is not actually needed: lw with elwidth=default does normal load
> byteswapping and sign/zero extension to 64-bits then byteswaps the 64-bit
> result to the cpu's current endian, making the effective register value
> identical to the OpenPower expected value.
> 
> lw with elwidth=32 does normal load byteswapping, then byteswaps the 32-bit
> value to the cpu's current endian, writing the result to the first/current
> (depending on scalar/vector on dest reg) vector element. This all is
> equivalent to copying 32-bits to the correct vector element in the dest reg
> in the cpu's current endian.
> 
> I do not see how either of those instructions would require a register tag.

the problems start when trying to interact between the two.  elwidth!=default has now become "special", meaning that elwidth=default is left in the wrong byte order when compared to elwidth!=default.

that forces a "mess" of opcodes to perform unnecessary copying of data from scalar regs to vector regs... which will still *fail* on the 64 bit case because elwidth=default VL=1 copying of scalar into vector has no "meaning" - no continuation - of the byteswapping.

i.e. to get the *64* bit scalars to be copied into vectors in the correct order is simply not possible and would require a "tag".

this is what i meant by saying that SV's "identity behaviour" has been violated.

that being unacceptable, it will not work; consequently it requires explicit control over the bytereversing from the ISA, not as an implicit "special" violation.
Comment 97 Jacob Lifshay 2021-01-06 19:42:12 GMT
(In reply to Luke Kenneth Casson Leighton from comment #96)
> the problems start when trying to make the two interact. 
> elwidth!=default has now become "special", meaning that elwidth=default
> data is left in the wrong byte order when compared to elwidth!=default.
> 
> that forces a "mess" of opcodes to perform unnecessary copying of data
> from scalar regs to vector regs... which will still *fail* in the 64-bit
> case, because elwidth=default VL=1 copying of a scalar into a vector
> carries no "meaning" - no continuation - of the byteswapping.
> 
> i.e. getting the *64* bit scalars copied into vectors in the correct
> order is simply not possible and would require a "tag".
> 
> this is what i meant by saying that the SV "identity behaviour" has been
> violated.
> 
> that being unacceptable, it will not work, and consequently the
> bytereversing requires explicit control from the ISA, not an implicit
> "special" violation.

Well, all the above would be resolved if we treated scalar arguments as referring to the whole register, not just one element in a register, as I proposed (and originally assumed was the case). Scalar here means the register field marked scalar and subvl=1; it depends *only* on the bits in the svp64 prefix, not on VL or mask or anything else that is dynamically adjustable.
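
A sketch of that proposal (helper names hypothetical): the scalar/vector decision is made purely from static svp64 prefix bits, and a scalar operand always reads the whole 64-bit register:

#include <stdint.h>
#include <string.h>

static uint8_t regfile[128][8];     /* extended SV GPR file, raw bytes */

static uint64_t read_gpr64(int r) { /* scalar: the whole register */
    uint64_t v;
    memcpy(&v, regfile[r], 8);      /* LE host assumed, for brevity */
    return v;
}

/* vector: one sub-register element of ew_bytes width */
static uint64_t read_element(int r, int ew_bytes, int el) {
    uint64_t v = 0;
    int off = el * ew_bytes;
    memcpy(&v, &regfile[r + off / 8][off % 8], (size_t)ew_bytes);
    return v;
}

/* decided *only* by static prefix bits (scalar field, subvl),
 * never by VL or predicate masks */
uint64_t read_operand(int r, int prefix_scalar, int subvl,
                      int ew_bytes, int el) {
    if (prefix_scalar && subvl == 1)
        return read_gpr64(r);
    return read_element(r, ew_bytes, el);
}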
Comment 98 Jacob Lifshay 2021-01-06 19:45:01 GMT
(In reply to Jacob Lifshay from comment #97)
> Well, all the above would be resolved if we treated scalar arguments as
> referring to the whole register, not just one element in a register, as I
> proposed (and originally assumed was the case). Scalar here means the
> register field marked scalar and subvl=1; it depends *only* on the bits
> in the svp64 prefix, not on VL or mask or anything else that is
> dynamically adjustable.

It's the *combination* that resolves it: scalar args meaning "use the whole register", *combined with* registers being kept in the cpu's current endian mode.
Comment 99 Jacob Lifshay 2021-01-06 19:48:24 GMT
(In reply to Luke Kenneth Casson Leighton from comment #96)
> i.e. getting the *64* bit scalars copied into vectors in the correct
> order is simply not possible and would require a "tag".

64-bit values in registers are byteswapped into the cpu's current endian *regardless* of whether they are accessed as scalars or vector elements. byte-swapping is *not* limited to SV.
Comment 100 Luke Kenneth Casson Leighton 2021-01-06 20:12:13 GMT
(In reply to Jacob Lifshay from comment #97)

> Well, all the above would be resolved if we treated scalar arguments as
> referring to the whole register, not just one element in a register, as I
> proposed (and originally assumed was the case).

i explained why that would compromise SV.

the cases that need a full walkthrough - to understand why explicit control is needed rather than implicit context-sensitive reversal - are the interactions between:

* v3.0b ld
* v3.0b lw
* v3.0b st
* v3.0b stw
* v3.0b add (purely as an example)
* SV ld VL=1
* SV lw VL=1
* SV st VL=1
* SV stw VL=1
* SV add VL=1
* SV ld elwidth=32,destew=32 VL=1
* SV lw elwidth=32,destew=32 VL=1
* SV st elwidth=32,destew=32 VL=1
* SV stw elwidth=32,destew=32 VL=1
* SV add elwidth=32,destew=32 VL=1
* SV ld destew=32 VL=1
* SV lw destew=32 VL=1
* SV st destew=32 VL=1
* SV stw destew=32 VL=1
* SV add destew=32 VL=1
* SV ld elwidth=32 VL=1
* SV lw elwidth=32 VL=1
* SV st elwidth=32 VL=1
* SV stw elwidth=32 VL=1
* SV add elwidth=32 VL=1

and then VL=2 or 3 or 4.

the interaction between all of those - all permutations (all 55 of them) - needs to be thought through, every single one.
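
a skeleton of what enumerating that matrix even looks like (illustrative only: the variant list is abbreviated, and the exact grouping that yields the 55 pairings is not reproduced here):

#include <stdio.h>

/* pair every variant with every other; each pairing then needs a
 * hand walkthrough plus a unit test. the full set is the 25 cases
 * above, then again with VL=2, 3, 4. */
static const char *variants[] = {
    "v3.0b ld", "v3.0b lw", "v3.0b st", "v3.0b stw", "v3.0b add",
    "SV ld VL=1", "SV lw VL=1", "SV st VL=1", "SV stw VL=1",
    "SV add VL=1",
    /* ... plus the elwidth=32 / destew=32 variants ... */
};

int main(void) {
    int n = (int)(sizeof variants / sizeof *variants);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            printf("walkthrough needed: %s  with  %s\n",
                   variants[i], variants[j]);
    return 0;
}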

the sheer scope of that should give you some idea as to why, when we are a year behind, i have been requesting again and again that we stop trying to get more of these deep-dive features into SV.

peripheral ones that go in a tiny portion of the spec and use an existing bit of hardware (the 1<<r3 idea): no problem.

full-on fundamental redesigns of the regfile like this have to stop.
Comment 101 Jacob Lifshay 2021-08-16 17:55:43 BST
lkcl says he's fine with it as long as he doesn't have to do the work:
https://libre-soc.org/irclog/%23libre-soc.2021-08-16.log.html#t2021-08-16T17:45:38

> programmerjake:
> so, in other words, your fine with the idea if you don't have to implement it?
>
> lkcl:
> because i can't cope with it, mentally, given the logic-dyslexia and the
> complexity of everything else, basically, yes.
>
> programmerjake:
> sounds good to me! I'd guess lxo will agree too...
>
> lkcl:
> clarification: i'm not "totally fine with it" until it's actually *proven*
> and demonstrated not to be so massively intrusive and complex that it
> damages our chances of adoption and completion of the milestones set to date
>
> or results in unintended consequences well beyond "it's just a simple swap
> at the regfile"
>
> programmerjake:
> ok
Comment 102 Luke Kenneth Casson Leighton 2021-08-16 18:45:40 BST
(In reply to Jacob Lifshay from comment #101)
> lkcl says he's fine with it as long as he doesn't have to do the work:
> https://libre-soc.org/irclog/%23libre-soc.2021-08-16.log.html#t2021-08-16T17:45:38

no, i did not explicitly say that.  i have a very good idea of exactly how much
work is involved and i am not happy about it. i still have to be the one that
assesses its impact, and makes sure that the person doing the work
actually does a full, thorough and complete job. 

that still leaves me with a burden of responsibility for something that
i know will take a hell of a lot of work, and risks damaging SVP64 by making
it virtually impossible to understand.

also i said that the task has to include a full and complete comparative
analysis against using the existing solution (LDST-bytereverse (ldbrx)
when Vectorised, plus the Bytereverse mode of REMAP).

in addition to that, we cannot keep on adding features (especially
high-impact, fundamental, low-level ones like this, which take up huge
amounts of time even just to assess).

this one remains firmly *off* the table until we have time and resources to
properly assess it and ensure it does not completely destroy 3+ years of
work.

bottom line is i am not in the least bit happy with this, i do not like
it, and i do not like that i cannot understand its impact well enough,
precisely because it is so complex and low-level.

thus it shall remain deferred.
Comment 103 Jacob Lifshay 2021-08-16 19:02:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #102)
> (In reply to Jacob Lifshay from comment #101)
> > lkcl says he's fine with it as long as he doesn't have to do the work:
> > https://libre-soc.org/irclog/%23libre-soc.2021-08-16.log.html#t2021-08-16T17:45:38
> 
> no, i did not explicitly say that.  i have a very good idea of exactly how
> much
> work is involved and i am not happy about it. i still have to be the one that
> assesses its impact, and makes sure that the person doing the work
> actually does a full, thorough and complete job.

it's worth pointing out that you are not the only one on this project: others can/will assess the impact too (whoever is implementing this -- probably me -- as well as probably lxo, considering how much he was pushing for this)

> that still leaves me with a burden of responsibility for something that
> i know will take a hell of a lot of work, and risks damaging SVP64 by making
> it virtually impossible to understand.

it changes how sub-registers are addressed; everything else remains unchanged. Imho this is far from the most confusing part of SV (the most confusing part for me is probably vertical-first mode, or maybe the other FFT stuff).

> also i said that the task has to include a full and complete comparative
> analysis against using the existing solution (LDST-bytereverse (ldbrx)
> when Vectorised, plus the Bytereverse mode of REMAP).
> 
> in addition to that we cannot keep on adding features (especially high
> impact fundamental low-level ones like this which take up huge amounts
> of time even just to assess).

Well, I do think we will want this feature in the end (the ISA WG may also demand it for consistency with the rest of the ISA), and the longer we put it off, the harder it will be to implement and the more work will be wasted trying to work around its absence.

Therefore, I think it's worth me spending a few weeks trying to implement it in the simulator and testing it out, etc.
Comment 104 Luke Kenneth Casson Leighton 2021-08-16 19:08:26 BST
(In reply to Alexandre Oliva from comment #5)
>
> 
> GCC seems to regard vector types just like arrays, when it comes to memory
> layout, so indexing it operates like indexing arrays.  this does mean,
> however, that loading the vectors above from memory into a scalar 64-bit
> register will land element [0] at opposite ends depending on endianness. 

1) opposite end of what? the 64-bit register? (this does not happen)
or do you mean the entire array?
considering the whole array to have element order 7 6 5 4 3 2 1 0 rather
than 0 1 2 3 4 5 6 7 is a completely different matter from
the endianness of the *elements themselves*.

a "reverse gear" bit however has been added to SVP64 which
allows elements to be loaded in reverse order (VL-1 downto 0)

2) by defining the regfile strictly as the typedef union says, when
considered as a LE system, the order is fixed and easy to understand.

when elwidth=default (64), each element does **NOT** end up in the "wrong"
order at all.  the use of the typedef union shows that the elements are
loaded as 64 bits, onto 64-bit register boundaries.

when elwidth=32, a Vectorised LD will load element 0 into the bottom
32 bits (LSBs) of the first 64-bit register, and element 1 into the top
32 bits (MSBs) of the first 64-bit register.  if BE mode is enabled, each
byte OF THE 32-BIT QUANTITY will be reversed by the LD operation.
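
a sketch of that union view and LD placement, assuming the regfile is defined strictly LE and the model runs on an LE host (names illustrative; __builtin_bswap32 is the gcc builtin):

#include <stdint.h>

typedef union {      /* one 64-bit register, defined as strictly LE */
    uint8_t  b[8];   /* elwidth=8:  elements 0..7                   */
    uint16_t h[4];   /* elwidth=16: elements 0..3                   */
    uint32_t w[2];   /* elwidth=32: [0] = bottom 32 LSBs,
                        [1] = top 32 MSBs                           */
    uint64_t d;      /* elwidth=default (64)                        */
} sv_reg_t;

/* Vectorised LD, elwidth=32, VL=2: element 0 into the LSBs, element
 * 1 into the MSBs; in BE mode each 32-bit quantity is byte-reversed
 * by the LD itself. memory shown as host-order words for brevity. */
void sv_ld_ew32_vl2(sv_reg_t *rt, const uint32_t *ea, int msr_be) {
    for (int el = 0; el < 2; el++)
        rt->w[el] = msr_be ? __builtin_bswap32(ea[el]) : ea[el];
}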

if that was not desired, just use ldbrx instead. that's what ldbrx is for.

following on from there, and assuming that the subsequent instructions
are also elwidth=32, the elements are correctly accessed in exactly
the same element order as when they were inserted by the Vector LD
operation.

if you didn't use Vectorised ldbrx to invert the data then that is your lookout.

if people want this feature added, it has to be properly evaluated, with an
explanation of why use of ldbrx and the SV REMAP bytereverse mode is not
adequate.
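
for comparison, the existing alternative is a per-element byte-reversed load; a rough model (flat reg array and names illustrative, with a host bswap standing in for the real ldbrx semantics):

#include <stdint.h>
#include <string.h>

/* Vectorised ldbrx sketch: each 64-bit element is loaded with its
 * bytes explicitly reversed relative to a normal load */
void sv_ldbrx(uint64_t *regs, const uint8_t *ea, int vl) {
    for (int el = 0; el < vl; el++) {
        uint64_t v;
        memcpy(&v, ea + 8 * el, 8);       /* raw 8 bytes        */
        regs[el] = __builtin_bswap64(v);  /* explicit reversal  */
    }
}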

however it will be given the absolute lowest possible priority unless a
compelling reason is demonstrated otherwise.
Comment 105 Luke Kenneth Casson Leighton 2021-08-16 19:31:45 BST
(In reply to Jacob Lifshay from comment #103)

> it's worth pointing out that you are not the only one on this project,

i am however the only one actually doing full-time work.
if you had been helping out over the past 2 years, as i have been
asking for 2 years that you do, i would be much more inclined
to listen, because i would feel that i had the time to do so.


> it changes how sub-registers are addressed, 

which is low-level in the extreme and thus requires an extremely thorough
assessment.

> everything else remains
> unchanged. Imho this is far from the most confusing part of SV (the most
> confusing part for me is probably vertical-first mode, or maybe the other
> FFT stuff).

it took me 2 years to realise what Mitch Alsup was talking about.
REMAP is a hardware function which remaps the element order *and*
can byte-reverse the access to element data from the regfile.
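
roughly (function names hypothetical), REMAP sits at the regfile access path, applying an element reorder plus an optional, explicit byte-reverse:

#include <stdint.h>

/* explicit, opt-in transform at the regfile read port: remap the
 * element index, then optionally byte-reverse the element data.
 * contrast with the implicit always-on reversal being debated. */
static uint32_t remap_read32(const uint32_t *vec, int i,
                             int (*remap)(int),  /* element reorder */
                             int bytereverse) {  /* brev mode       */
    uint32_t v = vec[remap(i)];
    return bytereverse ? __builtin_bswap32(v) : v;
}

static int identity(int i) { return i; }  /* trivial remap, for testing */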

> 
> > also i said that the task has to include a full and complete comparative
> > analysis against using the existing solution (LDST-bytereverse (ldbrx)
> > when Vectorised, plus the Bytereverse mode of REMAP).
> > 
> > in addition to that we cannot keep on adding features (especially high
> > impact fundamental low-level ones like this which take up huge amounts
> > of time even just to assess).
> 
> Well, I do think we will want this feature in the end (the ISA WG may also
> demand it for consistency with the rest of the ISA), 

i am not going to let the VSX mindset poison SVP64.  just because IBM added
something in VSX does *not* automatically make it a good idea to add to SVP64.
far from it: i take it as a sign that it should NOT go into SVP64.

> and the longer we put
> this off, the harder it is to implement and the more work will be wasted
> trying to work around the lack of this feature.

ldbrx.

REMAP.

please take the time to properly assess those first.

aiui REMAP actually already does exactly what you want: it performs
an EXPLICIT byte-reverse at the element level on register read and
register write.

aiui what you are asking for is IMPLICIT bytereversal, which makes
me extremely nervous, as it has a hell of a large impact.

> Therefore, I think it's worth me spending a few weeks trying to implement in
> the simulator and testing it out, etc.

i suggest that you first help by implementing the byte-reverse mode of REMAP.
this will also help you to understand REMAP itself.

then you will be in a strong position to understand it, and will not
waste time implementing something that turns out to be unnecessary
only after several weeks of work.

also once REMAP byte-reverse is implemented it will be *possible* to
hook in to test what you want to test.

also worth pointing out: elwidth overrides are needed first, long before
this work can be attempted.

without elwidth overrides there is the risk that unit tests would not
pick up errors where data ended up in the wrong byte.

given the frickin MSB0 ordering of the Power ISA, the probability of that
occurring is extremely high.

so.

task order is:

1) elwidth overrides (pseudocode, Simulator)

2) huge paranoid number of elwidth unit tests

3) bytereverse REMAP mode (on both read and write to regfile)

4) implicit bytereverse mode.