553 – svp64 register mapping to accommodate AltiVec vectors expanding fp registers

Summary: svp64 register mapping to accommodate AltiVec vectors expanding fp registers

Status:	RESOLVED INVALID

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:

Depends on:
Blocks:

Reported:	2020-12-23 18:58 GMT by Jacob Lifshay
Modified:	2022-06-13 17:14 BST
CC List:	2 users

See Also:	213 558
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Description Jacob Lifshay 2020-12-23 18:58:49 GMT

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/svp_rewrite/svp64.mdwn;h=18c885458f32f0c130007a106c5c4fe735a9344c;hb=c51a4914a4c5e39ec939df255917fb093d70ac61

Comment 1 Jacob Lifshay 2021-01-13 20:16:15 GMT

(In reply to Luke Kenneth Casson Leighton from bug #558 comment #57)
> (In reply to Jacob Lifshay from bug #558 comment #56)
> 
> > So, to be clear, you're advocating for not using the scheme I proposed just
> > now, or not using the scheme I proposed 18 months ago as part of the SVP for
> > RISC-V spec?
> 
> i'd really like to use both (dynamically), that was what the CR8x8 matrix
> concept was.  there is room to overload elwidth to do it... however the
> implications for the DMs are so complex that it would be foolish to try as a
> first iteration.
> 
> given that if we *don't* use vertical numbering on CRs we are forced instead
> to add a 1 year delay on the critical path it is clearly unacceptable to use
> the SVP scheme... for CRs

I would argue that we should use vertical numbering for all int/fp/cr register files since that makes for nice consistency as well as having benefits for register allocation.

you could think of it as extending OpenPower v3.1 scalar to have 4-reg vectors at every int/fp reg and 8-element vectors at every CR field.

This means we don't have to extend gcc's register allocator to handle ranges for the MVP, *saving several months time*. this will limit gcc for now to handling 8-element vectors or 256-bit vectors, whichever is smaller.
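
The limit described above can be worked through as arithmetic. This is a minimal sketch, assuming the 256-bit figure comes from four 64-bit registers per vector; the function name `max_vector_elements` and its parameters are hypothetical and only illustrate the "whichever is smaller" rule.

```python
def max_vector_elements(elwidth_bits, regs_per_vector=4, reg_bits=64,
                        max_elements=8):
    """Largest vector length under the limit described in the comment:
    8 elements or 256 bits, whichever is smaller.

    Assumption of this sketch: 256 bits = 4 registers x 64 bits.
    """
    vector_bits = regs_per_vector * reg_bits
    return min(max_elements, vector_bits // elwidth_bits)


# Show the limit for each element width
for elwidth in (8, 16, 32, 64):
    print(elwidth, max_vector_elements(elwidth))
```

Under these assumptions only 64-bit elements hit the 256-bit ceiling (4 elements); narrower element widths are capped by the 8-element limit instead.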

> given that it is clearly unacceptable to completely cut off entire swathes
> of the regfile from scalar operations

that's not a valid reason to prefer horizontal since both horizontal and vertical schemes cut off an equal number of registers.

> forcing the use of convoluted
> predicated mv operations if we *do* use vertical numbering on FP and Int
> operations it is clearly unacceptable to use the vertical numbering
> scheme... for FP and INT.

We don't realistically need that many scalar registers, 64 is (more than?) sufficient. the 128 are needed for vector purposes.

> conclusion: vertical numbering for CRs (reluctantly), horizontal numbering
> for INT and FP.

I disagree for the above reasons.

Comment 2 Luke Kenneth Casson Leighton 2021-01-14 13:38:39 GMT

(In reply to Jacob Lifshay from comment #1)

> I would argue that we should use vertical numbering for all int/fp/cr
> register files since that makes for nice consistency as well as having
> benefits for register allocation.

jacob: i already made it clear that the complexity in understanding the fractional numbering is too high.  it was almost two weeks before i understood it.

the only reason for considering it for CRs is because we're forced to.  and CRs are Hell anyway, with the low 2 bits not being incremented through.


> you could think of it as extending OpenPower v3.1 scalar to have 4-reg
> vectors at every int/fp reg and 8-element vectors at every CR field.
> 
> This means we don't have to extend gcc's register allocator to handle ranges
> for the MVP, *saving several months time*.

i know.  i'm not happy about it.  but:
a) tough.  the hardware is too complex.  i have said it five or six times now: i am not redesigning the routing on the regfile.

b) there exists some explicit control over gcc fp/int regs where there is NONE on CRs.  as a concept they do not exist AT ALL in the frontend.  we are therefore FORCED to reluctantly use an alternative scheme.

> this will limit gcc for now to
> handling 8-element vectors or 256-bit vectors, whichever is smaller.

you are completely forgetting about the hardware design.  i do not want at this incredibly late stage to THINK about any kind of FP/INT redesign involving renumbering.

the only reason i'm considering it at all is because CRs are only 4 bits.  those can be batched up in groups of 8 which is only 32 wires.

the DMs are going to be shit, basically.  supporting mfcr is going to be a huge penalty on performance and require a rewrite of the whole CR pipeline.


 
> > given that it is clearly unacceptable to completely cut off entire swathes
> > of the regfile from scalar operations
> 
> that's not a valid reason to prefer horizontal since both horizontal and
> vertical schemes cut off an equal number of registers.

it's not about the quantity.

the way in which they are cut off forces the use of additional expensive instructions.


> > forcing the use of convoluted
> > predicated mv operations if we *do* use vertical numbering on FP and Int
> > operations it is clearly unacceptable to use the vertical numbering
> > scheme... for FP and INT.
> 
> We don't realistically need that many scalar registers, 64 is (more than?)
> sufficient. the 128 are needed for vector purposes.

having a system where interaction between the two PUNISHES developers for doing so is not going to fly.

i'm not running that by the OPF ISA WG.

i am really sorry, this one is also invalid.

we are under time pressure and we are wasting time discussing this.

we are not going to be adding 128 bit or VSX any time in the next 2+ years.

the INT/FP regfile design and routing is extremely complex, was done months ago, is *specifically* targeted at 64 bit and cannot change.

none of us can earn any donations from NLnet for continuing to discuss this.

can we please stop discussing this, i am getting very fed up of repeating myself, and getting very concerned that i am not earning any money for having to repeatedly go over something that is not going to happen.

can we PLEASE move on to implementation.

you keep putting me under huge pressure repeatedly by asking again and again for something that i have already said no on multiple times.  i appreciate that you want to do a full investigation but this is getting too much.  we HAVE to stop, i cannot cope.

Comment 3 Jacob Lifshay 2021-01-14 18:45:44 GMT

Ahh, so gcc already supporting contiguous register ranges in the register allocator combined with avoiding reworking existing instructions/ABIs in gcc that use register pairs finally sounds like a good enough reason to me to not implement #553. Let's just hope it can efficiently allocate large ranges without n^2 or n^3 runtime :)

Comment 4 Luke Kenneth Casson Leighton 2021-01-15 00:29:04 GMT

apologies, jacob, the information i am holding in my head (unimplemented) is becoming beyond my capacity to explain.  coupled with the difference between electrical and chemical neural recall (chemical is long-term and often difficult to access) i am basically getting symptoms best known by the phenomenon "writer's block".

in essence i "know" something is "not right" but am literally unable to say why because my memory recall is not responding immediately, and due to the length of time that has gone by on some of the details (2 years) it may actually be *several days* before details emerge sufficiently to be able to *begin* to describe them to you.

bottom line is that when you push and push and push basically demanding an exact and precise response *i cannot give you one* and this is terribly frustrating for me, not to be able to speak and give you the "exact" answer that you expect.

the only way that this is going to work is if implementation proceeds *right now*, without further delay, getting the core details out into code that can be reviewed, understood, and incrementally adjusted accordingly.

Comment 5 Jacob Lifshay 2021-01-15 00:47:24 GMT

(In reply to Luke Kenneth Casson Leighton from comment #4)
> the only way that this is going to work is if implementation proceeds *right
> now*, without further delay, getting the core details out into code that can
> be reviewed, understood, and incrementally adjusted accordingly.

Ok, then we should start implementing stuff! if you write it all out in code, it will likely become easier to think about! Writing it out can be one of the ways to think through the consequences.

The changes that implementing this bug report would require are pretty localized to the decoder anyway, so, if we decide that we need it (which we may want despite the extra work in gcc due to needing to efficiently support 128-bit data values for AES/SHA256/etc.), it should be pretty easy to add on afterwards.

One future option (that you don't need to think about now -- just start writing code) is to instead conceptually base the SV isa on 128-bit registers, where all integer and fp registers are rearranged into 64 128-bit registers (instead of 128 64-bit registers) and standard 64-bit scalar operations just operate on the lower 64-bits. Vector operations tack a sequence of 128-bit registers together to form the backing storage for SV vectors.
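
A rough model of that future option, as a sketch only. It assumes a file of 64 registers, each 128 bits wide, with vector elements packed from the low bits of the first register upwards; that element ordering, and the names `scalar_read` and `vector_read`, are assumptions made for illustration and are not defined anywhere in the spec.

```python
MASK64 = (1 << 64) - 1


def scalar_read(regs, n):
    """64-bit scalar view: only the lower 64 bits of 128-bit register n."""
    return regs[n] & MASK64


def vector_read(regs, start, elwidth, vl):
    """Read `vl` elements of `elwidth` bits from consecutive 128-bit
    registers beginning at `start`.

    Assumption: elements are packed lowest-bits-first, and elwidth
    divides 128 so that no element straddles two registers.
    """
    mask = (1 << elwidth) - 1
    elements = []
    for i in range(vl):
        bit = i * elwidth
        reg = start + bit // 128
        shift = bit % 128
        elements.append((regs[reg] >> shift) & mask)
    return elements


regs = [0] * 64
regs[0] = (7 << 64) | 5
print(scalar_read(regs, 0))         # 5
print(vector_read(regs, 0, 64, 2))  # [5, 7]
```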

I'm marking this bug as deferred, since we *do* need to think about it later, just not right now, instead of saying we'll never do it.

We can defer this till after we have an initial working cpu design.

Comment 6 Luke Kenneth Casson Leighton 2022-06-12 16:15:28 BST

closing as invalid. a future version of SVP64 on top of quad precision instructions (excluding all packed simd instructions) is more appropriate and is nothing to do with this invalid mapping concept.

Comment 7 Jacob Lifshay 2022-06-12 19:15:18 BST

(In reply to Luke Kenneth Casson Leighton from comment #6)
> closing as invalid. a future version of SVP64 on top of quad precision
> instructions (excluding all packed simd instructions) is more appropriate
> and is nothing to do with this invalid mapping concept.

it's still valid, because those quad-precision instructions are already defined to use vsx registers. we will still want some kind of mapping because we want the scalar 128-bit instructions to use the exact same registers as are used for 128-bit vector elements.

As currently defined, svp64 is critically incompatible due to register layout with svp64 on top of the existing 128-bit scalar instructions, assuming svp64 uses contiguous element packing and doesn't skip high-64-bit-halves of the 128-bit fp registers that you get from existing 128-bit openpower scalar instructions.

e.g.:
(f128 scalar add with svp64 vector prefix)
sv.xsaddqp vsr0.v, vsr32.v, vsr64.v with elwidth=128 uses:

+-----+-----+-----+-----+
|  f0 |  f1 |  f2 |  f3 | fp register number
+-----+-----+-----+-----+
| el0 | el1 | el2 | el3 | 64-bit high half of scalar
+-----+-----+-----+-----+
| el0 | el1 | el2 | el3 | 64-bit low half of scalar
+-----+-----+-----+-----+

(f64 add)
sv.fadd f0.v, f32.v, f64.v with elwidth=64 according to current svp64 uses (wasting the high-half of the 128-bit registers, unless remapped):

+-----+-----+-----+-----+
|  f0 |  f1 |  f2 |  f3 | fp register number
+-----+-----+-----+-----+
| --- | --- | --- | --- | 64-bit high half of scalar
+-----+-----+-----+-----+
| el0 | el1 | el2 | el3 | 64-bit low half of scalar
+-----+-----+-----+-----+

wasting the high-half isn't acceptable imho because we'd need 2x as big of a register file for the uncommon uses of f128 or 128-bit integers -- therefore the high half should instead be some other 64-bit registers so svp64 can use it for vectors of 8/16/32/64-bit elements -- aka. remapping.

I've proposed the high halves of f0-63 be f64-127.
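
A minimal sketch of that proposal, assuming a flat file of 128 64-bit registers f0-f127. The function names `halves_of_128bit_reg` and `read_128` are hypothetical; the sketch only shows which 64-bit registers would back each 128-bit register under this mapping.

```python
def halves_of_128bit_reg(n):
    """Return (low_half_reg, high_half_reg) for 128-bit register n,
    under the proposed mapping where the high half of f<n> is f<n+64>."""
    if not 0 <= n < 64:
        raise ValueError("128-bit register number must be in 0..63")
    return n, n + 64


def read_128(regfile, n):
    """Assemble a 128-bit value from two 64-bit entries of `regfile`
    (a list of 128 integers, each below 2**64)."""
    low, high = halves_of_128bit_reg(n)
    return (regfile[high] << 64) | regfile[low]


regfile = [0] * 128
regfile[3] = 0x1111    # low half of 128-bit register 3
regfile[67] = 0x2222   # high half of 128-bit register 3
print(halves_of_128bit_reg(3))
print(hex(read_128(regfile, 3)))
```

With this layout the registers f64-f127 stay available as ordinary 64-bit registers for vectors of 8/16/32/64-bit elements, which is the point of the remapping.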

Comment 8 Jacob Lifshay 2022-06-12 19:21:38 BST

(In reply to Jacob Lifshay from comment #7)
> it's still valid, because those quad-precision instructions are already
> defined to use vsx registers. we will still want some kind of mapping
> because we want the scalar 128-bit instructions to use the exact same
> registers as are used for 128-bit vector elements.

marking as valid.

Comment 9 Luke Kenneth Casson Leighton 2022-06-12 22:21:04 BST

(In reply to Jacob Lifshay from comment #7)

> I've proposed the high halves of f0-63 be f64-127.

i know.  it's not happening.  and it is completely unnecessary.
when someone puts SVP64 on top of Scalar VSX, it happens automatically
and implicitly because the VSX registers already have the mapping
that you describe.

please leave this bugreport as closed and invalid.

Comment 10 Jacob Lifshay 2022-06-13 05:48:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #9)
> (In reply to Jacob Lifshay from comment #7)
> 
> > I've proposed the high halves of f0-63 be f64-127.
> 
> i know.  it's not happening.  and it is completely unnecessary.

I strongly disagree:

basically, there's two choices for what to do with the high 64-bits of the vsx registers:
1. 8/16/32/64-bit svp64 vectors use them -- this is incompatible with svp64 without vsx registers because which registers elements map to changes:
   f32x8 starting at f0:
   with vsx:
   el0-3 -- vsx reg 0 (f0)
   el4-7 -- vsx reg 1 (f1)
   without vsx:
   el0-1 -- f0
   el2-3 -- f1
   el4-5 -- f2
   el6-7 -- f3

   so if a program wants to use a scalar operation to read elements 4-5 of the vector, does it use f2 and break if the cpu has vsx or use f1 and break if the cpu doesn't have vsx?!

2. 8/16/32/64-bit svp64 vectors don't use the high halves of the vsx registers -- you waste half your register file because you need 128 registers and the registers have to be 128-bit but all but 128-bit operations can only pack their vector elements in the low halves of the registers!
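
The mismatch in option 1 can be reproduced with a small sketch. It assumes contiguous element packing starting at the given register; the helper name `element_to_reg` and its parameters are hypothetical and not taken from the SVP64 spec.

```python
def element_to_reg(start_reg, elwidth, reg_width, index):
    """Return the register number holding element `index`, assuming
    elements are packed contiguously from `start_reg` (an assumption
    of this sketch, not a statement of the SVP64 spec)."""
    elements_per_reg = reg_width // elwidth
    return start_reg + index // elements_per_reg


# f32x8 starting at f0
without_vsx = [element_to_reg(0, 32, 64, i) for i in range(8)]
with_vsx = [element_to_reg(0, 32, 128, i) for i in range(8)]

print(without_vsx)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(with_vsx)     # [0, 0, 0, 0, 1, 1, 1, 1]
```

Elements 4-5 land in f2 with 64-bit registers but in register 1 with 128-bit registers, which is the ambiguity a scalar read of those elements would run into.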

> when someone puts SVP64 on top of Scalar VSX, it happens automatically
> and implicitly because the VSX registers already have the mapping
> that you describe.

That's news to me, iirc we never changed the spec to state that f64-f127 ever mapped to any existing registers.

if they don't, we still have exactly the problem i described above.

Comment 11 Luke Kenneth Casson Leighton 2022-06-13 10:31:30 BST

future problem, not our problem, VSX is completely out of scope.
we are not expanding SVP64 FP registers for the purposes of fitting
with VSX. any attempt to do so will poison SVP64.
SVP64 registers are expanded in a way that is easy to understand
and works with GPUs.

Intel and VSX retrofitted FP into a quadrant of Packed SIMD registers
in a way that has absolutely nothing to do with Vector ISAs, and is
based entirely on 32 FP / 64 VSX/AVX numbering.  that is THEIR PROBLEM
to deal with the fall-out of that decision.

please drop this discussion, we are NEVER going to poison or damage
SVP64 for the purposes of fitting with a legacy 20 year old broken
concept.

if people want to do harm by retrofitting with a broken Packed SIMD ISA
concept they are perfectly well at liberty to do so on their own time
and at their own expense.

we have TOO MUCH ELSE TO DO.

any further discussion is wasting our time, energy and resources.

please drop this matter and consider it permanently closed.

Comment 12 Jacob Lifshay 2022-06-13 17:12:33 BST

(In reply to Luke Kenneth Casson Leighton from comment #11)
> future problem, not our problem, VSX is completely out of scope.
> we are not expanding SVP64 FP registers for the purposes of fitting
> with VSX. any attempt to do so will poison SVP64.
> SVP64 registers are expanded in a way that is easy to understand
> and works with GPUs.

well, vsx registers are used for 128-bit scalar operations (f128, i128), so, unless you want to propose a completely separate set of 128-bit arithmetic operations (which imho will never fly with the openpower foundation because *they already have them*), by your deciding to not make any attempt to interoperate with vsx registers, we are effectively permanently locked out of 128-bit arithmetic instructions, which imho is foolish.

Comment 13 Jacob Lifshay 2022-06-13 17:14:06 BST

(In reply to Jacob Lifshay from comment #12)
> well, vsx registers are used for 128-bit scalar operations (f128, i128), so,
> unless you want to propose a completely separate set of 128-bit arithmetic
> operations (which imho will never fly with the openpower foundation because
> *they already have them*), by your deciding to not make any attempt to
> interoperate with vsx registers, we are effectively permanently locked out
> of 128-bit arithmetic instructions, which imho is foolish.

this means just vsx registers, not the whole of vsx/vmx.