960 – Final Report on OPF ISA External RFC ls003 - maddedu, maddedus, divmod2du, dsld, dsrd

Bug 960 - Final Report on OPF ISA External RFC ls003 - maddedu, maddedus, divmod2du, dsld, dsrd

Summary: Final Report on OPF ISA External RFC ls003 - maddedu, maddedus, divmod2du, ds...

Status:	IN_PROGRESS

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification (show other bugs)
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:	https://libre-soc.org/openpower/sv/rf...

Depends on:	1010
Blocks:	952
	Show dependency tree / graph

Reported:	2022-10-20 22:32 BST by Luke Kenneth Casson Leighton
Modified:	2023-11-04 22:44 GMT (History)
CC List:	3 users (show)

See Also:	817 1013 952 1010 1029
NLnet milestone:	NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:	1013
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2022-10-20 22:32:52 BST

external RFC for bigint instructions. report needed on completion.

WARNING, SOME OF THIS WORK PRE-DATES 25 OCT 2022 AND CANNOT RECEIVE
FUNDING UNDER 2022-08-051. unfortunately.  work dated after 25oct2022
however can.

Comment 1 Luke Kenneth Casson Leighton 2022-10-20 22:36:03 BST

https://libre-soc.org/irclog/%23libre-soc.2022-10-20.log.html#t2022-10-20T20:12:02

lkcl	octavius, the language of the RFC needs to be more "x is added to y and placed into z"	20:12
lkcl	the wording of maddld (etc) can be used as a template	20:12
lkcl	btw we need a bugreport for it	20:13
lkcl	https://bugs.libre-soc.org/show_bug.cgi?id=944 is for ls002	20:13
lkcl	so, maddhd says:	20:14
lkcl	"The 64-bit operands are (RA), (RB), and (RC). The	20:14
lkcl	128-bit product of the operands (RA) and (RB) is	20:14
lkcl	added to (RC). The high-order 64 bits of the 128-bit	20:14
lkcl	sum are placed into register RT.	20:14
lkcl	"	20:14
lkcl	therefore, we literally cut-and-paste that text	20:14
lkcl	and	20:14
lkcl	add	20:14
lkcl	"The low-order 64-bits of the 128-bit sum are placed into register RC"	20:15
lkcl	also, you removed the tag	20:17
lkcl	which ensures that ls003 is missing from this auto-generated page: https://libre-soc.org/openpower/sv/rfc/	20:17

Comment 2 Jacob Lifshay 2022-10-21 07:53:16 BST

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls003.mdwn;h=1f49463ff167180354770ceb5ddc100de7e41edd;hb=6e6902d0bfa700ccab5d640cb55d6867eff52547#l129

 129 RS is implictly defined as the same register as RC.
 130 
 131 *Programmer's Note:
 132 As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.

RS being defined as the same register as RC when non-svp64-prefixed is *not* how RS is defined currently, currently RS is defined to be RT+1.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger.mdwn;h=ca6065eb035f370b0d45c1e8e62bf783eb33023d;hb=6e6902d0bfa700ccab5d640cb55d6867eff52547#l163

If we want to change the definition where RS=RC even without svp64, that's fine with me, but that should be done everywhere, rather than just in the rfc. Also, that should have been discussed on the mailing list or bug-tracker (if it was, sorry i forgot).

If we don't want to change RS=RT+1, then all mentions of RS=RC need to be deleted from the rfc, except when describing svp64's behavior or if describing an alternative to what we currently have.

Comment 3 Andrey Miroshnikov 2022-10-21 13:19:56 BST

(In reply to Luke Kenneth Casson Leighton from comment #1)
> lkcl	octavius, the language of the RFC needs to be more "x is added to y and
> placed into z"	20:12

Thanks for the comment, added the descriptions based on the PowerISA spec.
I left the description of the 'divmod2du' as well, as it has additional explanation (somewhat redundant perhaps, but has the additional background).
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=d27c0ce4fb5360c6893ae982ca84e706c63ec3bf
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5a3a995ab07354c6524ed6a880c97e5205e0046f

> lkcl	also, you removed the tag	20:17
> lkcl	which ensures that ls003 is missing from this auto-generated page:
> https://libre-soc.org/openpower/sv/rfc/	20:17

Yeah, added that back in, didn't know what it was for before.
Do I need to run the Makefile in sv/rfc?
My pandoc (version 2.2.1) threw an error when trying to generate:
--normalize has been removed.  Normalization is now automatic.
--smart/-S has been removed.  Use +smart or -smart extension instead.
For example: pandoc -f markdown+smart -t markdown-smart.
Try pandoc --help for more information.
make: *** [Makefile:4: ls001.pdf] Error 2

(In reply to Jacob Lifshay from comment #2)
> If we don't want to change RS=RT+1, then all mentions of RS=RC need to be
> deleted from the rfc, except when describing svp64's behavior or if
> describing an alternative to what we currently have.

This change also threw me off. Didn't see such a definition before.

Comment 4 Andrey Miroshnikov 2022-10-21 13:24:50 BST

(In reply to Andrey Miroshnikov from comment #3)
> For example: pandoc -f markdown+smart -t markdown-smart.
> Try pandoc --help for more information.
> make: *** [Makefile:4: ls001.pdf] Error 2

Just figured it out, I was trying to use "-smart" as a separate option.
Using "-f markdown-smart" and removing --smart and --normalize (default in this pandoc version) fixed it for me.

Comment 5 Jacob Lifshay 2022-10-25 12:21:40 BST

oh, also, i just noticed: do we wwnt to have xlen in the rfc? if not, it needs to be replaced with 64. iirc we haven't proposed xlen yet, so imho we probably shouldn't start with this rfc.

Comment 6 Luke Kenneth Casson Leighton 2022-10-25 13:34:22 BST

(In reply to Jacob Lifshay from comment #5)
> oh, also, i just noticed: do we wwnt to have xlen in the rfc? if not, it
> needs to be replaced with 64. iirc we haven't proposed xlen yet, so imho we
> probably shouldn't start with this rfc.

yep good point, that needs to be a MASSIVE one-hit RFC as a patch
when the git repo is made public (eventually)

Comment 7 Jacob Lifshay 2022-11-13 05:10:36 GMT

if ls003 isn't submitted yet, we should add maddedus (unsigned bigint * signed word).
https://bugs.libre-soc.org/show_bug.cgi?id=817#c56
and also potentially both muledsbu (signed bigint * unsigned word) and muledsbs (signed bigint * signed word) if it can be made to work with signed bigints -- i have an idea for that...more about that below:

i was thinking about what I said in https://bugs.libre-soc.org/show_bug.cgi?id=817#c57 and think I have a solution -- basically corrections could be applied to RC to account for the previous multiplication being signed when it shouldn't -- not just adding RC, hence not naming madd*.
 if encoding space is an issue, imho Rc should be removed from dsld/dsrd as that is imho much less useful.

all 4 sign combinations of mule/madde are needed because only the msb word in a signed bigint is signed, lower words are always unsigned.

signed a * signed b:
                       a
  lsb        word 0    word 1    word 2   msb
  word 0     u*u       u*u       s*u          sv.muledsbu
b word 1     u*u       u*u       s*u          sv.muledsbu
  word 2     u*s       u*s       s*s          sv.muledsbs
  msb

unsigned a * signed b:
                       a
  lsb        word 0    word 1    word 2   msb
  word 0     u*u       u*u       u*u          sv.maddedu
b word 1     u*u       u*u       u*u          sv.maddedu
  word 2     u*s       u*s       u*s          sv.maddedus
  msb

unsigned a * unsigned b:
                       a
  lsb        word 0    word 1    word 2   msb
  word 0     u*u       u*u       u*u          sv.maddedu
b word 1     u*u       u*u       u*u          sv.maddedu
  word 2     u*u       u*u       u*u          sv.maddedu
  msb

Comment 8 Jacob Lifshay 2022-11-13 08:33:08 GMT

(In reply to Jacob Lifshay from comment #7)
> if ls003 isn't submitted yet, we should add maddedus (unsigned bigint *
> signed word).
> https://bugs.libre-soc.org/show_bug.cgi?id=817#c56
> and also potentially both muledsbu (signed bigint * unsigned word) and
> muledsbs (signed bigint * signed word) if it can be made to work with signed
> bigints -- i have an idea for that...more about that below:
> 
> i was thinking about what I said in
> https://bugs.libre-soc.org/show_bug.cgi?id=817#c57 and think I have a
> solution -- basically corrections could be applied to RC to account for the
> previous multiplication being signed when it shouldn't -- not just adding
> RC, hence not naming madd*.

i figured out the correct code for muledsbu:
muledsbu RT, RA, RB, RC
# correct for RC being high half of signed-unsigned mul
# when we need high half of unsigned mul
v[0:XLEN*2-1] <- [0] * XLEN || (RC)
if (RC)[0] != 0 then
    v[XLEN:XLEN*2-1] <- (RB) + (RC) # note only lower XLEN bits
<!-- no MULUS, so do it manually -->
prod[0:XLEN*2-1] <- [0] * (XLEN * 2)
if (RA)[0] != 0 then
    prod[0:XLEN*2-1] <- -(-(RA) * (RB))
else
    prod[0:XLEN*2-1] <- (RA) * (RB)
v <- v + prod
RT <- v[XLEN:XLEN*2-1]
RS <- v[0:XLEN-1]

i'll figure out muledsbs sunday or monday.

Comment 9 Luke Kenneth Casson Leighton 2022-11-13 10:56:31 GMT

table 13 p1353 1357 1358 Public v3.1, there are 12 spaces available.
using a large percentage of them would be unwise.

Comment 10 Jacob Lifshay 2022-11-13 20:41:05 GMT

(In reply to Luke Kenneth Casson Leighton from comment #9)
> table 13 p1353 1357 1358 Public v3.1, there are 12 spaces available.
> using a large percentage of them would be unwise.

yes, hence why I said:

(In reply to Jacob Lifshay from comment #7)
>  if encoding space is an issue, imho Rc should be removed from dsld/dsrd as
> that is imho much less useful.

one other option is using a spr (ctr? maybe a new 64-bit spr named carry?) instead of rs/rc for those operations that aren't very useful without carrying, such as the muled* and maddedu operations, since the existing maddld/maddhd[u] can just be used instead. maddedus (because it's signed/unsigned mul which doesn't currently exist in PowerISA), divmod2du, dsld/dsrd, and pcdec are quite useful for 128-bit arithmetic without VSX, prefix-code dec/enc, etc. and should still be 4-arg operations.

Comment 11 Luke Kenneth Casson Leighton 2022-11-13 20:54:11 GMT

(In reply to Jacob Lifshay from comment #10)

> one other option is using a spr

nice idea. greatly reduces opcode pressure but throws RS=R5+MAXVL
under a bus

> ctr?

no.

> maybe a new 64-bit spr named carry?)

it becomes part of mandatory context-switch state.
cannot have too many.  share with GF polynomial?

and i just realised, for parallel mode (RS=RT+MAXVL)
a whopping 127 SPRs are needed. drat.

Comment 12 Jacob Lifshay 2022-11-13 23:47:03 GMT

(In reply to Luke Kenneth Casson Leighton from comment #11)
> and i just realised, for parallel mode (RS=RT+MAXVL)
> a whopping 127 SPRs are needed. drat.

imho that isn't a problem...the instructions just won't support RS=RT+MAXVL mode as it isn't very useful for those specific instructions, hence why I limited it to just those instructions that aren't useful except for bigint stuff, where you always want RS=RC.

Comment 13 Luke Kenneth Casson Leighton 2022-11-14 00:38:11 GMT

(In reply to Jacob Lifshay from comment #12)
> (In reply to Luke Kenneth Casson Leighton from comment #11)
> > and i just realised, for parallel mode (RS=RT+MAXVL)
> > a whopping 127 SPRs are needed. drat.
> 
> the instructions just won't support RS=RT+MAXVL

that's not acceptable.

Comment 14 Jacob Lifshay 2022-11-14 00:42:59 GMT

(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Jacob Lifshay from comment #12)
> > (In reply to Luke Kenneth Casson Leighton from comment #11)
> > > and i just realised, for parallel mode (RS=RT+MAXVL)
> > > a whopping 127 SPRs are needed. drat.
> > 
> > the instructions just won't support RS=RT+MAXVL
> 
> that's not acceptable.

why not? if they have a spr output instead, then they don't even have RS or RC at all. why force them to implement a mode that isn't very useful for software (it isn't useful because we specifically limited the spr-versions to those ops where users always want that would be the RS=RC mode) when the motivating reason for having that RS=RT+MAXVL mode ceases to be valid?

Comment 15 Luke Kenneth Casson Leighton 2022-11-14 02:29:29 GMT

(In reply to Jacob Lifshay from comment #14)

> why not? 

parallel processing of independent inputs and production of independent
parallel results.

one single SPR prohibits such programming constructs.

the loss of this option is not acceptable.

this is non-negotiable and this will be the first and only time that
i say that.  i cannot cope any longer with the repeated insistent and
persistent ignoring of decisions over the past three years, i have had
enough.

Comment 16 Jacob Lifshay 2022-11-14 10:40:15 GMT

(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
> 
> > why not? 
> 
> parallel processing of independent inputs and production of independent
> parallel results.
> 
> one single SPR prohibits such programming constructs.

my point was that, for that particular subset of instructions (mostly just signed bigint * [un]signed scalar), support for such programming constructs aren't needed, because software will always want to either use it in dependency-chain mode (where speedups come from wide mul e.g. 256x64->320-bit mul) or can easily use instructions not in that subset to achieve the desired result, hence the restriction isn't preventing any usecases.
> 
> the loss of this option is not acceptable.

ok. do note that adde/subfe has the exact same issue, and you're apparently fine with there not being a vector of CA flags...

we can just remove Rc from dsld/dsrd if needed to make space. imho signed bigint ops are waay more necessary than Rc on dsld/dsrd since they save a bunch of instructions.

for muledsbs (signed bigint * signed word), we may be stuck with the non-independently-parallelizability (sorta, OoO can run multiple vectors in parallel if the queues are big enough) because it's looking like 64 bits isn't enough carry bits, we may need it to also read/write CA:
working pseudocode so far:

def muledsbs(ra, rb, rc, ca):
    if not ca:
        v = rc
    else:
        v = zext(rb + rc, xlen)
    v = sext(ra, xlen) * sext(rb, xlen) + sext(v, xlen)
    rt = zext(v, xlen)
    rs = zext(v >> xlen, xlen)
    ca = sext(ra, xlen) < 0
    return rt, rs, ca

Comment 17 Jacob Lifshay 2022-11-14 11:05:35 GMT

(In reply to Jacob Lifshay from comment #16)
> for muledsbs (signed bigint * signed word), we may be stuck with the
> non-independently-parallelizability (sorta, OoO can run multiple vectors in
> parallel if the queues are big enough) because it's looking like 64 bits
> isn't enough carry bits, we may need it to also read/write CA:
> working pseudocode so far:
> 
> def muledsbs(ra, rb, rc, ca):

note that my current plan for how muledsbs is used is:
# mul little-endian signed bigint in r4-5 by signed word in r3, result in r4-6
setvl VL=2
addic 0, 0, 0 # clear ca
li 6, 0 # clear r6
sv.muledsbs *4, *4, 3, 6
# product now in r4-6

if you can come up with a way to implement muledsbs without needing ca and without instead needing to compute rc // rb, that'd be awesome!

Comment 18 Luke Kenneth Casson Leighton 2022-11-14 12:28:10 GMT

(In reply to Jacob Lifshay from comment #16)

> ok. do note that adde/subfe has the exact same issue,

i am planning a CR-field-base add for exactly that reason.

> for muledsbs (signed bigint * signed word), we may be stuck with the
> non-independently-parallelizability (sorta, OoO can run multiple vectors in
> parallel if the queues are big enough) because it's looking like 64 bits
> isn't enough carry bits, we may need it to also read/write CA:

absolutely not.  that is 4-in 3-out including XER.  please stop spending
time on that instruction, i will never put it forward to the OPF ISA WG.

for future reference, the hard limit is 3-in 2-out.  i will never put
forward to the ISA WG anything over that.  (the GF polynomial ops are a
special case as the entire ALU has a global polynomial which, if changed,
causes a multi-cycle stall)

Comment 19 Luke Kenneth Casson Leighton 2022-11-14 12:55:22 GMT

  75 # [DRAFT] Double-width Shift Left Doubleword
  76 
  77 VA2-Form
  78 
  80 * dsld.   RT,RA,RB,RC  (Rc=1)
  81 

damn damn damn these are also 4-in 3-out which will be such a high cost
in the Dependency Hazard Management that it will be impractical to
implement.  Rc=1 has to be eliminated.

this is what independent parallel (RS=RT+MAXVL) result production helps solve:
the result-pairs may be analysed (cmpi) and a third instruction (sv.or) used
to merge the partial results together.

a similar trick will have to be found for the other operations (using sv.subfe)

Comment 20 Luke Kenneth Casson Leighton 2022-11-14 12:57:12 GMT

(In reply to Luke Kenneth Casson Leighton from comment #19)

> a similar trick will have to be found for the other operations (using
> sv.subfe)

sigh hit send then realised that the [as-yet-still-being-thought-about]
add/sub-by-taking-carry-from-a-CR-field-and-storing-the-carry-out-in-a-CR-as-well
instruction could likely also help

Comment 21 Luke Kenneth Casson Leighton 2023-02-08 21:37:20 GMT

discussion suggests adding signed and unsigned variants

Comment 22 Jacob Lifshay 2023-02-09 01:26:24 GMT

(In reply to Luke Kenneth Casson Leighton from comment #21)
> discussion suggests adding signed and unsigned variants

as I understand it, the suggestion to add both signed and unsigned variants in the meeting today was for shadd, not for dsld/dsrd (though adding signed variants there too might be a good idea).

Comment 23 Luke Kenneth Casson Leighton 2023-03-03 11:22:34 GMT

under bug #1010 the first Draft of ls003 has been submitted.