1183 – add /mrr mode (reverse mode) to Data-Dependent Fail-First CR_ops and "single looping" to DDFFirst in general

Status:	IN_PROGRESS

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	shriya.sharma

URL:	https://libre-soc.org/openpower/sv/cr...

Depends on:	1185
Blocks:	1237 676 1044

Reported:	2023-10-11 15:14 BST by Luke Kenneth Casson Leighton
Modified:	2024-02-26 17:44 GMT
CC List:	4 users

See Also:	1044 1215
NLnet milestone:	NLnet.2022-08-107.ongoing
total budget (EUR) for completion of task and all subtasks:	6000
budget (EUR) for this task, excluding subtasks' budget:	6000
parent task for budget allocation:	1027
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	lkcl=3500 jacob=500 ghostmansd=2000

Description Luke Kenneth Casson Leighton 2023-10-11 15:14:17 BST

it was discovered when implementing pow(x,y,mod) in the cryptoprimitives
grant that "reverse data-dependent fail-first" is needed. this bugreport
needs to implement that, which means that SVP64 CR_ops needs a minor
redesign and associated cascade of underpinnings to both prove and
support the change (spec, insndb, ISACaller, binutils, and unit tests)

---

https://bugs.libre-soc.org/show_bug.cgi?id=1044#c56

* TODO powerdecoder (prefix mostly)
* TODO insndb
* TODO unit test sv.mcrf/mrr/sm=x/dm=y (see comment #2)
* DONE comment #19 unit test sv.cmpi/ff=lt/mr (see comment #4)
* TODO spec changes matching comment #19
* TODO spec changes matching comment #2
* TODO ISACaller unit tests
* TODO binutils
* TODO binutils unit tests


current spec:

|6 | 7 |19:20|21 | 22:23   |  description     |
|--|---|-----|---|---------|------------------|
|/ | / |0  0 |RG | dz  sz  | simple mode                      |
|/ | / |1  0 |RG | dz  sz  | scalar reduce mode (mapreduce) |
|zz|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode      |
|/ |SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

proposed spec draft:

|6 | 7 |19:20|21 | 22:23   |  description     |
|--|---|-----|---|---------|------------------|
|RG| / |0  0 |/  | dz  sz  | simple mode                      |
|RG| / |1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
|RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (implies mapreduce, zz=1)|
|RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies mapreduce, CR-bit from result) |

notes:

* DDFFirst3 has implicit zeroing enabled on src and dest predication
* DDFFirst3 and 5 both imply mapreduce mode
  (continue loop even when result is scalar)
* DDFF enabled yes/no is just bit 20
* however DDFF enabled also enables mapreduce.
* therefore mapreduce is "19:20 is non-zero"
* bit 7 and 21 must be zero if bit 20 is zero (reserved, illegal)
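
The notes above can be sketched as a small decoder. This is only an illustration of the proposed draft, not the insndb or powerdecoder implementation: the function name `decode_cr_ops_mode` and the idea of passing the mode bits as separate integers are assumptions made for the example.

```python
def decode_cr_ops_mode(bit7, bit19, bit20, bit21):
    """Classify a proposed-draft CR_ops mode from its mode bits.

    Hypothetical helper: bit numbers follow the column headings of the
    proposed table; a real decoder would extract them from the prefix.
    """
    ddff = bit20 == 1                      # DDFF enabled is just bit 20
    mapreduce = (bit19, bit20) != (0, 0)   # mapreduce is "19:20 is non-zero"
    if not ddff and (bit7 or bit21):
        # bits 7 and 21 are reserved (illegal) when bit 20 is zero
        raise ValueError("reserved: bits 7 and 21 must be zero if bit 20 is zero")
    if ddff:
        mode = "ffirst"      # bit 19 is VLI, bit 7 is SNZ, bit 21 is inv
    elif bit19:
        mode = "mapreduce"
    else:
        mode = "simple"
    return {"mode": mode, "ddff": ddff, "mapreduce": mapreduce}
```

Whether Ffirst is the 3-bit or the 5-bit variant depends on the instruction's operand width rather than on these bits, so the sketch does not try to distinguish them.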

Comment 1 Jacob Lifshay 2023-10-12 04:51:12 BST

(In reply to Luke Kenneth Casson Leighton from comment #0)
> it was discovered when implementing pow(x,y,mod) in the cryptoprimitives
> grant that "reverse data-dependent fail-first" is needed. this bugreport
> needs to implement that, but there is only room to do so for the 5-bit
> mode of CR_ops. fortunately this covers sv.cmpi which is a critically
> important use-case.

are you sure it covers sv.cmpi? cmpi has BF as the destination field which is 3 bits, not 5.

(In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)
> |6 | 7 |19:20|21 | 22:23   |  description     |
> |--|---|-----|---|---------|------------------|
> |/ | / |0  0 |RG | dz  sz  | simple mode                      |
> |/ | / |1  0 |RG | dz  sz  | scalar reduce mode (mapreduce) |
> |zz|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode      |
> |/ |SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

actually, we *can* still fit DD-FF 3-bit, all we need is to squeeze it into the unused space of scalar reduce mode, dropping some less-common flags (SNZ and zz):
> |6 | 7 |19:20|21 | 22:23   |  description     |
> |--|---|-----|---|---------|------------------|
> |0 | / |1  0 |RG | dz  sz  | scalar reduce mode (mapreduce) |
> |1 |VLI|1  0 |inv|  CR-bit | reversed Ffirst 3-bit mode      |
> |RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

though, now that I look at it more closely, do we need SNZ that much? if we don't, we can replace it with RG

Comment 2 Luke Kenneth Casson Leighton 2023-10-12 10:14:09 BST

(In reply to Jacob Lifshay from comment #1)

> are you sure it covers sv.cmpi? cmpi has BF as the destination field which
> is 3 bits, not 5.

arse. you're right.
 
> (In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)

> though, now that I look at it more closely, do we need SNZ that much?

yes. it provides the boolean-logic equivalent of AND OR NAND and NOR.
i learned this trick on thinking through the design of sv.bc. have
a look and you'll see why it's important.

damnit i hate doing redesigns of SV this late in the game.

 |6 | 7 |19:20|21 | 22:23   |  description     |
 |--|---|-----|---|---------|------------------|
 |RG|SNZ|0  0 |/  | dz  sz  | simple mode                      |
 |RG|SNZ|1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
 |RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (implicit zz=1)     |
 |RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

that preserves the 

and just have to specify that zz=1 implicitly for 3-bit mode.
if people *really* want dz=sz=0 (twin-predication or single)
then they can take a copy of the relevant CR-bit-vector
using "sv.mcrf/sm=X/dm=y" to perform the compress/expand...
and *then* do sv.cmpi/mrr

ohhh... sigh, really must do a unit test sv.mcrf/mrr/sm=x/dm=y

Comment 3 Jacob Lifshay 2023-10-12 19:51:38 BST

(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > (In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)
> 
> > though, now that I look at it more closely, do we need SNZ that much?
> 
> yes. it provides the boolean-logic equivalent of AND OR NAND and NOR.
> i learned this trick on thinking through the design of sv.bc. have
> a look and you'll see why it's important.

ok, for fail first I can see that SNZ is intended to let you decide if the fail-first loop stops at the first predicated-off element or not. This needs to be clarified in the spec.
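
A minimal sketch of that reading of SNZ, under loudly stated assumptions: zeroing is on (zz=1), a predicated-off element is tested as if its CR bit were the value of SNZ, and the loop stops when the tested bit differs from `inv`. The function `ddff_truncate` and its parameters are hypothetical names for illustration; the authoritative behaviour is whatever the spec pseudocode and ISACaller define.

```python
def ddff_truncate(cr_bits, mask, snz, inv=False):
    """Return the truncated VL for a fail-first loop with zz=1 (assumed model)."""
    for i, (bit, enabled) in enumerate(zip(cr_bits, mask)):
        # Assumption: a predicated-off element is tested as SNZ, not as its CR bit.
        tested = bool(bit) if enabled else bool(snz)
        if tested != inv:
            return i          # stop here: VL is truncated to i
    return len(cr_bits)       # no element stopped the loop


# Element 1 is predicated off; element 2 is the first real "hit".
bits = [False, False, True, False]
mask = [True, False, True, True]
print(ddff_truncate(bits, mask, snz=False))  # loop runs past the masked element
print(ddff_truncate(bits, mask, snz=True))   # loop stops at the masked element
```

In this model SNZ only matters because there is a test that can end the loop, which is the point being argued in the following comments.
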
> 
> damnit i hate doing redesigns of SV this late in the game.
> 
>  |6 | 7 |19:20|21 | 22:23   |  description     |
>  |--|---|-----|---|---------|------------------|
>  |RG|SNZ|0  0 |/  | dz  sz  | simple mode                      |
>  |RG|SNZ|1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
>  |RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (implicit zz=1)     |
>  |RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

imo SNZ is quite low priority for simple and mapreduce modes since all it can do is set the output to 1, it has no effect like in fail-first. I think higher priority is supporting zz=0 for fail-first 3-bit (where SNZ doesn't matter anyway because we're not zeroing):

|6 | 7 |19:20|21 | 22:23   |  description     |
|--|---|-----|---|---------|------------------|
|RG|0  |0  0 |/  | dz  sz  | simple mode                      |
|RG|0  |1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
|RG|1  |VLI 0|inv|  CR-bit | Ffirst 3-bit mode (zz=0) |
|RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (zz=1) |
|RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |

Another thing I realized while working on divmod, it would be really handy if stuff with a scalar destination would run through all elements instead of stopping at the first, I didn't check if we do that. This is useful for not overwriting all the CRs that you could be storing some other variable in.

e.g.:
sv.cmpi/ff=lt 0, 1, *10, 5
is:
i = 0
while i < VL:
    CR0 = cmpd(gpr[10 + i], 5)
    if CR0.lt:
        break
    i += 1
VL = i

note how only CR0 is ever written and yet the whole vector loop is run

Comment 4 Luke Kenneth Casson Leighton 2023-10-12 20:54:44 BST

(In reply to Jacob Lifshay from comment #3)

> imo SNZ is quite low priority for simple and mapreduce modes since all it
> can do is set the output to 1,

no: it inverts the *predicate* bit, otherwise set to zero,
so that it is TRUE rather than false.

again i reiterate: see sv.bc

> it has no effect like in fail-first. 

this is not true.

> I think higher priority is supporting zz=0 

your opinion here is invalid as you have misunderstood SNZ.

> Another thing I realized while working on divmod, it would be really handy
> if stuff with a scalar destination would run through all elements instead of
> stopping at the first, I didn't check if we do that.

that's exactly what /mr does. jacob this has been in the spec for
well over 18 months, possibly as long as 2 years. i am a little
alarmed that you are suggesting things that have been in the spec
that long.

although i will need to evaluate below to check...

>  This is useful for not
> overwriting all the CRs that you could be storing some other variable in.
> 
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i

yep that's ddffirst with mapreduce.

> note how only CR0 is ever written and yet the whole vector loop is run

that will - should - be sv.cmpi/ff=lt/mr
(this is another good unit test to have, will
 add it to the list)

/mr simply switches off the normal "terminate at first scalar"
(which with no predication is the first element).

mr can be *used* for reduction because a scalar may be used as
src and dest and therefore as an accumulator. but it has many
more uses.
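
As a rough illustration of the /mr description above (no predication, so the first scalar is the first element), the following sketch models an add whose destination is a scalar that is also a source. `scalar_dest_add` and its arguments are hypothetical names, not ISACaller code.

```python
def scalar_dest_add(acc, src, vl, mapreduce):
    """Model `acc = acc + src[i]` where acc is a scalar source and destination."""
    for i in range(vl):
        acc = acc + src[i]
        if not mapreduce:
            break   # normal behaviour: terminate at the first scalar result
    return acc


values = [1, 2, 3, 4]
print(scalar_dest_add(0, values, 4, mapreduce=False))  # only element 0 is processed
print(scalar_dest_add(0, values, 4, mapreduce=True))   # whole vector is accumulated
```

The same loop shape applies to any operation with a scalar destination, which is why accumulation is only one of the uses.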

Comment 5 Jacob Lifshay 2023-10-12 21:15:10 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
> 
> > imo SNZ is quite low priority for simple and mapreduce modes since all it
> > can do is set the output to 1,
> 
> no: it inverts the *predicate* bit, otherwise set to zero,
> so that it is TRUE rather than false.
> 
> again i reiterate: see sv.bc

ok. same general effect in terms of cross-element decisions.
> 
> > it has no effect like in fail-first. 

SNZ still has no effect in simple and map-reduce modes, because SNZ is only relevant when there is a test that occurs across elements, rather than only within each element.

please write out a loop of what you think SNZ should do in simple mode. I expect that you will find it to be entirely redundant.

> > Another thing I realized while working on divmod, it would be really handy
> > if stuff with a scalar destination would run through all elements instead of
> > stopping at the first, I didn't check if we do that.
> 
> that's exactly what /mr does. 

I know that, the problem is /mr can not be set at the same time as fail-first because they are options in a 2-bit enum-like field (you can pick *only one* of simple, map-reduce, or fail-first). so either fail-first implies /mr behavior or not. I think it should imply /mr behavior.

Comment 6 Jacob Lifshay 2023-10-12 21:22:27 BST

(In reply to Jacob Lifshay from comment #5)
> please write out a loop of what you think SNZ should do in simple mode. I
> expect that you will find it to be entirely redundant.

I expect you'll end up with something like:

for i in range(VL):
    if CR[i].lt:
        RT[i] = RA[i] + RB[i]
        pred = nothing-really
    else:
        pred = SNZ
    # nothing uses pred!

Comment 7 Luke Kenneth Casson Leighton 2023-10-12 21:30:26 BST

(In reply to Jacob Lifshay from comment #5)

> SNZ still has no effect in simple and map-reduce modes, because SNZ is only
> relevant when there is a test that occurs across elements, rather than only
> within each element.
> 
> please write out a loop of what you think SNZ should do in simple mode. I
> expect that you will find it to be entirely redundant.

it's not. i leave you to write it out. remember i am at total burn-out.
confrontational approaches like this are just dangerous, now.
anyone asks me to do *anything* the automatic answer is "100% i will not".
however if *you* do it i will gladly walk through it and answer questions.

it also does not help that i am having difficulty recalling tests
and spec pages i wrote 6-18 months ago.

> > > Another thing I realized while working on divmod, it would be really handy
> > > if stuff with a scalar destination would run through all elements instead of
> > > stopping at the first, I didn't check if we do that.
> > 
> > that's exactly what /mr does. 
> 
> I know that, the problem is /mr can not be set at the same time as
> fail-first because they are options in a 2-bit enum-like field

arse, you're right. this stuff has dropped out of my electrical
memory.

> (you can pick
> *only one* of simple, map-reduce, or fail-first). so either fail-first
> implies /mr behavior or not. I think it should imply /mr behavior.

let me think about it. can you please also walk through the implications
for if anyone would *want* scalar destination and terminate at first?
both HF and VF.

Comment 8 Luke Kenneth Casson Leighton 2023-10-12 21:39:20 BST

(In reply to Jacob Lifshay from comment #6)

> I expect you'll end up with something like:
> 
> for i in range(VL):
>     if CR[i].lt:
>         RT[i] = RA[i] + RB[i]
>         pred = nothing-really
>     else:
>         pred = SNZ
>     # nothing uses pred!

this pseudocode is wrong, it is not how SNZ works with
DDFF. please read the spec pseudocode and if nothing
else ISACaller.

Comment 9 Jacob Lifshay 2023-10-12 21:41:45 BST

(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #6)
> 
> > I expect you'll end up with something like:
> > 
> > for i in range(VL):
> >     if CR[i].lt:
> >         RT[i] = RA[i] + RB[i]
> >         pred = nothing-really
> >     else:
> >         pred = SNZ
> >     # nothing uses pred!
> 
> this pseudocode is wrong, it is not how SNZ works with
> DDFF. please read the spec pseudocode and if nothing
> else ISACaller.

this is for *simple mode*, which is (also with map-reduce mode) where I'm now objecting to introducing SNZ. I am no longer objecting to SNZ in dd-ff.

Comment 10 Jacob Lifshay 2023-10-12 21:42:44 BST

(In reply to Luke Kenneth Casson Leighton from comment #7)
> it's not. i leave you to write it out. remember i am at total burn-out.

i remembered only after clicking send. sorry.

Comment 11 Jacob Lifshay 2023-10-12 21:54:43 BST

(In reply to Luke Kenneth Casson Leighton from comment #8)
> it is not how SNZ works with
> DDFF. please read the spec pseudocode and if nothing
> else ISACaller.

I am unable to locate pseudo-code on the wiki for cr-based dd-ff, I tried searching for data-dependent and searching ls010 (iirc includes all relevant spec pages) for SNZ, the only pseudocode i saw was for branches. This needs to be corrected. I'll report a bug. I'll look at ISACaller next.

Comment 12 Luke Kenneth Casson Leighton 2023-10-13 04:39:42 BST

(In reply to Jacob Lifshay from comment #9)

> this is for *simple mode*, which is (also with map-reduce mode) where I'm
> now objecting to introducing SNZ. 

look again at the table. SNZ is not present. it depends critically
on there being some sort of test-and-stop (sv.bc has branch-fail/success,
DDFF has "end-loop")

there is no SNZ for simple mode.

this is how it works:

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/branches.mdwn;h=7aea7936fcff5fb7709bae9459255c593185af4d;hb=a3f5eea083533cb3828f0d2a4375c49e2a7b30f1#l598

Comment 13 Jacob Lifshay 2023-10-13 04:41:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)
> look again at the table. SNZ is not present.

This table:
(In reply to Luke Kenneth Casson Leighton from comment #2)
>  |6 | 7 |19:20|21 | 22:23   |  description     |
>  |--|---|-----|---|---------|------------------|
>  |RG|SNZ|0  0 |/  | dz  sz  | simple mode                      |
>  |RG|SNZ|1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
>  |RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (implicit zz=1)     |
>  |RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result)|

Comment 14 Luke Kenneth Casson Leighton 2023-10-13 08:59:50 BST

(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > look again at the table. SNZ is not present.
> 
> This table:

current spec

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/cr_ops.mdwn;h=d4e81fc46e32cde93ec92c0d764ee16706b4fea9;hb=a3f5eea083533cb3828f0d2a4375c49e2a7b30f1#l79

SNZ cannot be added to simple or mapreduce as it is meaningless.

> (In reply to Luke Kenneth Casson Leighton from comment #2)
> >  |6 | 7 |19:20|21 | 22:23   |  description     |
> >  |--|---|-----|---|---------|------------------|
> >  |RG|SNZ|0  0 |/  | dz  sz  | simple mode                      |
> >  |RG|SNZ|1  0 |/  | dz  sz  | scalar reduce mode (mapreduce) |
> >  |RG|SNZ|VLI 1|inv|  CR-bit | Ffirst 3-bit mode (implicit zz=1)     |
> >  |RG|SNZ|VLI 1|inv|  dz sz  | Ffirst 5-bit mode (implies CR-bit from result)|

comment #2, that's me...

ah my mistake, i am having difficulty recalling the spec, it has
been too many months since i looked at this and my knowledge is
no longer in "electrical neuron memory", you'll need to be patient,
it will take a couple of days of thinking re-reading and conversation
for everything to come back.

Comment 15 Luke Kenneth Casson Leighton 2023-10-13 09:14:47 BST

ok i have edited comment #0 to put current and proposed spec change,
must go over implications for HF and VF. thought experiment necessary.
some simple examples. jacob can you illustrate what c code you'd
like to use sv.cmpi/ff/rg with? i can kinda grok it but for due diligence
it is good to spell out (and will make the basis of an illustrative
unit test)

Comment 16 Luke Kenneth Casson Leighton 2023-11-17 15:04:12 GMT

(In reply to Jacob Lifshay from comment #3)

> Another thing I realized while working on divmod, it would be really handy
> if stuff with a scalar destination would run through all elements instead of
> stopping at the first, I didn't check if we do that. This is useful for not
> overwriting all the CRs that you could be storing some other variable in.
> 
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i
> 
> note how only CR0 is ever written and yet the whole vector loop is run

yes, i just discovered the exact same thing is needed for Fortran MAXLOC
(https://bugs.libre-soc.org/show_bug.cgi?id=676#c26)

there it is necessary to use RT=RA=scalar on sv.maxs./ff=lt to get it
to "track and return" the largest number *in a scalar* of the vector
of source numbers.

this implies *changing normal mode as well*

Comment 17 shriya.sharma 2023-12-07 17:17:30 GMT

Summary:
Data-dependent fail-on-first: learnt the concept using the strncpy example.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py

Gained an understanding of Condition Registers (CR) and the compare function.
strncpy copies the contents of one string to another. Once a null character is found, the remaining bytes of the destination are padded with further null characters.

In the example, the fail-first instruction truncates VL at the first zero byte. Storing only VL bytes then avoids overrunning memory, and the address in r12 is updated.
To pad the remaining bytes with zeroes, CTR is decreased by the vector length and the loop stops when it reaches 0.
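
The strncpy behaviour summarised above can be modelled in plain Python. This is a sketch of the idea only, not a transcription of the unit test: `strncpy_ff`, `maxvl` and the variable `ctr` (standing in for CTR) are names chosen for the example, and the source is assumed to be null-terminated or at least `n` bytes long.

```python
def strncpy_ff(src, n, maxvl=4):
    """Copy up to n bytes from src, zero-padding after the first null byte."""
    dest = bytearray()
    ctr = n              # bytes still to be written, playing the role of CTR
    pos = 0
    hit_null = False
    while ctr > 0 and not hit_null:
        chunk = src[pos:pos + min(maxvl, ctr)]
        vl = len(chunk)
        if vl == 0:
            break        # source exhausted: treat like a terminator
        # fail-first: truncate VL at the first zero byte in this chunk
        for i, byte in enumerate(chunk):
            if byte == 0:
                vl = i
                hit_null = True
                break
        dest += chunk[:vl]   # store only VL bytes, so nothing is overrun
        pos += vl
        ctr -= vl            # decrease the counter by the (truncated) VL
    dest += bytes(ctr)       # pad whatever remains with zero bytes
    return bytes(dest)


print(strncpy_ff(b"hello\0world", 8))
```

With the input shown, the second chunk is truncated at the null byte, five bytes are copied in total, and the remaining three are zero padding.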

Comment 18 shriya.sharma 2023-12-07 17:43:28 GMT

(In reply to Jacob Lifshay from comment #3)
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i
> 
> note how only CR0 is ever written and yet the whole vector loop is run

The code below contrasts with the one above: it uses a vector
destination (*0) in place of the scalar 0.

sv.cmpi/ff=lt *0, 1, *10, 5
is:
i = 0
while i < VL:
    CR[i] = cmpd(gpr[10 + i], 5)
    if CR[i].lt:
        break
    i += 1
VL = i
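
To make the scalar/vector contrast concrete, here is a runnable model of both loops. It is a sketch under assumptions: `cmpd` is modelled as a plain signed compare returning lt/gt/eq flags, `gpr` is a Python list, CR fields are numbered simply 0, 1, 2..., and `cmpi_ff_lt` is a hypothetical helper, not ISACaller code.

```python
def cmpd(a, b):
    """Simplified compare: returns the lt/gt/eq flags of one CR field."""
    return {"lt": a < b, "gt": a > b, "eq": a == b}


def cmpi_ff_lt(gpr, base, imm, vl, vector_dest):
    """Model sv.cmpi/ff=lt with a scalar (0) or vector (*0) destination."""
    cr = {}
    i = 0
    while i < vl:
        field = i if vector_dest else 0   # scalar dest always writes CR0
        cr[field] = cmpd(gpr[base + i], imm)
        if cr[field]["lt"]:
            break
        i += 1
    return i, cr                          # i is the truncated VL


gpr = [0] * 10 + [9, 7, 3, 8]
vl_s, cr_s = cmpi_ff_lt(gpr, 10, 5, 4, vector_dest=False)
vl_v, cr_v = cmpi_ff_lt(gpr, 10, 5, 4, vector_dest=True)
print(vl_s, sorted(cr_s))   # same VL, only CR field 0 written
print(vl_v, sorted(cr_v))   # same VL, one CR field per element tested
```

Both variants truncate VL at the same element; the difference is only in how many CR fields are overwritten along the way.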

Comment 19 Luke Kenneth Casson Leighton 2023-12-09 09:14:18 GMT

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b077590

ok this unit test now works, that unblocks both maxloc bug #676

Comment 20 Luke Kenneth Casson Leighton 2023-12-11 02:44:53 GMT

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=ea249820c03

update spec "quirks" page to describe ddffirst has implicit mapreduce
mode

Comment 21 Jacob Lifshay 2023-12-12 02:46:12 GMT

The motivating example: https://bugs.libre-soc.org/show_bug.cgi?id=1044#c80