it was discovered when implementing pow(x,y,mod) in the cryptoprimitives grant that "reverse data-dependent fail-first" is needed. this bugreport needs to implement that, which means that SVP64 CR_ops needs a minor redesign and associated cascade of underpinnings to both prove and support the change (spec, insndb, ISACaller, binutils, and unit tests)

---

https://bugs.libre-soc.org/show_bug.cgi?id=1044#c56

* TODO powerdecoder (prefix mostly)
* TODO insndb
* TODO unit test sv.mcrf/mrr/sm=x/dm=y (see comment #2)
* DONE comment #19 unit test sv.cmpi/ff=lt/mr (see comment #4)
* TODO spec changes matching comment #19
* TODO spec changes matching comment #2
* TODO ISACaller unit tests
* TODO binutils
* TODO binutils unit tests

current spec:

|6 | 7 |19:20|21 | 22:23   | description      |
|--|---|-----|---|---------|------------------|
|/ | / |0 0  |RG | dz sz   | simple mode      |
|/ | / |1 0  |RG | dz sz   | scalar reduce mode (mapreduce) |
|zz|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode |
|/ |SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

proposed spec draft:

|6 | 7 |19:20|21 | 22:23   | description      |
|--|---|-----|---|---------|------------------|
|RG| / |0 0  |/  | dz sz   | simple mode      |
|RG| / |1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
|RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (implies mapreduce, zz=1)|
|RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies mapreduce, CR-bit from result) |

notes:

* DDFFirst3 has implicit zeroing enabled on src and dest predication
* DDFFirst3 and 5 both imply mapreduce mode (continue loop even when result is scalar)
* DDFF enabled yes/no is just bit 20
* however DDFF enabled also enables mapreduce.
* therefore mapreduce is "19:20 is non-zero"
* bit 7 and 21 must be zero if bit 20 is zero (reserved, illegal)
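The notes on the proposed draft can be read as a small decode routine. What follows is only a minimal Python sketch of that reading, not the powerdecoder or insndb code: the function name `decode_cr_ops_mode`, the dict it returns, and the `three_bit` argument (assumed to come from the instruction's operand form, since the table does not encode it) are all hypothetical, as is the numbering of the CR-bit as `(bit22 << 1) | bit23`.

```python
def decode_cr_ops_mode(b6, b7, b19, b20, b21, b22, b23, three_bit):
    """Hypothetical decode of the *proposed* CR_ops mode bits (each 0 or 1).

    three_bit: True when the instruction has a 3-bit CR field operand
    (assumption: supplied by the caller, not by the mode bits).
    """
    ffirst = b20 == 1                    # "DDFF enabled yes/no is just bit 20"
    mapreduce = (b19, b20) != (0, 0)     # "mapreduce is 19:20 is non-zero"
    if not ffirst and (b7 or b21):
        # "bit 7 and 21 must be zero if bit 20 is zero (reserved, illegal)"
        raise ValueError("reserved encoding: bits 7/21 set without fail-first")

    mode = {"RG": bool(b6), "ffirst": ffirst, "mapreduce": mapreduce}
    if ffirst:
        mode.update(SNZ=bool(b7), VLI=bool(b19), inv=bool(b21))
        if three_bit:
            # 3-bit mode: 22:23 selects the CR bit, zeroing is implicit (zz=1)
            mode.update(cr_bit=(b22 << 1) | b23, dz=True, sz=True)
        else:
            # 5-bit mode: CR bit comes from the result operand, 22:23 is dz sz
            mode.update(dz=bool(b22), sz=bool(b23))
    else:
        mode.update(dz=bool(b22), sz=bool(b23))
    return mode
```

Taking each bit as a separate argument avoids having to assume how a multi-bit field such as 19:20 would be packed into an integer.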
(In reply to Luke Kenneth Casson Leighton from comment #0)
> it was discovered when implementing pow(x,y,mod) in the cryptoprimitives
> grant that "reverse data-dependent fail-first" is needed. this bugreport
> needs to implement that, but there is only room to do so for the 5-bit
> mode of CR_ops. fortunately this covers sv.cmpi which is a critically
> important use-case.

are you sure it covers sv.cmpi? cmpi has BF as the destination field which is 3 bits, not 5.

(In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)
> |6 | 7 |19:20|21 | 22:23   | description      |
> |--|---|-----|---|---------|------------------|
> |/ | / |0 0  |RG | dz sz   | simple mode      |
> |/ | / |1 0  |RG | dz sz   | scalar reduce mode (mapreduce) |
> |zz|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode |
> |/ |SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

actually, we *can* still fit DD-FF 3-bit, all we need is to squeeze it into the unused space of scalar reduce mode, dropping some less-common flags (SNZ and zz):

> |6 | 7 |19:20|21 | 22:23   | description      |
> |--|---|-----|---|---------|------------------|
> |0 | / |1 0  |RG | dz sz   | scalar reduce mode (mapreduce) |
> |1 |VLI|1 0  |inv| CR-bit  | reversed Ffirst 3-bit mode |
> |RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

though, now that I look at it more closely, do we need SNZ that much? if we don't, we can replace it with RG
(In reply to Jacob Lifshay from comment #1)
> are you sure it covers sv.cmpi? cmpi has BF as the destination field which
> is 3 bits, not 5.

arse. you're right.

> (In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)
> though, now that I look at it more closely, do we need SNZ that much?

yes. it provides the boolean-logic equivalent of AND OR NAND and NOR. i learned this trick on thinking through the design of sv.bc. have a look and you'll see why it's important.

damnit i hate doing redesigns of SV this late in the game.

|6 | 7 |19:20|21 | 22:23   | description      |
|--|---|-----|---|---------|------------------|
|RG|SNZ|0 0  |/  | dz sz   | simple mode      |
|RG|SNZ|1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
|RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (implicit zz=1) |
|RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

that preserves SNZ, and we just have to specify that zz=1 is implicit for 3-bit mode. if people *really* want dz=sz=0 (twin-predication or single) then they can take a copy of the relevant CR-bit-vector using "sv.mcrf/sm=X/dm=y" to perform the compress/expand... and *then* do sv.cmpi/mrr

ohhh... sigh, really must do a unit test sv.mcrf/mrr/sm=x/dm=y
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > (In reply to Luke Kenneth Casson Leighton from bug 1044 comment #56)
> > > though, now that I look at it more closely, do we need SNZ that much?
>
> yes. it provides the boolean-logic equivalent of AND OR NAND and NOR.
> i learned this trick on thinking through the design of sv.bc. have
> a look and you'll see why it's important.

ok, for fail first I can see that SNZ is intended to let you decide if the fail-first loop stops at the first predicated-off element or not. This needs to be clarified in the spec.

>
> damnit i hate doing redesigns of SV this late in the game.
>
> |6 | 7 |19:20|21 | 22:23   | description      |
> |--|---|-----|---|---------|------------------|
> |RG|SNZ|0 0  |/  | dz sz   | simple mode      |
> |RG|SNZ|1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
> |RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (implicit zz=1) |
> |RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

imo SNZ is quite low priority for simple and mapreduce modes since all it can do is set the output to 1, it has no effect like in fail-first. I think higher priority is supporting zz=0 for fail-first 3-bit (where SNZ doesn't matter anyway because we're not zeroing):

|6 | 7 |19:20|21 | 22:23   | description      |
|--|---|-----|---|---------|------------------|
|RG|0  |0 0  |/  | dz sz   | simple mode      |
|RG|0  |1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
|RG|1  |VLI 0|inv| CR-bit  | Ffirst 3-bit mode (zz=0) |
|RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (zz=1) |
|RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result) |

Another thing I realized while working on divmod, it would be really handy if stuff with a scalar destination would run through all elements instead of stopping at the first, I didn't check if we do that. This is useful for not overwriting all the CRs that you could be storing some other variable in.
e.g.:
sv.cmpi/ff=lt 0, 1, *10, 5
is:
i = 0
while i < VL:
    CR0 = cmpd(gpr[10 + i], 5)
    if CR0.lt:
        break
    i += 1
VL = i

note how only CR0 is ever written and yet the whole vector loop is run
(In reply to Jacob Lifshay from comment #3)
> imo SNZ is quite low priority for simple and mapreduce modes since all it
> can do is set the output to 1,

no: it inverts the *predicate* bit, otherwise set to zero, so that it is TRUE rather than false.

again i reiterate: see sv.bc

> it has no effect like in fail-first.

this is not true.

> I think higher priority is supporting zz=0

your opinion here is invalid as you have misunderstood SNZ.

> Another thing I realized while working on divmod, it would be really handy
> if stuff with a scalar destination would run through all elements instead of
> stopping at the first, I didn't check if we do that.

that's exactly what /mr does. jacob this has been in the spec for well over 18 months, possibly as long as 2 years. i am a little alarmed that you are suggesting things that have been in the spec that long. although i will need to evaluate below to check...

> This is useful for not
> overwriting all the CRs that you could be storing some other variable in.
>
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i

yep that's ddffirst with mapreduce.

> note how only CR0 is ever written and yet the whole vector loop is run

that will - should - be sv.cmpi/ff=lt/mr (this is another good unit test to have, will add it to the list)

/mr simply switches off the normal "terminate at first scalar" (which with no predication is the first element).

mr can be *used* for reduction because a scalar may be used as src and dest and therefore as an accumulator. but it has many more uses.
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
>
> > imo SNZ is quite low priority for simple and mapreduce modes since all it
> > can do is set the output to 1,
>
> no: it inverts the *predicate* bit, otherwise set to zero,
> so that it is TRUE rather than false.
>
> again i reiterate: see sv.bc

ok. same general effect in terms of cross-element decisions.

>
> > it has no effect like in fail-first.

SNZ still has no effect in simple and map-reduce modes, because SNZ is only relevant when there is a test that occurs across elements, rather than only within each element.

please write out a loop of what you think SNZ should do in simple mode. I expect that you will find it to be entirely redundant.

> > Another thing I realized while working on divmod, it would be really handy
> > if stuff with a scalar destination would run through all elements instead of
> > stopping at the first, I didn't check if we do that.
>
> that's exactly what /mr does.

I know that, the problem is /mr can not be set at the same time as fail-first because they are options in a 2-bit enum-like field (you can pick *only one* of simple, map-reduce, or fail-first). so either fail-first implies /mr behavior or not. I think it should imply /mr behavior.
(In reply to Jacob Lifshay from comment #5)
> please write out a loop of what you think SNZ should do in simple mode. I
> expect that you will find it to be entirely redundant.

I expect you'll end up with something like:

for i in range(VL):
    if CR[i].lt:
        RT[i] = RA[i] + RB[i]
        pred = nothing-really
    else:
        pred = SNZ
    # nothing uses pred!
(In reply to Jacob Lifshay from comment #5)
> SNZ still has no effect in simple and map-reduce modes, because SNZ is only
> relevant when there is a test that occurs across elements, rather than only
> within each element.
>
> please write out a loop of what you think SNZ should do in simple mode. I
> expect that you will find it to be entirely redundant.

it's not. i leave you to write it out. remember i am at total burn-out. confrontational approaches like this are just dangerous, now. anyone asks me to do *anything* the automatic answer is "100 i will not". however if *you* do it i will gladly walk through it and answer questions.

it also does not help that i am having difficulty recalling tests and spec pages i wrote 6-18 months ago.

> > > Another thing I realized while working on divmod, it would be really handy
> > > if stuff with a scalar destination would run through all elements instead of
> > > stopping at the first, I didn't check if we do that.
> >
> > that's exactly what /mr does.
>
> I know that, the problem is /mr can not be set at the same time as
> fail-first because they are options in a 2-bit enum-like field

arse, you're right. this stuff has dropped out of my electrical memory.

> (you can pick
> *only one* of simple, map-reduce, or fail-first). so either fail-first
> implies /mr behavior or not. I think it should imply /mr behavior.

let me think about it. can you please also walk through the implications for if anyone would *want* scalar destination and terminate at first? both HF and VF.
(In reply to Jacob Lifshay from comment #6)
> I expect you'll end up with something like:
>
> for i in range(VL):
>     if CR[i].lt:
>         RT[i] = RA[i] + RB[i]
>         pred = nothing-really
>     else:
>         pred = SNZ
>     # nothing uses pred!

this pseudocode is wrong, it is not how SNZ works with DDFF. please read the spec pseudocode and if nothing else ISACaller.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #6)
>
> > I expect you'll end up with something like:
> >
> > for i in range(VL):
> >     if CR[i].lt:
> >         RT[i] = RA[i] + RB[i]
> >         pred = nothing-really
> >     else:
> >         pred = SNZ
> >     # nothing uses pred!
>
> this pseudocode is wrong, it is not how SNZ works with
> DDFF. please read the spec pseudocode and if nothing
> else ISACaller.

this is for *simple mode*, which is (also with map-reduce mode) where I'm now objecting to introducing SNZ. I am no longer objecting to SNZ in dd-ff.
(In reply to Luke Kenneth Casson Leighton from comment #7)
> it's not. i leave you to write it out. remember i am at total burn-out.

i remembered only after clicking send. sorry.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> it is not how SNZ works with
> DDFF. please read the spec pseudocode and if nothing
> else ISACaller.

I am unable to locate pseudo-code on the wiki for cr-based dd-ff, I tried searching for data-dependent and searching ls010 (iirc includes all relevant spec pages) for SNZ, the only pseudocode i saw was for branches. This needs to be corrected. I'll report a bug.

I'll look at ISACaller next.
(In reply to Jacob Lifshay from comment #9)
> this is for *simple mode*, which is (also with map-reduce mode) where I'm
> now objecting to introducing SNZ.

look again at the table. SNZ is not present. it depends critically on there being some sort of test-and-stop (sv.bc has branch-fail/success, DDFF has "end-loop")

there is no SNZ for simple mode.

this is how it works:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/branches.mdwn;h=7aea7936fcff5fb7709bae9459255c593185af4d;hb=a3f5eea083533cb3828f0d2a4375c49e2a7b30f1#l598
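To make the "test-and-stop" dependency concrete, here is one possible reading of comments #3 and #4 as a runnable Python sketch. It is an illustration only and is not taken from the spec or from ISACaller: the function name `ffirst_vl_with_snz` and its list-based arguments are hypothetical, and it assumes zeroing is active so that a predicated-off element supplies SNZ as its test bit instead of being skipped.

```python
def ffirst_vl_with_snz(test_bits, predicate, snz, inv=False):
    """Return the truncated VL of a fail-first loop (hypothetical model).

    test_bits: per-element value of the selected CR bit (0 or 1)
    predicate: per-element predicate mask bit (0 = predicated off)
    snz:       value substituted for the test bit of a predicated-off element
    inv:       when True the loop stops on a 0 test bit instead of a 1
    """
    for i, (bit, enabled) in enumerate(zip(test_bits, predicate)):
        if not enabled:
            bit = snz            # assumption: zeroing makes the element test as SNZ
        if bool(bit) != inv:     # the test-and-stop: end the loop at this element
            return i
    return len(test_bits)        # no element stopped the loop: VL is unchanged


# Element 1 is predicated off; SNZ decides whether the loop stops there.
print(ffirst_vl_with_snz([0, 0, 1, 0], [1, 0, 1, 1], snz=0))
print(ffirst_vl_with_snz([0, 0, 1, 0], [1, 0, 1, 1], snz=1))
```

Without the early `return` (as in simple mode) the substituted bit would never be consumed, which is the point being made about SNZ needing a test-and-stop.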
(In reply to Luke Kenneth Casson Leighton from comment #12)
> look again at the table. SNZ is not present.

This table:

(In reply to Luke Kenneth Casson Leighton from comment #2)
> |6 | 7 |19:20|21 | 22:23   | description      |
> |--|---|-----|---|---------|------------------|
> |RG|SNZ|0 0  |/  | dz sz   | simple mode      |
> |RG|SNZ|1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
> |RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (implicit zz=1) |
> |RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result)|
(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > look again at the table. SNZ is not present.
>
> This table:

current spec
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/cr_ops.mdwn;h=d4e81fc46e32cde93ec92c0d764ee16706b4fea9;hb=a3f5eea083533cb3828f0d2a4375c49e2a7b30f1#l79

SNZ cannot be added to simple or mapreduce as it is meaningless.

> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > |6 | 7 |19:20|21 | 22:23   | description      |
> > |--|---|-----|---|---------|------------------|
> > |RG|SNZ|0 0  |/  | dz sz   | simple mode      |
> > |RG|SNZ|1 0  |/  | dz sz   | scalar reduce mode (mapreduce) |
> > |RG|SNZ|VLI 1|inv| CR-bit  | Ffirst 3-bit mode (implicit zz=1) |
> > |RG|SNZ|VLI 1|inv| dz sz   | Ffirst 5-bit mode (implies CR-bit from result)|

comment #2, that's me... ah my mistake, i am having difficulty recalling the spec, it has been too many months since i looked at this and my knowledge is no longer in "electrical neuron memory", you'll need to be patient, it will take a couple of days of thinking re-reading and conversation for everything to come back.
ok i have edited comment #0 to put current and proposed spec change, must go over implications for HF and VF. thought experiment necessary. some simple examples. jacob can you illustrate what c code you'd like to use sv.cmpi/ff/rg with? i can kinda grok it but for due diligence it is good to spell out (and will make the basis of an illustrative unit test)
(In reply to Jacob Lifshay from comment #3)
> Another thing I realized while working on divmod, it would be really handy
> if stuff with a scalar destination would run through all elements instead of
> stopping at the first, I didn't check if we do that. This is useful for not
> overwriting all the CRs that you could be storing some other variable in.
>
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i
>
> note how only CR0 is ever written and yet the whole vector loop is run

yes, i just discovered the exact same thing is needed for Fortran MAXLOC (https://bugs.libre-soc.org/show_bug.cgi?id=676#c26)

there it is necessary to use RT=RA=scalar on sv.maxs./ff=lt to get it to "track and return" the largest number *in a scalar* of the vector of source numbers.

this implies *changing normal mode as well*
Summary: Data-dependent fail-on-first: learnt the concept using the strncpy example.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py

Gained an understanding of condition registers (CRs) and the compare instructions.

strncpy copies the contents of one string to another. Once a null character is found, the remaining bytes of the destination are padded with further null characters. In the example, the fail-first instruction truncates VL at the first zero byte, so that storing VL bytes does not overrun memory; the r12 address is then updated. To pad the remaining bytes with zeroes, CTR is decreased by the vector length on each pass and the loop stops when CTR reaches 0.
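The strncpy flow summarised above can be modelled in a few lines of Python. This is a simplified sketch and not the assembly from the unit test: the name `strncpy_ffirst`, the `maxvl` parameter and the use of Python byte strings in place of registers and memory are assumptions, and whether the terminating zero itself is counted in the truncated VL (the VLI option) is not modelled. It assumes `src` either contains a zero byte or is at least `n` bytes long.

```python
def strncpy_ffirst(src, n, maxvl=4):
    """Copy up to n bytes from src, zero-padding the rest (hypothetical model)."""
    dst = bytearray()
    pos = 0          # stands in for the updated source/destination address
    ctr = n          # stands in for CTR: bytes still to be written
    found_zero = False
    while ctr > 0 and not found_zero:
        vl = min(maxvl, ctr)
        chunk = src[pos:pos + vl]
        # fail-first compare: truncate VL at the first zero byte
        for i, byte in enumerate(chunk):
            if byte == 0:
                vl = i
                found_zero = True
                break
        dst += chunk[:vl]    # store only VL bytes, so no overrun
        pos += vl
        ctr -= vl            # decrease CTR by the (truncated) vector length
    while ctr > 0:           # pad with zero bytes until CTR reaches 0
        vl = min(maxvl, ctr)
        dst += bytes(vl)
        ctr -= vl
    return bytes(dst)


print(strncpy_ffirst(b"hello\0junk", 8))
```

The copy loop and the padding loop share the same counter, which is what lets a single CTR bound both phases.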
(In reply to Jacob Lifshay from comment #3)
> e.g.:
> sv.cmpi/ff=lt 0, 1, *10, 5
> is:
> i = 0
> while i < VL:
>     CR0 = cmpd(gpr[10 + i], 5)
>     if CR0.lt:
>         break
>     i += 1
> VL = i
>
> note how only CR0 is ever written and yet the whole vector loop is run

The code added below is in contrast to the one above, as it takes the vector *0 as the destination in place of the scalar 0.

sv.cmpi/ff=lt *0, 1, *10, 5
is:
i = 0
while i < VL:
    CR[i] = cmpd(gpr[10 + i], 5)
    if CR[i].lt:
        break
    i += 1
VL = i
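The two loops can be run side by side with a small Python model. This is a sketch for illustration, not ISACaller: `cmpd` here is a stand-in that returns a dict of flags, and the function name `sv_cmpi_ff_lt` and its `vector_dest` argument are hypothetical.

```python
def cmpd(a, b):
    """Stand-in for the compare: returns the lt/gt/eq flags of one CR field."""
    return {"lt": a < b, "gt": a > b, "eq": a == b}


def sv_cmpi_ff_lt(gpr, base, imm, vl, vector_dest):
    """Model of sv.cmpi/ff=lt with a scalar (CR0) or vector (CR[i]) destination.

    Returns (new_vl, cr) where cr maps CR field number to its flags.
    """
    cr = {}
    i = 0
    while i < vl:
        field = i if vector_dest else 0   # scalar destination always writes CR0
        cr[field] = cmpd(gpr[base + i], imm)
        if cr[field]["lt"]:
            break                         # fail-first: stop at the first lt
        i += 1
    return i, cr


gpr = [0] * 32
gpr[10:14] = [7, 9, 3, 8]
print(sv_cmpi_ff_lt(gpr, 10, 5, 4, vector_dest=False))
print(sv_cmpi_ff_lt(gpr, 10, 5, 4, vector_dest=True))
```

Both calls truncate VL to the same value; the difference is only in how many CR fields are written along the way.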
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b077590 ok this unit test now works, that unblocks both maxloc bug #676
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=ea249820c03 update spec "quirks" page to describe ddffirst has implicit mapreduce mode
The motivating example: https://bugs.libre-soc.org/show_bug.cgi?id=1044#c80