predication needs to be added to TestIssuer, both single and twin * INT-based single: TODO * CR-based single: TODO * INT-based twin: TODO * CR-based twin: TODO * Zeroing single: TODO - bug #636 * Zeroing twin: TODO - bug #636 cesar it is perfectly reasonable to assume that the whole of the predicate mask is read *in advance* of its actual use. so a FSM which reads individual CR bits and extracts the required field but takes dozens of cycles to do so is perfectly fine. i have updated the spec to say that if instructions are issued which partially overwrite CRs that should be in use as predicates, this is "undefined" behaviour.
Cesar i've added a couple of extra ports so as to be able to read the predicates. INT will be easier, it can be done one-shot. CR is different, a little more complex. diff --git a/src/soc/regfile/regfiles.py b/src/soc/regfile/regfiles.py index aec56e44..2512d3ae 100644 --- a/src/soc/regfile/regfiles.py +++ b/src/soc/regfile/regfiles.py @@ -77,6 +77,7 @@ class IntRegs(RegFileMem): #class IntRegs(RegFileArray): self.r_ports = {'ra': self.read_port("src1"), 'rb': self.read_port("src2"), 'rc': self.read_port("src3"), + 'pred': self.read_port("pred"), # for predicate mask 'dmi': self.read_port("dmi")} # needed for Debug (DMI) @@ -129,6 +130,7 @@ class CRRegs(VirtualRegPort): 'cr_b': self.write_port("dest2")} # 4-bit, unary-indexed self.r_ports = {'full_cr': self.full_rd, # 32-bit (masked, 8-en lines) 'full_cr_dbg': self.full_rd2, # for DMI + 'cr_pred': self.read_port("cr_pred"), # for predicate 'cr_a': self.read_port("src1"), 'cr_b': self.read_port("src2"), 'cr_c': self.read_port("src3")}
Cesar when it comes to it, here's how to skip over src and dest elements that have predication bits zero: while (((1<<srcstep) & srcmask) == 0) and (srcstep != vl): print (" skip", bin(1<<srcstep)) srcstep += 1 # same for dststep while (((1<<dststep) & dstmask) == 0) and (dststep != vl): print (" skip", bin(1<<dststep)) dststep += 1 it's actually very simple, hilarious thing is, this is pretty much the exact same behaviour as "set before first" predicate mask instruction. so that would go in the... err... issue FSM, before triggering execute. the only thing to watch out for is that this changes the srcstep and dststep used in PowerDecoder2 (svstate) so you have to give PowerDecoder2 a chance to settle after skipping srcstep and dststep https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;h=23c1c5aeaed589bbed957d4ea2d86153295ada19;hb=HEAD#l946
so, just so you know, the final pseudo-code will be something like this: if not src_zeroing: while (((1<<srcstep) & srcmask) == 0) and (srcstep != vl): print (" skip", bin(1<<srcstep)) srcstep += 1 if not dest_zeroing: # same for dststep while (((1<<dststep) & dstmask) == 0) and (dststep != vl): print (" skip", bin(1<<dststep)) dststep += 1 if src_zeroing and ((1<<srcstep) & srcmask) == 0: RA = 0 RB = 0 else: RA = get_register_RA RB = get_register_RB result, Condition_Register = calc_operation(RA, RB) if dest_zeroing and ((1<<dststep) & srcmask) == 0): result = 0 Condition_Register = EQzero store_result(result) if Rc=1: store_cr(Condition_Register) so the first phase is to add src-step and dst-step "skipping", based on which bits of the predicate mask(s) are zero the second phase adds in src_zeroing / dest_zeroing which does NOT skip the computation but instead feeds **ZEROs** into either the input or the output i leave it up to you whether you would prefer to implement the whole FSM there, i will "catch up" adding zeroing unit tests.
added a stub here: def fetch_predicate_fsm(self, m, core, TODO): """fetch_predicate_fsm - obtains (constructs in the case of CR) src/dest predicate masks
here's how CR predication is read in ISACaller, it is done as a for-loop from 0 to VL-1, *in advance* of execution. https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;hb=74e1cb79e40d1a7c66f7ad0b942cbe654859eeda#l250
(In reply to Luke Kenneth Casson Leighton from comment #3) > so the first phase is to add src-step and dst-step "skipping", based > on which bits of the predicate mask(s) are zero this bit is fairly easy and involves TestIssuer FSM only. > the second phase adds in src_zeroing / dest_zeroing which does NOT > skip the computation but instead feeds **ZEROs** into either the > input or the output this one is very intrusive into the Core class because it stops read and write of registers. in ISACaller it is relatively straightforward to do this, but hooking into Core connections to the regfiles requires masking and telling the CompUnits they do not need to read/write regfiles.
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;h=b52f88c5db016b57ca2aaf070c31e48d58d135e2;hb=405f5eee9d01ab896d4d85df7df4921d3a4a422b#l963 zeroing detected here for this src/dst step, by ANDing the mask bit.
cesar, my thoughts are that running the pipelines with zero inputs is not useful, and that instead this is more useful and less... puzzling: if (src_zeroing and ((1<<srcstep) & srcmask) == 0) or (dest_zeroing and ((1<<dststep) & srcmask) == 0)): result = 0 Condition_Register = EQzero else: RA = get_register_RA RB = get_register_RB result, Condition_Register = calc_operation(RA, RB) http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-March/002226.html
commit 49b01d0c4382c4035fa7274d5051a6e36a85a2a0 (HEAD -> master) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Sun Mar 21 13:25:26 2021 +0000 code comments in TestIssuer
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=b63f2dcfaac1a7527a0007950913dffb30dc99e2 whoops i forgot to say that when get_intpred returns zero in the reg number it means "use always" i.e. 0b1111111111...
cesar, it looks like there needs to be an extra state which goes here: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/issuer.py;hb=HEAD#l573 or, perhaps, inside the predicate fsm, advances srcstep/dststep there, before indicating that execute is to proceed? i can see otherwise that quite a lot of wait states would be introduced.
(In reply to Luke Kenneth Casson Leighton from comment #11) > or, perhaps, inside the predicate fsm, advances srcstep/dststep there, > before indicating that execute is to proceed? nope, because that is once per instruction, and the skipping needs to be per VL loop
for skipping it is probably simpler and shorter to use this: * already_done = (1<<srcstep) - 1 # zero on start * temp = predicatemask | already_done * startfrom_srcstep = cntlzero(temp) also it starts from srcstep so is re-entrant for return from interrupts. PriorityEncoder should give the right count. a clock worth of re-decoding for PowerDecoder2 will still be needed.
INT and CR predication work! (without zeroing mode) All current predication tests pass! DONE: * new FSM for predicate fetch * INT predicate fetch * add src/dest step VL-loop on Issue FSM, skipping consecutive zeros in the mask * synchronize TestIssuer and ISACaller on vector ops * CR predicate fetch * shifting-out skipped masks bits for both CR and INT pred, for reentrancy * a few MSB0 fixes along the way TODO: * Zeroing single * Zeroing twin
(In reply to Cesar Strauss from comment #14) > INT and CR predication work! (without zeroing mode) > All current predication tests pass! ha, fantastic :) that's a big deal. zeroing will need some changes to the ALUs, first. to test exceptions, realistically LDST first needs exceptions. there are other ways (DEC, TB)
(In reply to Cesar Strauss from comment #14) > TODO: > * Zeroing single > * Zeroing twin ok i've added the basics, it's a bit of a hack, going into soc/fu/common_output_stage.py and common_input_stage.py with a single bit of the predicate mask invert-and-ANDed with the pred-dz or pred-sz mode field, and passed through to all Satellite Decoders (PowerSubsetDecoder) in core. here's the code-chain with line numbers (or diffs): https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=72525d32c291de7e5e8e65af03181257d33d73a0 https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fceb766a07dc1272d6a03fd024d07f44258084ef https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=add5a87a46cb4699f6d812a9b9f462dbc7184f86 https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=5e26daa4c8d0e3a31ac5edb7def49418a46c1537 https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=8a319f43c4cb812049e062061b92949b42a57399 https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=095318372c9d0d1fc7817e1b905556066f406a35 https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=60f562d96fa0d09e5f31f9ae3aa959ea2e9e6e9a the chain is: * Input Records all have (new) SVP64 fields (sv_pred_sz, sv_saturate) * all Satellite PowerSubsetDecoders have these new fields * pipelines receive that in the ctx.op so can make "decisions" * relevant predicate mask bits are passed through to pipelines as well