Bug 617 - add SVP64 predication to TestIssuer
Summary: add SVP64 predication to TestIssuer
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Cesar Strauss
URL: https://libre-soc.org/openpower/sv/im...
Depends on: 588
Blocks:
  Show dependency treegraph
 
Reported: 2021-03-17 13:20 GMT by Luke Kenneth Casson Leighton
Modified: 2021-05-06 18:57 BST (History)
1 user (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2021-03-17 13:20:07 GMT
predication needs to be added to TestIssuer, both single and twin

* INT-based single: TODO
* CR-based single:  TODO
* INT-based twin:   TODO
* CR-based twin:    TODO
* Zeroing single:   TODO - bug #636
* Zeroing twin:     TODO - bug #636


cesar it is perfectly reasonable to assume that the whole of the predicate
mask is read *in advance* of its actual use.  so a FSM which reads individual
CR bits and extracts the required field but takes dozens of cycles to do
so is perfectly fine.

i have updated the spec to say that if instructions are issued which partially
overwrite CRs that should be in use as predicates, this is "undefined"
behaviour.
Comment 1 Luke Kenneth Casson Leighton 2021-03-17 13:29:41 GMT
Cesar i've added a couple of extra ports so as to be able to read the
predicates.  INT will be easier, it can be done one-shot.  CR is
different, a little more complex.

diff --git a/src/soc/regfile/regfiles.py b/src/soc/regfile/regfiles.py
index aec56e44..2512d3ae 100644
--- a/src/soc/regfile/regfiles.py
+++ b/src/soc/regfile/regfiles.py
@@ -77,6 +77,7 @@ class IntRegs(RegFileMem): #class IntRegs(RegFileArray):
         self.r_ports = {'ra': self.read_port("src1"),
                         'rb': self.read_port("src2"),
                         'rc': self.read_port("src3"),
+                        'pred': self.read_port("pred"), # for predicate mask
                         'dmi': self.read_port("dmi")} # needed for Debug (DMI)
 
 
@@ -129,6 +130,7 @@ class CRRegs(VirtualRegPort):
                         'cr_b': self.write_port("dest2")} # 4-bit, unary-indexed
         self.r_ports = {'full_cr': self.full_rd, # 32-bit (masked, 8-en lines)
                         'full_cr_dbg': self.full_rd2, # for DMI
+                        'cr_pred': self.read_port("cr_pred"), # for predicate
                         'cr_a': self.read_port("src1"),
                         'cr_b': self.read_port("src2"),
                         'cr_c': self.read_port("src3")}
Comment 2 Luke Kenneth Casson Leighton 2021-03-17 22:16:54 GMT
Cesar when it comes to it, here's how to skip over src and dest elements
that have predication bits zero:

            while (((1<<srcstep) & srcmask) == 0) and (srcstep != vl):
                print ("      skip", bin(1<<srcstep))
                srcstep += 1

            # same for dststep
            while (((1<<dststep) & dstmask) == 0) and (dststep != vl):
                print ("      skip", bin(1<<dststep))
                dststep += 1

it's actually very simple, hilarious thing is, this is pretty much the exact
same behaviour as "set before first" predicate mask instruction.

so that would go in the... err... issue FSM, before triggering execute.

the only thing to watch out for is that this changes the srcstep and
dststep used in PowerDecoder2 (svstate) so you have to give PowerDecoder2
a chance to settle after skipping srcstep and dststep

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;h=23c1c5aeaed589bbed957d4ea2d86153295ada19;hb=HEAD#l946
Comment 3 Luke Kenneth Casson Leighton 2021-03-18 14:26:46 GMT
so, just so you know, the final pseudo-code will be something like this:

       if not src_zeroing:
            while (((1<<srcstep) & srcmask) == 0) and (srcstep != vl):
                print ("      skip", bin(1<<srcstep))
                srcstep += 1

       if not dest_zeroing:
            # same for dststep
            while (((1<<dststep) & dstmask) == 0) and (dststep != vl):
                print ("      skip", bin(1<<dststep))
                dststep += 1

       if src_zeroing and ((1<<srcstep) & srcmask) == 0:
            RA = 0
            RB = 0
       else:
            RA = get_register_RA
            RB = get_register_RB

       result, Condition_Register = calc_operation(RA, RB)

       if dest_zeroing and ((1<<dststep) & srcmask) == 0):
            result = 0
            Condition_Register = EQzero

       store_result(result)
       if Rc=1: store_cr(Condition_Register)


so the first phase is to add src-step and dst-step "skipping", based
on which bits of the predicate mask(s) are zero

the second phase adds in src_zeroing / dest_zeroing which does NOT
skip the computation but instead feeds **ZEROs** into either the
input or the output

i leave it up to you whether you would prefer to implement the whole
FSM there, i will "catch up" adding zeroing unit tests.
Comment 4 Luke Kenneth Casson Leighton 2021-03-18 19:16:08 GMT
added a stub here:

    def fetch_predicate_fsm(self, m, core, TODO):
        """fetch_predicate_fsm - obtains (constructs in the case of CR)
           src/dest predicate masks
Comment 5 Luke Kenneth Casson Leighton 2021-03-18 21:21:43 GMT
here's how CR predication is read in ISACaller, it is done as a for-loop
from 0 to VL-1, *in advance* of execution.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;hb=74e1cb79e40d1a7c66f7ad0b942cbe654859eeda#l250
Comment 6 Luke Kenneth Casson Leighton 2021-03-20 11:56:55 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)

> so the first phase is to add src-step and dst-step "skipping", based
> on which bits of the predicate mask(s) are zero

this bit is fairly easy and involves TestIssuer FSM only.

> the second phase adds in src_zeroing / dest_zeroing which does NOT
> skip the computation but instead feeds **ZEROs** into either the
> input or the output

this one is very intrusive into the Core class because it stops read and write of registers.

in ISACaller it is relatively straightforward to do this, but hooking into Core connections to the regfiles requires masking and telling the CompUnits they do not need to read/write regfiles.
Comment 7 Luke Kenneth Casson Leighton 2021-03-20 12:06:04 GMT
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/isa/caller.py;h=b52f88c5db016b57ca2aaf070c31e48d58d135e2;hb=405f5eee9d01ab896d4d85df7df4921d3a4a422b#l963

zeroing detected here for this src/dst step, by ANDing the mask bit.
Comment 8 Luke Kenneth Casson Leighton 2021-03-21 13:12:58 GMT
cesar, my thoughts are that running the pipelines with zero inputs is not
useful, and that instead this is more useful and less... puzzling:

       if (src_zeroing and ((1<<srcstep) & srcmask) == 0) or
          (dest_zeroing and ((1<<dststep) & srcmask) == 0)):
            result = 0
            Condition_Register = EQzero
       else:
            RA = get_register_RA
            RB = get_register_RB
            result, Condition_Register = calc_operation(RA, RB)

http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-March/002226.html
Comment 9 Luke Kenneth Casson Leighton 2021-03-21 13:25:42 GMT
commit 49b01d0c4382c4035fa7274d5051a6e36a85a2a0 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun Mar 21 13:25:26 2021 +0000

    code comments in TestIssuer
Comment 10 Luke Kenneth Casson Leighton 2021-03-22 04:58:54 GMT
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=b63f2dcfaac1a7527a0007950913dffb30dc99e2

whoops i forgot to say that when get_intpred returns zero in the reg number it means "use always" i.e. 0b1111111111...
Comment 11 Luke Kenneth Casson Leighton 2021-03-22 22:20:54 GMT
cesar, it looks like there needs to be an extra state which goes here:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/issuer.py;hb=HEAD#l573

or, perhaps, inside the predicate fsm, advances srcstep/dststep there, before indicating that execute is to proceed?

i can see otherwise that quite a lot of wait states would be introduced.
Comment 12 Luke Kenneth Casson Leighton 2021-03-23 03:53:18 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)

> or, perhaps, inside the predicate fsm, advances srcstep/dststep there,
> before indicating that execute is to proceed?

nope, because that is once per instruction, and the skipping needs to be per VL loop
Comment 13 Luke Kenneth Casson Leighton 2021-03-24 05:12:32 GMT
for skipping it is probably simpler and shorter to use this:

* already_done = (1<<srcstep) - 1 # zero on start
* temp = predicatemask | already_done
* startfrom_srcstep = cntlzero(temp)

also it starts from srcstep so is re-entrant for return from interrupts.

PriorityEncoder should give the right count.  a clock worth of re-decoding for PowerDecoder2 will still be needed.
Comment 14 Cesar Strauss 2021-04-25 21:28:51 BST
INT and CR predication work! (without zeroing mode)
All current predication tests pass!

DONE:
* new FSM for predicate fetch
* INT predicate fetch
* add src/dest step VL-loop on Issue FSM, skipping consecutive zeros in the mask
* synchronize TestIssuer and ISACaller on vector ops
* CR predicate fetch
* shifting-out skipped masks bits for both CR and INT pred, for reentrancy
* a few MSB0 fixes along the way

TODO:
* Zeroing single
* Zeroing twin
Comment 15 Luke Kenneth Casson Leighton 2021-04-26 17:47:47 BST
(In reply to Cesar Strauss from comment #14)
> INT and CR predication work! (without zeroing mode)
> All current predication tests pass!

ha, fantastic :)  that's a big deal.

zeroing will need some changes to the ALUs, first.  to test exceptions, realistically LDST first needs exceptions.  there are other ways (DEC, TB)
Comment 16 Luke Kenneth Casson Leighton 2021-05-06 18:57:33 BST
(In reply to Cesar Strauss from comment #14)

> TODO:
> * Zeroing single
> * Zeroing twin

ok i've added the basics, it's a bit of a hack, going into
soc/fu/common_output_stage.py and common_input_stage.py
with a single bit of the predicate mask invert-and-ANDed
with the pred-dz or pred-sz mode field, and passed through
to all Satellite Decoders (PowerSubsetDecoder) in core.

here's the code-chain with line numbers (or diffs):

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=72525d32c291de7e5e8e65af03181257d33d73a0

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fceb766a07dc1272d6a03fd024d07f44258084ef

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=add5a87a46cb4699f6d812a9b9f462dbc7184f86

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=5e26daa4c8d0e3a31ac5edb7def49418a46c1537

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=8a319f43c4cb812049e062061b92949b42a57399

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=095318372c9d0d1fc7817e1b905556066f406a35

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=60f562d96fa0d09e5f31f9ae3aa959ea2e9e6e9a

the chain is:

* Input Records all have (new) SVP64 fields (sv_pred_sz, sv_saturate)
* all Satellite PowerSubsetDecoders have these new fields
* pipelines receive that in the ctx.op so can make "decisions"
* relevant predicate mask bits are passed through to pipelines as well