Bug 1023 - crternlut/crbinlut analysis needed of CR regfile usage
Summary: crternlut/crbinlut analysis needed of CR regfile usage
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/rf...
Depends on:
Blocks: 1017
Reported: 2023-03-14 15:03 GMT by Luke Kenneth Casson Leighton
Modified: 2023-07-14 01:37 BST
CC List: 2 users

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 1500
budget (EUR) for this task, excluding subtasks' budget: 1500
parent task for budget allocation: 1011
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
[jacob]
amount = 300
submitted = 2023-06-28
paid = 2023-07-12

[lkcl]
amount = 1200
submitted = 2023-06-22


Description Luke Kenneth Casson Leighton 2023-03-14 15:03:53 GMT
the number of CR operands, if Hazard protection is done at the granularity
of 4-bit CR Fields, comes out as 4-in 1-out. this is too much. however if
the CR is considered as a single 32-bit register then both instructions are
considered 1-in 1-out.
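A minimal sketch of that operand counting. It assumes, as the later comments describe, that the old crternlogi reads BA, BB and BC and also reads BT, because msk makes the write to BT a Read-Modify-Write. The helper name `count_ports` is hypothetical.

```python
def count_ports(reads, writes, granularity):
    """Count distinct register reads/writes seen by hazard tracking.

    reads/writes are CR Field numbers (0-7 for scalar code).
    granularity="field": each 4-bit CR Field is a separate register.
    granularity="cr32":  all 8 CR Fields form one 32-bit CR register.
    """
    if granularity not in ("field", "cr32"):
        raise ValueError(f"unknown granularity: {granularity}")

    def reg(field):
        return field if granularity == "field" else field // 8

    return len({reg(f) for f in reads}), len({reg(f) for f in writes})


# old crternlogi BT, BA, BB, BC with msk: BT is also read (Read-Modify-Write)
BT, BA, BB, BC = 0, 1, 2, 3
print(count_ports([BA, BB, BC, BT], [BT], "field"))  # (4, 1)
print(count_ports([BA, BB, BC, BT], [BT], "cr32"))   # (1, 1)
```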

the complication comes for SVP64. some discussion and analysis of
architectural options is needed.
Comment 1 Luke Kenneth Casson Leighton 2023-03-14 15:29:59 GMT
for scalar use assuming that there is only one 32-bit CR, this is
pretty standard fare already: mtcr, mfcr. thus, for scalar i think
there will be no problem.

in TestIssuer the CR regfile is extremely weird: it mostly has 4-bit ports,
but these are then combined using *unary masking* to create at least two
full 32-bit-wide ports that take read-enable (write-enable) on a
per-4-bit-field basis.
mfxcr is therefore a perfect match for this design: just pass the
incoming mask immediate directly to the regfile.
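A behavioural sketch (not the TestIssuer HDL) of a 32-bit CR port with per-CR-Field enables driven by an 8-bit mask. It assumes the usual Power ISA convention that mask bit 0 (the MSB) selects CR Field 0, which is the most-significant nibble of the 32-bit CR. `field_enable_bits`, `read_cr_masked` and `write_cr_masked` are hypothetical names.

```python
def field_enable_bits(mask8):
    """Expand an 8-bit field mask into a 32-bit bit-mask (4 bits per field).

    Mask bit 0 (MSB, 0x80) selects CR Field 0, the top nibble of the CR.
    """
    bits = 0
    for field in range(8):
        if mask8 & (0x80 >> field):
            bits |= 0xF << (28 - 4 * field)
    return bits


def read_cr_masked(cr32, mask8):
    # disabled fields read as zero
    return cr32 & field_enable_bits(mask8)


def write_cr_masked(cr32, value32, mask8):
    # only enabled fields are updated, the rest keep their old contents
    en = field_enable_bits(mask8)
    return (cr32 & ~en & 0xFFFFFFFF) | (value32 & en)


print(hex(read_cr_masked(0x12345678, 0b10000001)))               # 0x10000008
print(hex(write_cr_masked(0x12345678, 0xFFFFFFFF, 0b01000000)))  # 0x1f345678
```

The mask goes straight to the port enables with no decoding beyond the 1-bit-to-4-bit fan-out, which is the property that makes an immediate-mask instruction map so directly onto this regfile.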

SVP64 gets complicated because the additional range (128 CR Fields)
means that there would be QTY 16 of 32-bit CRs (or, implementors
could choose to make that QTY 8 of 64-bit CRs).

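thus unless the programmer happens to issue an instruction that
reads (then writes) to/from the same group of 8 CR Fields, the operands
are spread across more than one 32-bit CR and the scalar 1-in 1-out
argument no longer holds.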
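A sketch of the SVP64 case, assuming CR Field n (0-127) lives in 32-bit CR number n // 8 (QTY 16 of 32-bit CRs); with QTY 8 of 64-bit CRs the divisor becomes 16. `crs_touched` is a hypothetical helper showing that a single CR read only happens when every operand falls in the same group.

```python
def crs_touched(fields, fields_per_cr=8):
    """Return the set of CR registers holding the given CR Fields.

    fields_per_cr=8 models QTY 16 of 32-bit CRs (128 CR Fields total);
    fields_per_cr=16 models QTY 8 of 64-bit CRs.
    """
    for f in fields:
        if not 0 <= f < 128:
            raise ValueError(f"CR Field {f} out of range")
    return {f // fields_per_cr for f in fields}


# all operands within CR Fields 8-15: a single 32-bit CR is read
print(sorted(crs_touched([8, 9, 15])))        # [1]
# operands spread across groups: three separate 32-bit CR reads
print(sorted(crs_touched([3, 40, 100])))      # [0, 5, 12]
# the same operands with 64-bit CRs
print(sorted(crs_touched([3, 40, 100], 16)))  # [0, 2, 6]
```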

crternluti can be reduced down to 3-in 1-out by making it an
"overwrite" (BF becomes both a source and the destination). likewise crbinlut.
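A minimal behavioural sketch of the "overwrite" idea. BF has to be read anyway, because msk makes it a Read-Modify-Write, so it is reused as one of the three LUT inputs and then written back. That leaves only three CR Fields read. The LUT index order (BF bit, BFA bit, BFB bit) and the TLI bit numbering are assumptions for illustration, not copied from the spec pseudocode.

```python
def crternlogi(cr_fields, bf, bfa, bfb, tli, msk):
    """Return a new list of 4-bit CR Fields after an overwrite-style crternlogi.

    Reads three CR Fields (BF, BFA, BFB) and writes one (BF).
    Bit i of a CR Field or of msk is numbered MSB0, i.e. LSB0 bit 3 - i.
    """
    a, b, c = cr_fields[bf], cr_fields[bfa], cr_fields[bfb]
    result = a
    for i in range(4):
        shift = 3 - i
        if not (msk >> shift) & 1:
            continue  # masked-out bits keep BF's old value
        idx = ((((a >> shift) & 1) << 2)
               | (((b >> shift) & 1) << 1)
               | ((c >> shift) & 1))
        # assumption: TLI[7 - idx] in MSB0, which is LSB0 bit idx
        lut_bit = (tli >> idx) & 1
        result = (result & ~(1 << shift)) | (lut_bit << shift)
    out = list(cr_fields)
    out[bf] = result & 0xF
    return out


# TLI=0x96 is the 3-input XOR truth table
cr = [0b1100, 0b1010, 0b0011, 0b0000]
print(bin(crternlogi(cr, 0, 1, 2, 0x96, 0b1111)[0]))  # 0b101
print(bin(crternlogi(cr, 0, 1, 2, 0x96, 0b1000)[0]))  # 0b100
```

XOR is symmetric in its inputs, so this particular example gives the same answer whatever the true index order turns out to be.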

this does have the significant advantage of greatly reducing opcode
space, as well.

i will see if there is an existing Form.  drat i should have spotted
this 6+ months ago.
Comment 2 Luke Kenneth Casson Leighton 2023-03-14 16:40:01 GMT
* `crternlogi BT, BA, BB, BC, TLI, msk`

| 0-5| 6-8 | 9-11 | 12-14 | 15-17 | 18-20 | 21-28 | 29-30 | 31  | Form     |
|----|-----|------|-------|-------|-------|-------|-------|-----|----------|
| PO | BF  | BFA  | BFB   | BFC   | msk   |  TLI  |  XO   | msk | CRB-Form |

looking at
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;hb=HEAD

->

* `crternlogi BT, BA, BB, TLI, msk`

| 0-5| 6-8 | 9-10 | 11-13 | 14-15 | 16-18 | 19-25 | 26-30 | 31  | Form     |
|----|-----|------|-------|-------|-------|-------|-------|-----|----------|
| PO | BF  | msk  | BFA   | msk   |  BFB  | TLI   |  XO   | TLI | CRB-Form |

This is messy, but 7 bits of TLI match with TLI-Form; BF, BFA and BFB
match with X-Form, with the intervening 2 bits allocated to msk; and by
starting at bit 26 the XO is the same length as in A-Form, VA-Form,
VA2-Form and most of the SV*-Forms.
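The bit allocation can be sanity-checked mechanically. This is a small sketch with a hypothetical helper `check_layout`. It confirms that the proposed crternlogi CRB-Form covers all 32 bits exactly once, giving msk 4 bits and TLI 8 bits in total.

```python
# (name, first_bit, last_bit) in MSB0 numbering, taken from the proposed table
CRB_FORM_CRTERNLOGI = [
    ("PO", 0, 5), ("BF", 6, 8), ("msk", 9, 10), ("BFA", 11, 13),
    ("msk", 14, 15), ("BFB", 16, 18), ("TLI", 19, 25), ("XO", 26, 30),
    ("TLI", 31, 31),
]


def check_layout(fields, width=32):
    """Fail on overlapping or unallocated bits; return total bits per name."""
    owner = [None] * width
    for name, lo, hi in fields:
        for bit in range(lo, hi + 1):
            if owner[bit] is not None:
                raise ValueError(f"bit {bit} claimed by {owner[bit]} and {name}")
            owner[bit] = name
    gaps = [bit for bit, name in enumerate(owner) if name is None]
    if gaps:
        raise ValueError(f"unallocated bits: {gaps}")
    totals = {}
    for name in owner:
        totals[name] = totals.get(name, 0) + 1
    return totals


print(check_layout(CRB_FORM_CRTERNLOGI))
# {'PO': 6, 'BF': 3, 'msk': 4, 'BFA': 3, 'BFB': 3, 'TLI': 8, 'XO': 5}
```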
Comment 3 Luke Kenneth Casson Leighton 2023-03-14 16:49:17 GMT
crbinlog

| 0-5| 6-8 | 9-11|12-14|15-17|18-21| 22-30    |31|
| -- | -- | --- | --- | --- |-----| -------- |--|
| NN | BT | BA  | BB  | BC  |m0-m3|000101110 |0 |

->

same trick, except i think it is possible to just leave out the TLI Field


| 0-5| 6-8 | 9-10 | 11-13 | 14-15 | 16-18 | 19-25 |26-30 | 31 | Form     |
|----|-----|------|-------|-------|-------|-------|------|----|----------|
| PO | BF  | msk  | BFA   | msk   |  BFB  | ///   |  XO  | /  | CRB-Form |

this means decoding both instructions is not costly.
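To illustrate why decoding is cheap: in both proposed layouts BF, msk, BFA, BFB and XO sit at identical bit positions, so one extraction routine serves both instructions. Only bits 19-25 and 31 differ (TLI for crternlogi, reserved for crbinlog). This is a sketch in MSB0 numbering. The PO and XO values used below are placeholders, not allocated opcodes, and the helper names are hypothetical.

```python
def bits(insn, lo, hi):
    """Extract MSB0 bits lo..hi (inclusive) of a 32-bit instruction word."""
    width = hi - lo + 1
    return (insn >> (31 - hi)) & ((1 << width) - 1)


def decode_crb_common(insn):
    """Fields at identical positions in both proposed CRB-Form layouts."""
    return {
        "PO": bits(insn, 0, 5),
        "BF": bits(insn, 6, 8),
        "BFA": bits(insn, 11, 13),
        "BFB": bits(insn, 16, 18),
        # msk is split across bits 9-10 (upper half) and 14-15 (lower half)
        "msk": (bits(insn, 9, 10) << 2) | bits(insn, 14, 15),
        "XO": bits(insn, 26, 30),
    }


def place(value, lo, hi):
    """Inverse of bits(): position value at MSB0 bits lo..hi (for testing)."""
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value << (31 - hi)


# PO=22 and XO=8 are placeholders, not allocated opcodes
insn = (place(22, 0, 5) | place(3, 6, 8) | place(0b10, 9, 10)
        | place(5, 11, 13) | place(0b01, 14, 15) | place(7, 16, 18)
        | place(8, 26, 30))
print(decode_crb_common(insn))
# {'PO': 22, 'BF': 3, 'BFA': 5, 'BFB': 7, 'msk': 9, 'XO': 8}
```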
Comment 4 Luke Kenneth Casson Leighton 2023-03-14 16:54:41 GMT
 911     TLI (21:28)
 912          Field used by the ternlogi instruction as the
 913          look-up table.
 914          Formats: TLI

needed:

 911     TLI (21:25,19,20,31)
 912          Field used by the crternlogi instruction as the
 913          look-up table.
 914          Formats: CRB-Form

and msk is missing:

msk (9:10,14:15)
    Field used by crternlogi and crbinlut to select which bits of
    CR Field BF are to be modified.
    Formats: CRB-Form
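The split-field notation in fields.text, e.g. TLI (21:25,19,20,31), is read here as "concatenate these MSB0 bit ranges in the order listed, the first being most significant". This is a sketch of that interpretation. `split_field` is a hypothetical helper, not the openpower-isa decoder.

```python
def split_field(insn, ranges):
    """Concatenate MSB0 bit ranges of a 32-bit word, first range most significant.

    ranges is a list of (lo, hi) pairs mirroring the fields.text notation,
    e.g. TLI (21:25,19,20,31) -> [(21, 25), (19, 19), (20, 20), (31, 31)].
    """
    value = 0
    for lo, hi in ranges:
        width = hi - lo + 1
        value = (value << width) | ((insn >> (31 - hi)) & ((1 << width) - 1))
    return value


TLI_RANGES = [(21, 25), (19, 19), (20, 20), (31, 31)]
MSK_RANGES = [(9, 10), (14, 15)]

# worked example: scatter TLI = 0x96 into bits 21-25, 19-20 and 31, then recover it
tli = 0x96
insn = ((tli >> 3) << (31 - 25)) | (((tli >> 1) & 0b11) << (31 - 20)) | (tli & 1)
print(hex(split_field(insn, TLI_RANGES)))  # 0x96
```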
Comment 5 Jacob Lifshay 2023-03-14 22:36:19 GMT
imho we should copy TLI-form:
|0   |6   |9  |11   |14 |16      |21   |29  |31 |
| PO | BF |msk| BFA |msk| RB/BFB | TLI | XO |Rc |

except we don't need Rc, so:

|0   |6   |9  |11   |14 |16      |21   |28  |31 |
| PO | BF |msk| BFA |msk| RB/BFB | TLI | XO |TLI|

RB/BFB is the LUT.
Comment 6 Jacob Lifshay 2023-03-14 22:40:05 GMT
(In reply to Jacob Lifshay from comment #5)
> imho we should copy TLI-form:
> |0   |6   |9  |11   |14 |16      |21   |29  |31 |
> | PO | BF |msk| BFA |msk| RB/BFB | TLI | XO |Rc |
> 
> except we don't need Rc, so:
> 
> |0   |6   |9  |11   |14 |16      |21   |28  |31 |
> | PO | BF |msk| BFA |msk| RB/BFB | TLI | XO |TLI|
> 
> RB/BFB is the LUT.

oh, actually we only ever need BFB for crternlogi, so we get two more bits of XO by moving the TLI bits over further.

crbinlog is what needs RB/BFB; there we can use almost the entire space used by TLI for XO, and just need one bit for nh, which should match binlog's nh.
Comment 7 Jacob Lifshay 2023-03-14 22:42:10 GMT
so, yeah, more or less what you came up with, just more space for RB and nh
Comment 8 Luke Kenneth Casson Leighton 2023-03-15 15:07:05 GMT
(In reply to Jacob Lifshay from comment #6)

> crbinlog is what needs RB/BFB where we can use almost the entire space used
> by TLI for XO, just need one bit for nh, which should match binlog's nh.

i already explained in detail that the answer is no on using the GPR
for crbinlog: https://bugs.libre-soc.org/show_bug.cgi?id=1017#c18
Comment 9 Luke Kenneth Casson Leighton 2023-03-15 15:10:09 GMT
commit 8e5326caf15e238494a20cb0937b0c41c0ea9d20 (HEAD -> master, origin/master, origin/HEAD)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Mar 15 15:00:29 2023 +0000

    rewrite crternlogi and crbinlog to match new format, required to
    reduce both instructions to 3-read 1-write.
    https://bugs.libre-soc.org/show_bug.cgi?id=1023#c2

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=8e5326caf15e238494a20cb0937b0c41c0ea9d20

--

commit aa9020ffed1996c7ab3526a5cfcf906ed61eeb04 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Mar 15 15:09:35 2023 +0000

    add CRB-Form fields for crternlogi and crbinlog, they are both now
    reduced to 3-in 1-out, both needing to become overwrites due to the
    mask field (msk) making BF a Read-Modify-Write
    https://bugs.libre-soc.org/show_bug.cgi?id=1023#c4

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=aa9020ffed1996c7ab3526a5cfcf906ed61eeb04
Comment 10 Luke Kenneth Casson Leighton 2023-03-22 12:27:00 GMT
the answer here remained a firm "NO" following a detailed and
unscheduled explanation of Dependency Matrices.

the premise that grouping similar instructions together is ok because they
already have GPR and CR paths from regfiles was demonstrated to be false.


grouping of instructions behind shared Reservation Stations into the
same pipelines requires the DM Cells to contain the *union* of the Register
Profiles for those pipelines. thus whilst it is fine to group ALU and
Logical together because they both share XER CR0 RT RS RA and RB,
it is *not* fine to group madd with that same group, because RC increases
the size of *all* the DM Cells by 15% just to serve that one extra
register.

thus grouping CR Ops with these instructions similarly increases the
total number of registers handled by every DM Cell in front of those
RSes.
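A toy model of the argument. It assumes each DM Cell must track one entry per distinct register slot in the union of the register profiles of the pipelines sharing its Reservation Stations. It counts register slots only and is not a gate-count model, so it does not reproduce the 15% figure. The profiles, in particular madd and the hypothetical "CR" group, are simplified assumptions.

```python
# Simplified register profiles per pipeline (illustrative assumptions)
PROFILES = {
    "ALU": {"XER", "CR0", "RT", "RS", "RA", "RB"},
    "Logical": {"XER", "CR0", "RT", "RS", "RA", "RB"},
    "madd": {"RT", "RA", "RB", "RC"},
    "CR": {"BF", "BFA", "BFB"},
}


def dm_cell_profile(pipelines):
    """Union of register profiles every DM Cell must track when these
    pipelines share Reservation Stations."""
    union = set()
    for name in pipelines:
        union |= PROFILES[name]
    return union


for group in (["ALU", "Logical"], ["ALU", "Logical", "madd"],
              ["ALU", "Logical", "CR"]):
    print(group, len(dm_cell_profile(group)))
# ['ALU', 'Logical'] 6
# ['ALU', 'Logical', 'madd'] 7
# ['ALU', 'Logical', 'CR'] 9
```

Grouping ALU with Logical adds nothing, because the profiles are identical. Each extra pipeline with a different profile widens every cell in the shared group, not just the cells that serve that pipeline.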

this is why the CDC6600 and the 68000 have 3 separate register files with
very little crossover instructions as it keeps the DMs lean and sparse,
right where they are critically important to keep gate count down.

jacob was unaware of all of this and it had to be explained under duress.