664 – design SVP64 branch instructions (sv.bc)

Bug 664 - design SVP64 branch instructions (sv.bc)

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:	https://libre-soc.org/openpower/sv/br...

Depends on:
Blocks:	213

Reported:	2021-08-01 19:20 BST by Luke Kenneth Casson Leighton
Modified:	2023-11-27 14:11 GMT
CC List:	2 users

See Also:	687 550 1215
NLnet milestone:	NLNet.2019.10.046.Standards
total budget (EUR) for completion of task and all subtasks:	1500
budget (EUR) for this task, excluding subtasks' budget:	1500
parent task for budget allocation:	213
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

    [jacob]
    amount = 500
    submitted = 2022-06-19
    paid = 2022-07-21

    [lkcl]
    amount = 1000
    submitted = 2022-06-16
    paid = 2022-07-21

Description Luke Kenneth Casson Leighton 2021-08-01 19:20:06 BST

we completely missed that bc uses CR fields, and thus could
be SVP64 Vectorised.

http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html

Comment 1 Luke Kenneth Casson Leighton 2021-08-08 15:57:25 BST

* design pretty much done
* added SVP64 24-bit RM decoder
* added sv_analysis to include bc and bclr SVP64 EXTRA modes

Comment 2 Luke Kenneth Casson Leighton 2021-08-08 22:06:57 BST

commit 4ad3c373c7fbd826447de0c64b558eeb2b53d174 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun Aug 8 22:06:45 2021 +0100

    add start of SVP64ASM encoder for sv.bc and sv.bclr
    TODO, sv.bca, sv.bclrl etc.

Comment 3 Luke Kenneth Casson Leighton 2021-08-12 16:10:26 BST

svstep mode looks like it is too "CISC-like".  although remarkably
similar to CTR auto-decrement, svstep auto-increment involves predicate
skipping as well as REMAP.  realistically this is too much, unfortunately.

currently implementing sv.bc in ISACaller, this is the first time that
CR fields have been involved.

it would have been much better to have started with sv.crand (etc)
before trying to do sv.bc because then the infrastructure for read/write
of CR Fields would already be in place.

realistically, the pseudocode needs to change from

   CR[BI+32]

to

    CRF(BI[0:2])[BI[3:4]]
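
A minimal sketch of why the two forms select the same bit, assuming CR is modelled as a plain 32-bit integer in MSB0 order (CR bit 32 most significant, CR0 the top 4-bit field). The helper names `cr_bit_flat`, `crf` and `cr_bit_by_field` are made up for illustration and are not ISACaller functions:

```python
def cr_bit_flat(cr: int, bi: int) -> int:
    """CR[BI+32]: CR treated as 32 bits numbered 32..63 in MSB0 order."""
    return (cr >> (31 - bi)) & 1


def crf(cr: int, field: int) -> int:
    """CRF(n): the 4-bit CR field n (0..7), CR0 being the most significant."""
    return (cr >> (4 * (7 - field))) & 0xF


def cr_bit_by_field(cr: int, bi: int) -> int:
    """CRF(BI[0:2])[BI[3:4]]: pick the CR field first, then the bit in it."""
    field = bi >> 2   # BI[0:2]: upper three bits of the 5-bit BI
    bit = bi & 0b11   # BI[3:4]: lower two bits of BI
    return (crf(cr, field) >> (3 - bit)) & 1


# both forms agree for every possible BI value
cr = 0x1234ABCD
assert all(cr_bit_flat(cr, bi) == cr_bit_by_field(cr, bi) for bi in range(32))
```

The field-based form is the one that lends itself to Vectorisation, because the field number can be offset per element while the bit-within-field selection stays the same.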

Comment 4 Luke Kenneth Casson Leighton 2022-05-29 17:04:20 BST

i've added a mode LRu (Link Register Update) which only gets LR updated
if the branch condition succeeds.  actually there is also the option to
only set LR if the branch condition *fails*

however the reason for today's thought is inspired from "why does LR even
exist in the first place" and it is down to making function calls
and returns from function calls possible.

in core.py there were three pieces of info identified:

* PC
* MSR
* SVSTATE

these three are what get saved/restored on context switch, which
i mention because it helps mentally get a handle on the execution
context.

MSR is global and not involved in functions.  PC gets saved in LR on
function calls... but what about SVSTATE? that needs saving as well.
therefore, the idea is:

* to add an extra SPR, LSVR Link SVState Register
* to extend svp64-branches to include save/swap of SVSTATE
  into/from LSVR
* to utilise more bits from RM (some of RM.EXTRA) to add
  an equivalent of the LK field, for LSVR, and corresponding
  variant of LRu.

if not included then any function call or loop involving
SVP64 will have to perform explicit backups of SVSTATE (mtspr,
mfspr) which will get very tedious very quickly.  this starts
to matter a lot more in VerticalFirst Mode than in Horizontal
because there is the potential for calling functions from
within VerticalFirst loops.

Comment 5 Luke Kenneth Casson Leighton 2022-05-29 17:31:08 BST

Pseudo-code:

    if AA then NIA  <-iea EXTS(LI || 0b00)
    else       NIA  <-iea CIA + EXTS(LI || 0b00)
    if LK then LR <-iea  CIA + 4

becomes:

    if AA then NIA  <-iea EXTS(LI || 0b00)
    else       NIA  <-iea CIA + EXTS(LI || 0b00)
    if LK then LR <-iea  CIA + 4
    if LSV then LSVR <- SVSTATE
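
The extended pseudocode can be tried out with a small Python model. This is only a sketch: `State`, `exts` and `branch` are illustrative names, LSVR is the proposed (not yet existing) SPR, and `<-iea` effective-address truncation is approximated by a 64-bit mask:

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class State:
    cia: int = 0      # current instruction address
    lr: int = 0       # Link Register
    svstate: int = 0  # SVSTATE SPR, treated as an opaque integer
    lsvr: int = 0     # proposed Link SVState Register


def exts(value: int, bits: int) -> int:
    """Sign-extend a 'bits'-wide value to a Python integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch(st: State, li: int, aa: bool, lk: bool, lsv: bool) -> int:
    """Model of b/ba/bl/bla with the extra LSV bit; returns NIA."""
    offset = exts(li << 2, 26)  # LI is 24 bits; LI || 0b00 is 26 bits
    nia = (offset if aa else st.cia + offset) & MASK64
    if lk:
        st.lr = (st.cia + 4) & MASK64
    if lsv:
        st.lsvr = st.svstate    # the one new line: save SVSTATE on call
    return nia


st = State(cia=0x1000, svstate=0xABC)
nia = branch(st, li=0x10, aa=False, lk=True, lsv=True)
assert (nia, st.lr, st.lsvr) == (0x1040, 0x1004, 0xABC)
```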

Comment 6 Luke Kenneth Casson Leighton 2022-05-30 08:40:14 BST

bclr pseudocode:

    if (mode_is_64bit) then M <- 0
    else M <- 32
    if ¬BO[2]  then CTR <- CTR - 1
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
    if LK then LR <-iea CIA + 4

has to become a little more sophisticated:

    lr_ok <- LK
    lsv_ok <- LSV
    if ctr_ok & cond_ok then
       NIA <-iea LR[0:61] || 0b00
       if SVRMmode.LSV then SVSTATE.next <- LSVR
       if SVRMmode.LRu then lr_ok <- ¬lr_ok
       if SVRMmode.LSVu then lsv_ok <- ¬lsv_ok
    if lr_ok then LR <-iea CIA + 4
    if lsv_ok then LSVR <- SVSTATE

* SVSTATE as well as the return address (CIA + 4) is being captured,
  in LSVR and LR respectively
* SVSTATE as well as LR has the option to be updated if
  the condition test succeeds
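
A rough executable model of the proposed sv.bclr behaviour, to make the lr_ok / lsv_ok inversion easier to follow. Assumptions are flagged in the comments: `lsv_ok` is taken to start from a one-bit LSV flag analogous to LK, the `m_lsv` / `m_lru` / `m_lsvu` parameters stand in for the SVRMmode bits, and `State` and `sv_bclr` are hypothetical names rather than anything in the simulator:

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class State:
    cia: int = 0
    lr: int = 0
    ctr: int = 0
    svstate: int = 0
    lsvr: int = 0  # proposed Link SVState Register


def sv_bclr(st, bo, cr_bit, lk, lsv, m_lsv=False, m_lru=False,
            m_lsvu=False, is_64bit=True):
    """Returns NIA and updates CTR, LR, LSVR and SVSTATE in place."""
    b = [(bo >> (4 - i)) & 1 for i in range(5)]  # BO[0] is the MSB
    if not b[2]:
        st.ctr = (st.ctr - 1) & MASK64
    ctr = st.ctr if is_64bit else st.ctr & 0xFFFFFFFF  # CTR[M:63]
    ctr_ok = bool(b[2]) or ((ctr != 0) != bool(b[3]))
    cond_ok = bool(b[0]) or (cr_bit == b[1])
    nia = (st.cia + 4) & MASK64
    next_svstate = st.svstate
    lr_ok, lsv_ok = bool(lk), bool(lsv)  # assumption: LSV bit mirrors LK
    if ctr_ok and cond_ok:
        nia = st.lr & MASK64 & ~0b11     # LR[0:61] || 0b00
        if m_lsv:
            next_svstate = st.lsvr       # restore SVSTATE on return
        if m_lru:
            lr_ok = not lr_ok
        if m_lsvu:
            lsv_ok = not lsv_ok
    if lr_ok:
        st.lr = (st.cia + 4) & MASK64
    if lsv_ok:
        st.lsvr = st.svstate             # the pre-branch SVSTATE is captured
    st.svstate = next_svstate
    return nia


# "branch always" (BO=0b10100) return that also swaps SVSTATE with LSVR
st = State(cia=0x2000, lr=0x1004, svstate=5, lsvr=9)
nia = sv_bclr(st, 0b10100, 0, lk=True, lsv=True, m_lsv=True)
assert (nia, st.lr, st.lsvr, st.svstate) == (0x1004, 0x2004, 5, 9)
```

Note that NIA and the restored SVSTATE are both read before LR and LSVR are overwritten, which is what makes the "swap" form possible in a single instruction.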

Comment 7 Jacob Lifshay 2022-05-31 05:28:59 BST

imho adding sv context switching within a sv vertical loop is too complex...

Comment 8 Luke Kenneth Casson Leighton 2022-05-31 09:40:36 BST

(In reply to Jacob Lifshay from comment #7)
> imho adding sv context switching within a sv vertical loop is too complex.

my feeling here is that it's going to be attempted whether we like it or not.
therefore, it's down to us to explore that space, in advance, and to
make it a smooth rather than a rough ride.

if a lot of mfspr/mtspr instructions on SVSTATE have to be used, that's
a rough ride.  mfspr SVSTATE bclr mtspr SVSTATE repeat repeat repeat repeat.

the concept comes from the ZOLC "stack", and with the low-level HDL
blocks being literally identical in ZOLC as they are for Matrix REMAP
it's conceptually not that hard.

full ZOLC on the other hand: *that's* hard, and i definitely don't want
to go that route, it'll take months to understand (just ZOLC alone)

saving of SVSTATE for stack/nesting purposes during a Vertical-First function,
in order to perform a localised piece of Horizontal-First work,
was always inevitable.  normally it's "only" the PC that's saved on-stack,
because that's the primary context.

now SVSTATE is part of that primary context.

it's not that hard to conceptualise, especially given that SVSTATE is
effectively a sub-program-counter: this has been in the documentation
and in presentations on SV for over two years as an introductory
statement.

Comment 9 Jacob Lifshay 2022-05-31 09:56:44 BST

(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #7)
> > imho adding sv context switching within a sv vertical loop is too complex.
> 
> my feeling here it's going to be attempted whether we like it or not.
> therefore, it's down to us to explore that space, in advance, and to
> make it a smooth rather than a rough ride.
> 
> if a lot of mfspr/mtspr instructions on SVSTATE have to be used, that's
> a rough ride.  mfspr SVSTATE bclr mtspr SVSTATE repeat repeat repeat repeat.

wait, so you mean a function call inside a sv vertical-first loop? hmm...

anyway:
imho we should limit SV vertical mode to just single loops, otherwise we'll gain more features and end up with something that takes 500k gates just in the sv decoder...we need to stop adding every feature we can think of to SV otherwise x86 will look at us and be proud that their spec is *only* 5000 pages... (idk the real number, but you get the idea)

Comment 10 Luke Kenneth Casson Leighton 2022-05-31 12:27:31 BST

(In reply to Jacob Lifshay from comment #9)

> wait, so you mean a function call inside a sv vertical-first loop? hmm...

uh-huhn :) at least, if not a public one (ABI compliant) a static one.
bare-minimum just above "some stuff went on the stack"

> anyway:
> imho we should limit SV vertical mode to just single loops,

ironically and hilariously doing so is more costly.
this comes down to the strict insistence of being barely
above scalar and on having context-switching rules

My 66000 VVM on the other hand, written by Mitch Alsup,
that *is* a single ZOLC Vertical-First explicit concept.
it's an explicit instruction (a pair), only LD/STs may
be Vectorised, the LOOP instruction tags the counter, loop
invariants, and the registers to be used for LD/ST.

once in-flight the hw is permitted to analyse the RS allocations
and perform automatic SIMD merging of multiple Vertically-allocated
loops. that can only be safely done thanks to the tagging.

now *that's* complicated (and awesome, despite the limitations) :)

SVP64 is a completely different paradigm, with the absolute bare
minimum in hardware.

Vertical-First in SVP64 is literally one bit
that says "no" to incrementing srcstep/dststep, and lets just one
instruction get issued with GPR(RA+srcstep) rather than GPR(RA).
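
As a deliberately simplified illustration of that difference (no predication, no REMAP, no element-width overrides, and a made-up two-operand instruction tuple), the two modes are the same pair of loops nested in opposite order:

```python
def horizontal_first(program, regs, vl):
    """Each instruction runs over all VL elements before the next issues."""
    for op, rt, ra in program:
        for step in range(vl):            # srcstep/dststep auto-increment
            regs[rt + step] = op(regs[ra + step])


def vertical_first(program, regs, vl):
    """The step is held still while every instruction issues once;
    only an explicit step (svstep in SVP64) moves to the next element."""
    for step in range(vl):
        for op, rt, ra in program:
            regs[rt + step] = op(regs[ra + step])  # GPR(RA+srcstep)


def inc(x):
    return x + 1


def dbl(x):
    return x * 2


# (operation, RT, RA): r8.. = r0.. + 1, then r16.. = r8.. * 2
program = [(inc, 8, 0), (dbl, 16, 8)]
regs_h, regs_v = list(range(32)), list(range(32))
horizontal_first(program, regs_h, vl=4)
vertical_first(program, regs_v, vl=4)
assert regs_h == regs_v
assert regs_v[16:20] == [2, 4, 6, 8]
```

With non-overlapping register ranges, as here, both orders give identical results; the point of the sketch is only that Vertical-First needs no extra machinery beyond not advancing the step.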

context-switching whether it be a function or it be an interrupt
is literally a matter of saving PC and SVSTATE. (a bit more if REMAP
is engaged, obviously, but that's software's problem to solve)
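
A toy sketch of what "saving PC and SVSTATE" amounts to, with MSR included as in the list from comment 4. `Core`, `Context`, `save_context` and `restore_context` are hypothetical names, SVSTATE is treated as an opaque integer, and REMAP state is left out on the assumption that software handles it:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    pc: int
    msr: int
    svstate: int


@dataclass
class Core:
    pc: int = 0
    msr: int = 0
    svstate: int = 0


def save_context(core: Core) -> Context:
    """Snapshot the three pieces of state that define where execution is."""
    return Context(core.pc, core.msr, core.svstate)


def restore_context(core: Core, ctx: Context) -> None:
    """Put the snapshot back, resuming mid-Vector if SVSTATE says so."""
    core.pc, core.msr, core.svstate = ctx.pc, ctx.msr, ctx.svstate


core = Core(pc=0x1000, msr=0x9000, svstate=0x42)
saved = save_context(core)
core.pc, core.svstate = 0x8000, 0   # interrupt handler or callee runs here
restore_context(core, saved)
assert (core.pc, core.msr, core.svstate) == (0x1000, 0x9000, 0x42)
```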

> otherwise we'll
> gain more features and end up with something that takes 500k gates just in
> the sv decoder...

surprisingly adding LSVR is not that gate-hungry.  it is completely
independent, and already the branch instructions have a different RM
Decode path, but the bit that's equivalent to LK is not tied *to* LK,
it's independent.

> we need to stop adding every feature we can think of to SV
> otherwise x86 will look at us and be proud that their spec is *only* 5000
> pages... (idk the real number, but you get the idea)

120 for everything including 32-bit instructions from bitmanip etc.

most of the effort involved here is stopping brain-melt. in Cray-style
(Horizontal-First) Vectors *nobody* does function calls, it's too much.
Vertical-First is so close to scalar it's actually harder to
prevent fn calls, and would need explicit hardware and pretty much
a redesign to prevent.

now, ZOLC, that *is* going to be a whole new bundle-o-fun.  ZOLC
is definitely on the other side of a "large funding" fence.

Comment 11 Luke Kenneth Casson Leighton 2022-05-31 14:21:44 BST

(In reply to Luke Kenneth Casson Leighton from comment #10)

> surprisingly adding LSVR is not that gate-hungry.  it is completely
> independent, and already the branch instructions have a different RM
> Decode path,

there are 4 different decode paths where the 24-bit RM field is interpreted
differently: 1) arithmetic 2) CRs 3) LDST 4) branches. yes it is annoying that
partial decode of the suffix is required in order to begin decoding the
prefix, but as in the v3.1 prefix more bits would be required to provide the
type identification (MTRR, 8LS etc) and we would need to take yet another
major opcode to do it (26 bits needed)
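
One plausible reading of the "26 bits needed" figure, worked through as arithmetic. This assumes the extra bits are a 2-bit type field selecting one of the four decode paths, and that a major opcode occupies 6 bits of a 32-bit word; neither assumption is spelled out in the comment itself:

```python
RM_BITS = 24
DECODE_PATHS = ("arithmetic", "CR", "LDST", "branch")

# bits needed to name one of the four paths explicitly in the prefix
type_bits = (len(DECODE_PATHS) - 1).bit_length()

# RM plus an explicit type identifier
needed = RM_BITS + type_bits

# what a 32-bit word has left after a 6-bit major opcode
WORD_BITS = 32
MAJOR_OPCODE_BITS = 6
available = WORD_BITS - MAJOR_OPCODE_BITS

assert type_bits == 2
assert needed == 26
assert available == needed
```

Under those assumptions an explicit type field would consume everything a full major opcode has to offer, which is why partial decode of the suffix is the cheaper trade-off.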