Bug 905 - create Scalar reg access encoding (SVP64-Single)
Summary: create Scalar reg access encoding (SVP64-Single)
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: http://lists.libre-soc.org/pipermail/...
Depends on:
Blocks: 952 1212
Reported: 2022-08-07 12:36 BST by Luke Kenneth Casson Leighton
Modified: 2024-02-19 12:54 GMT

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2022-08-07 12:36:19 BST
use one of the 64 EXT001 areas to allow full access to all
scalar regs.  needs an entirely new type of EXTRA encoding
exclusively dedicated to scalar.

* must be capable of reaching
  CR0..CR127 even for 3-arg CR operations.
* must be capable of reaching GPR0..127 and
  FPR0..127 even for 4-arg ALU ops (fmadd, isel)

provide predication (just one bit, single-only not twin),
element-width-overrides (source and dest),
saturation.

edit: also up for consideration / discussion: should subvl
be an option?

edit: https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-October/005342.html
an idea for having sv.op RT.scalar RA.scalar RB.scalar be
an encoding for "hard-set VL=1".
Comment 1 Luke Kenneth Casson Leighton 2022-09-10 22:23:30 BST
thinking out loud for bit allocations

* 12 EXTRA4 x3
* 4 src/dst elwidth (2 bits each)
* 4 predicate mask (1 type, 3 select)
* 1 saturate

totals 21, 3 spare. EXTRA5? leave as reserved?
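
The tally above can be checked with a short sketch. The 24-bit total is an assumption inferred from "totals 21, 3 spare"; the field names are only labels for illustration.

```python
# Bit budget from the comment above. TOTAL_BITS = 24 is an assumption
# inferred from "totals 21, 3 spare", not a value stated in the comment.
TOTAL_BITS = 24

fields = {
    "EXTRA4 x3": 4 * 3,                       # three operands, 4 bits each
    "elwidth src/dst": 2 * 2,                 # 2 bits source, 2 bits dest
    "predicate (1 type + 3 select)": 1 + 3,   # single predicate mask
    "saturate": 1,
}

used = sum(fields.values())
spare = TOTAL_BITS - used

for name, bits in fields.items():
    print(f"{bits:2d}  {name}")
print(f"used={used} spare={spare}")
```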
Comment 2 Jacob Lifshay 2022-09-11 06:11:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> thinking out loud for bitallocations
> 
> * 12 EXTRA4x3

alternately 12 EXTRA3 x4

> * 4 src/dst 2x2 elwidth
> * 4 predicate mask 1 type 3 select
> * 1 saturate

2 saturate -- unsigned and signed saturate differently, and we'll want both. this should be encoded as a signed/unsigned bit and a saturate bit, since the signed/unsigned bit is also useful for deciding whether to sign- or zero-extend inputs/outputs when src/dest elwid differ.

also fp ops have different values worth saturating to:
* standard 0.0 to 1.0
* standard -1.0 to 1.0
* i32? u32? i8? u8? i16? u16?

> 
> totals 21, 3 spare. EXTRA5? leave as reserved?

imho we should leave as reserved because that reduces the encoding space required to 1/4 of svp64's (accounting for 2 saturate bits), greatly reducing opcode pressure.
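
A minimal sketch of the two points in this comment, under stated assumptions: the `saturate` helper and its parameter names are hypothetical, and the 24-bit figure for svp64 is inferred from the "3 spare" tally in comment 1. It shows why one signed/unsigned bit plus one saturate bit covers both integer saturation modes, and where the 1/4 figure comes from.

```python
def saturate(value: int, width: int, signed: bool) -> int:
    """Clamp an integer to the range of a `width`-bit signed or unsigned type.

    Hypothetical helper: `signed` stands in for the proposed
    signed/unsigned bit; calling this at all stands in for the saturate bit.
    """
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    return max(lo, min(hi, value))


# Encoding-space fraction: 21 bits from comment 1 plus a second saturate
# bit gives 22 bits used, against an assumed 24 bits for svp64.
SVP64_BITS = 24           # assumption, inferred from "totals 21, 3 spare"
single_bits = 21 + 1
fraction = 2 ** single_bits / 2 ** SVP64_BITS

print(saturate(300, 8, signed=False), saturate(300, 8, signed=True))
print(fraction)
```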
Comment 3 Jacob Lifshay 2022-09-11 06:36:47 BST
(In reply to Jacob Lifshay from comment #2)
> > 
> > totals 21, 3 spare. EXTRA5? leave as reserved?
> 
> imho we should leave as reserved because that reduces the amount of space
> required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> reducing opcode pressure.

though maybe use 1 bit so loads/stores can be pc-relative, like the v3.1 prefix's R bit. this can greatly reduce needed instruction counts to access constants/variables, though imho we should differ from v3.1's R bit:
we should always add (RA|0) and just tack on CIA as a new addend, rather than replacing (RA|0) with CIA, this allows stuff like keeping jump tables and other lookup tables easily accessible:

pseudocode i'm proposing for SV-single:
if SVSR = 0 then  # SV-single's R field
    addr <- (RA|0) + EXTS64(D)  # normal ld/std
else
    addr <- CIA + (RA|0) + EXTS64(D)  # pc-rel ld/std

note that the CIA + D add can be done well before the execute stage, since both are known values at the decode stage, avoiding extra latency due to more complex addr calculation in ld/st pipe
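
The proposed address calculation can be modelled as below. This is only a sketch: the function names, the 16-bit displacement width and the convention that the caller has already resolved (RA|0) are assumptions made for illustration, not part of the proposal.

```python
MASK64 = (1 << 64) - 1


def exts64(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def effective_address(ra_or_zero: int, d_field: int, cia: int,
                      svsr: int, d_bits: int = 16) -> int:
    """Model of the proposed SV-single ld/st address calculation.

    ra_or_zero: the already-resolved (RA|0) value.
    svsr:       the proposed pc-relative bit; when set, CIA is added as
                an extra addend rather than replacing (RA|0).
    d_bits:     displacement width, assumed to be 16 for this sketch.
    """
    addr = ra_or_zero + exts64(d_field, d_bits)
    if svsr:
        # CIA + D depends only on values known at decode time.
        addr += cia
    return addr & MASK64


print(hex(effective_address(8, 0x10, 0x1000, svsr=0)))
print(hex(effective_address(8, 0x10, 0x1000, svsr=1)))
```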

example jump table with pc-rel, r3 on input is 0, 8, or 16:

table:
.quad label1
.quad label2
.quad label3
...
code:
svs.ld r3, table@pcrel(r3), 1
mtctr r3
bctr

equivalent v3.1 code:
table:
.quad label1
.quad label2
.quad label3
...
code:
paddi r3, r3, table@pcrel, 1
ld r3, 0(r3)  # can't use pld R=1 since it ignores input registers
mtctr r3
bctr
Comment 4 Jacob Lifshay 2022-09-11 06:40:33 BST
(In reply to Jacob Lifshay from comment #3)
> equivalent v3.1 code:
> table:
> .quad label1
> .quad label2
> .quad label3
> ...
> code:
> paddi r3, r3, table@pcrel, 1
> ld r3, 0(r3)  # can't use pld R=1 since it ignores input registers
> mtctr r3
> bctr

actually can't use paddi like that since it has the same issue as pld R=1

fixed:
code:
paddi r4, 0, table@pcrel, 1
ldx r3, r3, r4  # can't use pld R=1 since it ignores input registers
mtctr r3
bctr
Comment 5 Luke Kenneth Casson Leighton 2022-09-11 11:14:47 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > thinking out loud for bitallocations
> > 
> > * 12 EXTRA4x3
> 
> alternately 12 EXTRA3 x4

if you meant "3 extra scalar bits", yes.

[SVP64.EXTRA3 contains scalar/vector - vector is redundant for SVP64Single].

to get up to 2^7 there are 4 extra bits required
because all CR fields are only 3 bit.
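
The arithmetic here (3-bit CR fields need 4 extra bits, 5-bit GPR/FPR fields need 2) can be sketched as follows. Placing the extra bits above the original field is an assumption made for illustration; the thread does not fix the bit ordering, and `extend_reg` is a hypothetical helper.

```python
def extend_reg(field: int, extra: int, field_bits: int, extra_bits: int) -> int:
    """Combine an instruction register field with extra prefix bits.

    Assumption: the extra bits form the most significant part of the
    extended register number. The thread does not specify this.
    """
    if not 0 <= field < (1 << field_bits):
        raise ValueError("field out of range")
    if not 0 <= extra < (1 << extra_bits):
        raise ValueError("extra out of range")
    return (extra << field_bits) | field


# 5-bit GPR/FPR field + 2 extra bits -> 7 bits, registers 0..127
print(extend_reg(31, 3, field_bits=5, extra_bits=2))
# 3-bit CR field + 4 extra bits -> 7 bits, CR0..CR127
print(extend_reg(7, 15, field_bits=3, extra_bits=4))
```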

> 2 saturate -- unsigned/signed saturate differently, we'll want both -- this
> should be encoded as a signed/unsigned bit and a saturate bit, since the
> signed/unsigned bit is also useful for deciding to sign/zero extend
> inputs/outputs when src/destelwid differ.

ahh ok

> also fp ops have different values worth saturating to:
> * standard 0.0 to 1.0
> * standard -1.0 to 1.0
> * i32? u32? i8? u8? i16? u16?

change of meaning of the operation must not occur.  saturate to
those values *and still be a float* (i.e. convert to int, saturate,
then convert back to float), not a problem.  change the meaning
of the operation to store in an *int*, we can't do that.

but... hmmm... saturating on some of the fp-to-int operations,
yes not a problem there.


> > 
> > totals 21, 3 spare. EXTRA5? leave as reserved?
> 
> imho we should leave as reserved because that reduces the amount of space
> required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> reducing opcode pressure.

there's no reserved encoding *at all* for SVP64 so yes, good to start
doing that.
Comment 6 Luke Kenneth Casson Leighton 2022-09-11 11:24:16 BST
(In reply to Jacob Lifshay from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > > 
> > > totals 21, 3 spare. EXTRA5? leave as reserved?
> > 
> > imho we should leave as reserved because that reduces the amount of space
> > required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> > reducing opcode pressure.
> 
> though maybe use 1 bit so loads/stores can be pc-relative, like the v3.1
> prefix's R bit. this can greatly reduce needed instruction counts to access
> constants/variables, though imho we should differ from v3.1's R bit:
> we should always add (RA|0) and just tack on CIA as a new addend, rather
> than replacing (RA|0) with CIA, this allows stuff like keeping jump tables
> and other lookup tables easily accessible:

iiinteresting. like it.

especially on ldst-with-update, that would give an address in RA that
could then be used with normal-ldst

but... ah hang on, remember the nightmare lesson learned from trying
to do ld-st-with-shift?  this pc-relative bit is along those lines:
it's modifying instruction behaviour, and prohibiting vectorised-variants
from being able to do them.

it would actually be better i feel to propose these ldst-with-pc-relative
as new instructions (within the next EXT2nn group) and let them be
Vectorised (orthogonally).
Comment 7 Luke Kenneth Casson Leighton 2022-09-11 11:41:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)

> it would actually be better i feel to propose these ldst-with-pc-relative
> as new instructions (within the next EXT2nn group) and let them be
> Vectorised (orthogonally).

(it already occurred to me to propose LDST-with-shift within the
 new EXT2nn group to be created under
 https://libre-soc.org/openpower/sv/rfc/ls001/ )
Comment 8 Jacob Lifshay 2022-09-11 13:45:25 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > thinking out loud for bitallocations
> > > 
> > > * 12 EXTRA4x3
> > 
> > alternately 12 EXTRA3 x4
> 
> if you meant "3 extra scalar bits", yes.
> 
> [SVP64.EXTRA3 contains scalar/vector - vector is redundant for SVP64Single].

ah, yeah, i forgot that...i had meant SVP64's EXTRA3, but i guess we'd only need 2 bits for augmenting 5-bit register fields.

> 
> to get up to 2^7 there are 4 extra bits required
> because all CR fields are only 3 bit.
yes

> > 2 saturate -- unsigned/signed saturate differently, we'll want both -- this
> > should be encoded as a signed/unsigned bit and a saturate bit, since the
> > signed/unsigned bit is also useful for deciding to sign/zero extend
> > inputs/outputs when src/destelwid differ.
> 
> ahh ok
> 
> > also fp ops have different values worth saturating to:
> > * standard 0.0 to 1.0
> > * standard -1.0 to 1.0
> > * i32? u32? i8? u8? i16? u16?
> 
> change of meaning of the operation must not occur.  saturate to
> those values *and still be a float* (i.e. convert to int, saturate,
> then convert back to float), not a problem.

i meant to saturate float to the range of those int types, not to convert to an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes to 255.0.
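
A small sketch of that behaviour, using the two values from the example above. The helper name and parameters are hypothetical, and NaN handling is not addressed in the thread, so it is left out here as well.

```python
def saturate_float_to_int_range(x: float, width: int, signed: bool) -> float:
    """Clamp a float to the value range of an integer type.

    The result stays a float: no conversion to int takes place, so
    in-range fractional values pass through unchanged.
    NaN inputs are not handled; the thread does not specify them.
    """
    if signed:
        lo, hi = float(-(1 << (width - 1))), float((1 << (width - 1)) - 1)
    else:
        lo, hi = 0.0, float((1 << width) - 1)
    return max(lo, min(hi, x))


print(saturate_float_to_int_range(23.5, 8, signed=False))
print(saturate_float_to_int_range(300.25, 8, signed=False))
```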

>  change the meaning
> of the operation to store in an *int*, we can't do that.

yup, not my intention.
> 
> but... hmmm... saturating on some of the fp-to-int operations,
> yes not a problem there.

fp-to-int operations would use standard int saturation modes, not fp saturation modes -- the destination isn't a float after all.
Comment 9 Jacob Lifshay 2022-09-11 14:18:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> 
> > it would actually be better i feel to propose these ldst-with-pc-relative
> > as new instructions (within the next EXT2nn group) and let them be
> > Vectorised (orthogonally).
> 
> (it already occurred to me to propose LDST-with-shift within the
>  new EXT2nn group to be created under
>  https://libre-soc.org/openpower/sv/rfc/ls001/ )

if you're working on the ldst-with-shift, please reserve 1 setting for range-checked ld/st with 32/64-bit addresses for webassembly...the range to check against will be stored in user-visible sprs (details to be specified later).

e.g. if ld/st shift is:
| prefix PO 0..5 | ext2nn 6..7 | pcrel R 8 | immhi? idk 13..31 |
| suffix PO 0..5 | rt 6..10 | ra 11..15 | rb 16..20 | shift 21..24 | immlo 25..28 | XO 29..31 |

then shift can be (note ld/st shift can access arrays of structs too, so shift size != ld/st size makes sense):
0 -- rb indexes bytes
1 -- rb indexes half-words (2 bytes)
2 -- rb indexes words (4 bytes)
3 -- rb indexes double words (8 bytes)
4 -- rb indexes quad words (16 bytes)
5 -- rb indexes oct words (32 bytes)
6 -- rb indexes 16-words (64 bytes)
7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)
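
As a rough model of the shift settings listed above: the sketch below assumes the address is formed as RA + (RB << shift) and ignores the immediate fields, neither of which is spelled out in the comment. Setting 7 is left unimplemented because its behaviour is still TBD.

```python
MASK64 = (1 << 64) - 1
WASM_MODE = 7  # setting 7 in the list above; semantics TBD


def shifted_ea(ra_value: int, rb_value: int, shift: int) -> int:
    """Effective address for the sketched ld/st-with-shift.

    Assumption: EA = RA + (RB << shift), with the immediate ignored.
    Settings 0..6 scale RB by 1, 2, 4, 8, 16, 32 or 64 bytes.
    """
    if not 0 <= shift <= 7:
        raise ValueError("shift is a 3-bit setting")
    if shift == WASM_MODE:
        raise NotImplementedError("wasm mode is determined by SPRs (TBD)")
    return (ra_value + (rb_value << shift)) & MASK64


# Index 3 into an array of words (4 bytes each) based at 0x1000
print(hex(shifted_ea(0x1000, 3, 2)))
```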
Comment 10 Luke Kenneth Casson Leighton 2022-09-11 14:30:26 BST
(In reply to Jacob Lifshay from comment #8)

> i meant to saturate float to the range of those int types, not to convert to
> an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> to 255.0.

*click*. using dest elwidth.  i like it.
is starting to encroach on the reserved space though.

if nothing else they can be added on top of those
MODULO-FP operations.

(In reply to Jacob Lifshay from comment #9)
> if you're working on the ldst-with-shift, 

heck no.  way too much else to do.

> 7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)

SPRs controlling something as low-level and critical as LDST now requiring
*five* 64-bit read ports and two writes? not a snowball in hell's chance
that would get through the ISA WG, the IBM POWER Architects would take
an extremely dim view of such costly operations.

ldst-with-shift is already pushing the limits.

you really do need to think through the regfile port allocation and
Hazard Management implications, jacob,
you can't just blithely throw features in without thinking through
the micro-architectural implications.
Comment 11 Jacob Lifshay 2022-09-11 14:41:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #8)
> 
> > i meant to saturate float to the range of those int types, not to convert to
> > an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> > to 255.0.
> 
> *click*. using dest elwidth.

*not* using dest elwid (unless it's repurposed for this and not elwid when saturation is enabled). if you have a f32 you may want to saturate to u16 range with a f32 result...

>  i like it.
> is starting to encroach on the reserved space though.
> 
> if nothing else they can be added on top of those
> MODULO-FP operations.
> 
> (In reply to Jacob Lifshay from comment #9)
> > if you're working on the ldst-with-shift, 
> 
> heck no.  way too much else to do.
> 
> > 7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)
> 
> SPRs controlling something as low-level and critical as LDST now requiring
> *five* 64-bit read ports and two writes? not a snowball in hell's chance
> that would get through the ISA WG, the IBM POWER Architects would take
> an extremely dim view of such costly operations.

it's fine if they're somewhat more expensive, wasm ld/st isn't a standard ld/st op. they are necessary for wasm performance though, unless you like a 30% performance hit due to constantly needing several more instructions for each ld/st?

> 
> ldst-with-shift is already pushing the limits.
> 
> you really do need to think through the regfile port allocation and
> Hazard Management implications, jacob,
> you can't just blithely throw features in without thinking through
> the micro-architectural implications.

those sprs can be assumed to rarely change, allowing them to be cached at the ld/st alus and requiring a full pipeline flush each time they're changed -- that way they don't need dependency tracking or extra register read ports or a pile of extra wires into the ld/st alus.
Comment 12 Jacob Lifshay 2022-09-11 14:52:43 BST
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > (In reply to Jacob Lifshay from comment #8)
> > 
> > > i meant to saturate float to the range of those int types, not to convert to
> > > an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> > > to 255.0.
> > 
> > *click*. using dest elwidth.
> 
> *not* using dest elwid (unless it's repurposed for this and not elwid when
> saturation is enabled). if you have a f32 you may want to saturate to u16
> range with a f32 result...

actually, if there's room, adding that to svp64 for saturation (changing dest elwid to instead be range to saturate to when saturation is enabled for fp dests) is imho a good idea! just make sure you can still saturate to 0.0 to 1.0 with fp destination type different than src type -- iirc a/v needs that.
Comment 13 Luke Kenneth Casson Leighton 2022-10-03 11:54:52 BST
full review needed, answering question:

    if sv.op RT.scalar RA.scalar RB.scalar is set to "VL=1" is anything lost?

https://libre-soc.org/openpower/sv/svp64/discussion/
Comment 14 Luke Kenneth Casson Leighton 2022-10-04 16:14:23 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> full review needed, answering question:
> 
>     if sv.op RT.scalar RA.scalar RB.scalar is set to "VL=1" is anything lost?
> 
> https://libre-soc.org/openpower/sv/svp64/discussion/

answer is yes: under non-zeroing predication, the "are any bits set"
test has to become "is first bit set" for any
"sv.op/pm=xx SCALAR,SCALAR,SCALAR" operation.

this may not be such a great loss compared to being able to drop
recurring "setvl VL=1" instructions
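
One reading of the difference, as a sketch only: the function names are hypothetical, and the exact SVP64 predication rules are not restated in this thread. The point is that a mask with bit 0 clear but a later bit set gives different answers under the two tests.

```python
def runs_any_bit(mask: int, vl: int) -> bool:
    """Current behaviour as described: does any of the first VL bits fire?"""
    return (mask & ((1 << vl) - 1)) != 0


def runs_first_bit(mask: int) -> bool:
    """Behaviour under a hard-set VL=1: only predicate bit 0 is examined."""
    return (mask & 1) != 0


mask = 0b0100  # bit 0 clear, bit 2 set
print(runs_any_bit(mask, vl=4), runs_first_bit(mask))
```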
Comment 15 Luke Kenneth Casson Leighton 2023-04-01 00:52:42 BST
additional mode-bit needed for LD/ST-update PI Mode (post-increment).