use one of the 64 EXT001 areas to allow full access to all scalar regs. needs an entirely new type of EXTRA encoding exclusively dedicated to scalar.

* must be capable of reaching CR0..CR127 even for 3-arg CR operations.
* must be capable of reaching GPR0..127 and FPR0..127 even for 4-arg ALU ops (fmadd, isel)

provide predication (just one bit, single-only not twin), element-width-overrides (source and dest), saturation.

edit: also up for consideration / discussion: should subvl be an option?

edit: https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-October/005342.html
an idea for having sv.op RT.scalar RA.scalar RB.scalar be an encoding for "hard-set VL=1".
thinking out loud for bitallocations

* 12 EXTRA4x3
* 4 src/dst 2x2 elwidth
* 4 predicate mask 1 type 3 select
* 1 saturate

totals 21, 3 spare. EXTRA5? leave as reserved?
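The tally above can be checked mechanically. This is a minimal sketch only: the field names and the 24-bit budget are assumptions for illustration, inferred from "totals 21, 3 spare", not a confirmed layout.

```python
# Field widths from the tally above. The key names and the 24-bit budget
# are assumptions made for this sketch, inferred from "totals 21, 3 spare".
PREFIX_BUDGET_BITS = 24


def tally(saturate_bits=1):
    """Return (used, spare) bit counts for the proposed allocation."""
    fields = {
        "extra": 4 * 3,          # EXTRA4 x3
        "elwidth": 2 * 2,        # src/dst, 2 bits each
        "predicate": 1 + 3,      # 1 type bit + 3 select bits
        "saturate": saturate_bits,
    }
    used = sum(fields.values())
    return used, PREFIX_BUDGET_BITS - used


used, spare = tally()
print(f"used={used} spare={spare}")
```

The `saturate_bits` parameter is only there so that a wider saturate field can be tried without editing the table.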
(In reply to Luke Kenneth Casson Leighton from comment #1)
> thinking out loud for bitallocations
>
> * 12 EXTRA4x3

alternately 12 EXTRA3 x4

> * 4 src/dst 2x2 elwidth
> * 4 predicate mask 1 type 3 select
> * 1 saturate

2 saturate -- unsigned/signed saturate differently, we'll want both -- this should be encoded as a signed/unsigned bit and a saturate bit, since the signed/unsigned bit is also useful for deciding to sign/zero extend inputs/outputs when src/dest elwid differ.

also fp ops have different values worth saturating to:
* standard 0.0 to 1.0
* standard -1.0 to 1.0
* i32? u32? i8? u8? i16? u16?

>
> totals 21, 3 spare. EXTRA5? leave as reserved?

imho we should leave as reserved because that reduces the amount of space required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly reducing opcode pressure.
(In reply to Jacob Lifshay from comment #2)
> >
> > totals 21, 3 spare. EXTRA5? leave as reserved?
>
> imho we should leave as reserved because that reduces the amount of space
> required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> reducing opcode pressure.

though maybe use 1 bit so loads/stores can be pc-relative, like the v3.1 prefix's R bit. this can greatly reduce needed instruction counts to access constants/variables, though imho we should differ from v3.1's R bit:
we should always add (RA|0) and just tack on CIA as a new addend, rather than replacing (RA|0) with CIA, this allows stuff like keeping jump tables and other lookup tables easily accessible:

pseudocode i'm proposing for SV-single:
if SVSR = 0 then  # SV-single's R field
    addr <- (RA|0) + EXTS64(D)  # normal ld/std
else
    addr <- CIA + (RA|0) + EXTS64(D)  # pc-rel ld/std

note that the CIA + D add can be done well before the execute stage, since both are known values at the decode stage, avoiding extra latency due to more complex addr calculation in ld/st pipe

example jump table with pc-rel, r3 on input is 0, 8, or 16:
table:
    .quad label1
    .quad label2
    .quad label3
    ...
code:
    svs.ld r3, table@pcrel(r3), 1
    mtctr r3
    bctr

equivalent v3.1 code:
table:
    .quad label1
    .quad label2
    .quad label3
    ...
code:
    paddi r3, r3, table@pcrel, 1
    ld r3, 0(r3) # can't use pld R=1 since it ignores input registers
    mtctr r3
    bctr
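The proposed address calculation can be modelled in a few lines. This is a sketch under stated assumptions, not a definitive specification: the function names are hypothetical, the 16-bit width of D is assumed purely for illustration, and `ra_value` stands for (RA|0), i.e. the caller passes 0 when the RA field is 0.

```python
MASK64 = (1 << 64) - 1


def exts64(value, bits):
    """Sign-extend a `bits`-wide field to a (possibly negative) Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def svs_ld_addr(ra_value, d_field, cia, svsr, d_bits=16):
    """Effective address for the proposed SV-single pc-relative ld/std.

    ra_value: value of (RA|0); pass 0 when the RA field is 0.
    d_field:  raw displacement field (d_bits wide; 16 is an assumption).
    cia:      current instruction address.
    svsr:     the proposed R-like bit; 0 = normal, 1 = add CIA as well.
    """
    addr = ra_value + exts64(d_field, d_bits)
    if svsr:
        # CIA is an extra addend, it does not replace (RA|0)
        addr += cia
    return addr & MASK64


# jump-table style access: r3 holds the byte offset of the wanted entry
print(hex(svs_ld_addr(ra_value=8, d_field=0x10, cia=0x1000, svsr=1)))
```

Since `cia` and `d_field` are both known at decode time, their sum is the part that could be precomputed ahead of the execute stage, as the comment notes.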
(In reply to Jacob Lifshay from comment #3)
> equivalent v3.1 code:
> table:
>     .quad label1
>     .quad label2
>     .quad label3
>     ...
> code:
>     paddi r3, r3, table@pcrel, 1
>     ld r3, 0(r3) # can't use pld R=1 since it ignores input registers
>     mtctr r3
>     bctr

actually can't use paddi like that since it has the same issue as pld R=1

fixed:
code:
    paddi r4, 0, table@pcrel, 1
    ldx r3, r3, r4 # can't use pld R=1 since it ignores input registers
    mtctr r3
    bctr
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > thinking out loud for bitallocations
> >
> > * 12 EXTRA4x3
>
> alternately 12 EXTRA3 x4

if you meant "3 extra scalar bits", yes.

[SVP64.EXTRA3 contains scalar/vector - vector is redundant for SVP64Single].

to get up to 2^7 there are 4 extra bits required because all CR fields are only 3 bit.

> 2 saturate -- unsigned/signed saturate differently, we'll want both -- this
> should be encoded as a signed/unsigned bit and a saturate bit, since the
> signed/unsigned bit is also useful for deciding to sign/zero extend
> inputs/outputs when src/dest elwid differ.

ahh ok

> also fp ops have different values worth saturating to:
> * standard 0.0 to 1.0
> * standard -1.0 to 1.0
> * i32? u32? i8? u8? i16? u16?

change of meaning of the operation must not occur. saturate to those values *and still be a float* (i.e. convert to int, saturate, then convert back to float), not a problem. change the meaning of the operation to store in an *int*, we can't do that.

but... hmmm... saturating on some of the fp-to-int operations, yes not a problem there.

> >
> > totals 21, 3 spare. EXTRA5? leave as reserved?
>
> imho we should leave as reserved because that reduces the amount of space
> required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> reducing opcode pressure.

there's no reserved encoding *at all* for SVP64 so yes, good to start doing that.
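The register-reach arithmetic in this comment (a 3-bit CR field needs 4 extra bits to reach 2^7) can be written out as below. By the same arithmetic a 5-bit GPR/FPR field needs 2. The helper names are hypothetical, and putting the extra bits in the most-significant positions is an assumption of this sketch: the thread does not fix how the extra bits are combined with the base field.

```python
def extra_bits_needed(field_bits, target_regs=128):
    """Extra bits required for a field_bits-wide field to reach target_regs."""
    total_bits = (target_regs - 1).bit_length()
    return max(0, total_bits - field_bits)


def extend_regnum(field, extra, field_bits):
    """Combine a base register field with extra bits.

    ASSUMPTION: the extra bits are the most-significant bits. The thread
    does not specify the placement; this is for illustration only.
    """
    if not 0 <= field < (1 << field_bits):
        raise ValueError("field out of range")
    return (extra << field_bits) | field


print(extra_bits_needed(3), extra_bits_needed(5))
```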
(In reply to Jacob Lifshay from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > >
> > > totals 21, 3 spare. EXTRA5? leave as reserved?
> >
> > imho we should leave as reserved because that reduces the amount of space
> > required to 1/4th (accounting for 2 saturate bits) as much as svp64, greatly
> > reducing opcode pressure.
>
> though maybe use 1 bit so loads/stores can be pc-relative, like the v3.1
> prefix's R bit. this can greatly reduce needed instruction counts to access
> constants/variables, though imho we should differ from v3.1's R bit:
> we should always add (RA|0) and just tack on CIA as a new addend, rather
> than replacing (RA|0) with CIA, this allows stuff like keeping jump tables
> and other lookup tables easily accessible:

iiinteresting. like it. especially on ldst-with-update, that would give an address in RA that could then be used with normal-ldst

but... ah hang on, remember the nightmare lesson learned from trying to do ld-st-with-shift? this pc-relative bit is along those lines: it's modifying instruction behaviour, and prohibiting vectorised-variants from being able to do them.

it would actually be better i feel to propose these ldst-with-pc-relative as new instructions (within the next EXT2nn group) and let them be Vectorised (orthogonally).
(In reply to Luke Kenneth Casson Leighton from comment #6)
> it would actually be better i feel to propose these ldst-with-pc-relative
> as new instructions (within the next EXT2nn group) and let them be
> Vectorised (orthogonally).

(it already occurred to me to propose LDST-with-shift within the new EXT2nn group to be created under https://libre-soc.org/openpower/sv/rfc/ls001/ )
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > thinking out loud for bitallocations
> > >
> > > * 12 EXTRA4x3
> >
> > alternately 12 EXTRA3 x4
>
> if you meant "3 extra scalar bits", yes.
>
> [SVP64.EXTRA3 contains scalar/vector - vector is redundant for SVP64Single].

ah, yeah, i forgot that...i had meant SVP64's EXTRA3, but i guess we'd only need 2 bits for augmenting 5-bit register fields.

>
> to get up to 2^7 there are 4 extra bits required
> because all CR fields are only 3 bit.

yes

> > 2 saturate -- unsigned/signed saturate differently, we'll want both -- this
> > should be encoded as a signed/unsigned bit and a saturate bit, since the
> > signed/unsigned bit is also useful for deciding to sign/zero extend
> > inputs/outputs when src/dest elwid differ.
>
> ahh ok
>
> > also fp ops have different values worth saturating to:
> > * standard 0.0 to 1.0
> > * standard -1.0 to 1.0
> > * i32? u32? i8? u8? i16? u16?
>
> change of meaning of the operation must not occur. saturate to
> those values *and still be a float* (i.e. convert to int, saturate,
> then convert back to float), not a problem.

i meant to saturate float to the range of those int types, not to convert to an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes to 255.0.

> change the meaning
> of the operation to store in an *int*, we can't do that.

yup, not my intention.

>
> but... hmmm... saturating on some of the fp-to-int operations,
> yes not a problem there.

fp-to-int operations would use standard int saturation modes, not fp saturation modes -- the destination isn't a float after all.
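The fp saturation described here clamps a float to a numeric range while the result stays a float. A minimal sketch of that behaviour follows. The mode names in the table (including `unorm` and `snorm` for the 0.0 to 1.0 and -1.0 to 1.0 cases) are hypothetical labels, and NaN handling is deliberately left out because the thread does not discuss it.

```python
# Range table: mode names are hypothetical labels for this sketch only.
FP_SAT_RANGES = {
    "unorm": (0.0, 1.0),
    "snorm": (-1.0, 1.0),
    "u8": (0.0, 255.0),
    "i8": (-128.0, 127.0),
    "u16": (0.0, 65535.0),
    "i16": (-32768.0, 32767.0),
    "u32": (0.0, 4294967295.0),
    "i32": (-2147483648.0, 2147483647.0),
}


def fp_saturate(x, mode):
    """Clamp float x to the range named by mode; the result is still a float.

    No conversion to int takes place, so fractional values inside the
    range are returned unchanged.
    """
    lo, hi = FP_SAT_RANGES[mode]
    return min(max(x, lo), hi)


print(fp_saturate(23.5, "u8"), fp_saturate(300.25, "u8"))
```

Because there is no round trip through an integer, 23.5 keeps its fractional part, which is the distinction being drawn against "convert to int, saturate, then convert back to float".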
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
>
> > it would actually be better i feel to propose these ldst-with-pc-relative
> > as new instructions (within the next EXT2nn group) and let them be
> > Vectorised (orthogonally).
>
> (it already occurred to me to propose LDST-with-shift within the
> new EXT2nn group to be created under
> https://libre-soc.org/openpower/sv/rfc/ls001/ )

if you're working on the ldst-with-shift, please reserve 1 setting for range-checked ld/st with 32/64-bit addresses for webassembly...the range to check against will be stored in user-visible sprs (details to be specified later).

e.g. if ld/st shift is:
| prefix PO 0..5 | ext2nn 6..7 | pcrel R 8 | immhi? idk 13..31 |
| suffix PO 0..5 | rt 6..10 | ra 11..15 | rb 16..20 | shift 21..24 | immlo 25..28 | XO 29..31 |

then shift can be (note ld/st shift can access arrays of structs too, so shift size != ld/st size makes sense):
0 -- rb indexes bytes
1 -- rb indexes half-words (2 bytes)
2 -- rb indexes words (4 bytes)
3 -- rb indexes double words (8 bytes)
4 -- rb indexes quad words (16 bytes)
5 -- rb indexes oct words (32 bytes)
6 -- rb indexes 16-words (64 bytes)
7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)
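A sketch of how shift settings 0..6 above might scale the index register. This assumes the address is formed as RA + (RB << shift), which the comment implies but does not state; the immediate fields are ignored, the function name is hypothetical, and setting 7 (wasm mode) is rejected because its behaviour is still TBD.

```python
MASK64 = (1 << 64) - 1


def shifted_index_addr(ra_value, rb_value, shift):
    """Address for a hypothetical ld/st-with-shift, settings 0..6 only.

    ASSUMPTION: addr = RA + (RB << shift). Immediates are ignored here.
    """
    if not 0 <= shift <= 6:
        # setting 7 is the proposed wasm mode, whose process is TBD
        raise ValueError("shift setting not modelled")
    return (ra_value + (rb_value << shift)) & MASK64


# element 3 of an array of double words (8 bytes each) based at 0x1000
print(hex(shifted_index_addr(0x1000, 3, 3)))
```

Note that the shift only sets the stride of the index: as the comment says, the stride need not match the ld/st size, which is what makes arrays of structs reachable.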
(In reply to Jacob Lifshay from comment #8)

> i meant to saturate float to the range of those int types, not to convert to
> an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> to 255.0.

*click*. using dest elwidth. i like it.
is starting to encroach on the reserved space though.

if nothing else they can be added on top of those MODULO-FP operations.

(In reply to Jacob Lifshay from comment #9)
> if you're working on the ldst-with-shift,

heck no. way too much else to do.

> 7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)

SPRs controlling something as low-level and critical as LDST now requiring *five* 64-bit read ports and two writes? not a snowball in hell's chance that would get through the ISA WG, the IBM POWER Architects would take an extremely dim view of such costly operations.

ldst-with-shift is already pushing the limits.

you really do need to think through the regfile port allocation and Hazard Management implications, jacob, you can't just blithely throw features in without thinking through the micro-architectural implications.
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #8)
>
> > i meant to saturate float to the range of those int types, not to convert to
> > an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> > to 255.0.
>
> *click*. using dest elwidth.

*not* using dest elwid (unless it's repurposed for this and not elwid when saturation is enabled). if you have a f32 you may want to saturate to u16 range with a f32 result...

> i like it.
> is starting to encroach on the reserved space though.
>
> if nothing else they can be added on top of those
> MODULO-FP operations.
>
> (In reply to Jacob Lifshay from comment #9)
> > if you're working on the ldst-with-shift,
>
> heck no. way too much else to do.
>
> > 7 -- wasm mode -- rb indexes bytes -- exact process determined by SPRs (TBD)
>
> SPRs controlling something as low-level and critical as LDST now requiring
> *five* 64-bit read ports and two writes? not a snowball in hell's chance
> that would get through the ISA WG, the IBM POWER Architects would take
> an extremely dim view of such costly operations.

it's fine if they're somewhat more expensive, wasm ld/st isn't a standard ld/st op. they are necessary for wasm performance though, unless you like a 30% performance hit due to constantly needing several more instructions for each ld/st?

>
> ldst-with-shift is already pushing the limits.
>
> you really do need to think through the regfile port allocation and
> Hazard Management implications, jacob,
> you can't just blithely throw features in without thinking through
> the micro-architectural implications.

those sprs can be assumed to rarely change, allowing them to be cached at the ld/st alus and requiring a full pipeline flush each time they're changed -- that way they don't need dependency tracking or extra register read ports or a pile of extra wires into the ld/st alus.
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > (In reply to Jacob Lifshay from comment #8)
> >
> > > i meant to saturate float to the range of those int types, not to convert to
> > > an int and back. e.g. 23.5 saturating to u8's range is 23.5, but 300.25 goes
> > > to 255.0.
> >
> > *click*. using dest elwidth.
>
> *not* using dest elwid (unless it's repurposed for this and not elwid when
> saturation is enabled). if you have a f32 you may want to saturate to u16
> range with a f32 result...

actually, if there's room, adding that to svp64 for saturation (changing dest elwid to instead be range to saturate to when saturation is enabled for fp dests) is imho a good idea! just make sure you can still saturate to 0.0 to 1.0 with fp destination type different than src type -- iirc a/v needs that.
full review needed, answering question:

if sv.op RT.scalar RA.scalar RB.scalar is set to "VL=1" is anything lost?

https://libre-soc.org/openpower/sv/svp64/discussion/
(In reply to Luke Kenneth Casson Leighton from comment #13)
> full review needed, answering question:
>
> if sv.op RT.scalar RA.scalar RB.scalar is set to "VL=1" is anything lost?
>
> https://libre-soc.org/openpower/sv/svp64/discussion/

answer is yes: predication "are any bits set" effect on nonzeroing, has to become "is first bit set" on any "sv.op/pm=xx SCALAR,SCALAR,SCALAR" operation.

this may not be such a great loss compared to being able to drop recurring "setvl VL=1" instructions
additional mode-bit needed for LD/ST-update PI Mode (post-increment).