238 – POWER Compressed Formal Standard writeup

Bug 238 - POWER Compressed Formal Standard writeup

Summary: POWER Compressed Formal Standard writeup

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Alain D D Williams

URL:

Depends on:	532
Blocks:	174
	Show dependency tree / graph

Reported:	2020-03-13 12:53 GMT by Luke Kenneth Casson Leighton
Modified:	2022-10-25 10:55 BST (History)
CC List:	5 users (show)

See Also:	308 213 529 237
NLnet milestone:	NLNet.2019.10.046.Standards
total budget (EUR) for completion of task and all subtasks:	5000
budget (EUR) for this task, excluding subtasks' budget:	5000
parent task for budget allocation:	174
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	dan = {amount=1000, submitted=2022-10-07, paid=2022-10-14} addw = {amount=1000, submitted=2022-08-18, paid=2022-08-25} red = { amount = 2000, submitted = 2022-06-25, paid = 2022-07-21 } [jacob] amount = 1000 submitted = 2022-07-06 paid = 2022-07-21

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2020-03-13 12:53:39 GMT

A formal write-up of a suitable compressed ISA Opcode protocol is needed, to
be proposed to the OpenPOWER Foundation at a later date.

https://libre-soc.org/openpower/sv/16_bit_compressed/

VLE useful reference: http://application-notes.digchip.com/314/314-68105.pdf

http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html

Comment 1 Luke Kenneth Casson Leighton 2020-11-14 22:52:35 GMT

ideas currently are to move "into" 16-bit mode by way of 2 major opcodes,
and to remain there as long as bit 15 is "1".  however when transitioning
"into" 16-bit mode, that leaves 11 bits (actually 10 excluding the "remain
in C mode" bit) which could be used for actual instructions.

by cutting back on functionality (and register range) in 10-bit mode then
extending the register range with the remaining 5 bits it seems to result
in a reasonably comprehensive compressed ISA.

Comment 2 Jacob Lifshay 2020-11-15 02:12:38 GMT

one important note, in floating-point 0 - x does not give the same results as -x

the values you get for 0 - 0 depend on the sign of the two inputs and on the rounding mode.

So, I'd instead suggest combining fneg with fmr.


An additional idea, if we have to use an additional 16-bit instruction to enter 16-bit mode: it's common to have just 1 32-bit (or wider) instruction in a sequence of 16-bit instructions, it might be a good idea to have a limited subset of 16-bit instructions have additional info included to tell the processor to exit 16-bit mode for just 1 instruction, then reenter 16-bit mode without needing to use an enter-16-bit-mode instruction.

Comment 3 Luke Kenneth Casson Leighton 2020-11-15 03:04:54 GMT

(In reply to Jacob Lifshay from comment #2)
> one important note, in floating-point 0 - x does not give the same results
> as -x
> 
> the values you get for 0 - 0 depend on the sign of the two inputs and on the
> rounding mode.
> 
> So, I'd instead suggest combining fneg with fmr.

hm hm fmr explicitly requires 2 registers (src, dest).  in 10-bit mode there's
so little space that the only possible candidate for dropping and replacing
with fmr. would be fmul.

fmul (and fdiv?) could then be made available only in 16-bit mode (not
10-bit)

what do you think of changing fmul to have RA==0b000 to indicate "fmr."
(and leave fsub. to have RA==0b000 to indicate "fneg.")?

i am currently just arbitrarily experimenting, there's no backup studies
to show statistical preference for one combination over another, here


> An additional idea, if we have to use an additional 16-bit instruction to
> enter 16-bit mode: it's common to have just 1 32-bit (or wider) instruction
> in a sequence of 16-bit instructions, it might be a good idea to have a
> limited subset of 16-bit instructions have additional info included to tell
> the processor to exit 16-bit mode for just 1 instruction, then reenter
> 16-bit mode without needing to use an enter-16-bit-mode instruction.

it's... doable.  the instruction groups with a spare bit are:

* FP
* Logical
* Arithmetic
* System sc/rfid

LD i wanted to reserve the 2 spare bits for width, update and other modes.

Comment 4 Jacob Lifshay 2020-11-15 05:58:01 GMT

(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > one important note, in floating-point 0 - x does not give the same results
> > as -x
> > 
> > the values you get for 0 - 0 depend on the sign of the two inputs and on the
> > rounding mode.
> > 
> > So, I'd instead suggest combining fneg with fmr.
> 
> hm hm fmr explicitly requires 2 registers (src, dest).  in 10-bit mode
> there's
> so little space that the only possible candidate for dropping and replacing
> with fmr. would be fmul.

Will require some thought.

> 
> fmul (and fdiv?) could then be made available only in 16-bit mode (not
> 10-bit)
> 
> what do you think of changing fmul to have RA==0b000 to indicate "fmr."
> (and leave fsub. to have RA==0b000 to indicate "fneg.")?
> 
> i am currently just arbitrarily experimenting, there's no backup studies
> to show statistical preference for one combination over another, here

Yeah, we need to at least check instruction statistics on some integer code.

> 
> > An additional idea, if we have to use an additional 16-bit instruction to
> > enter 16-bit mode: it's common to have just 1 32-bit (or wider) instruction
> > in a sequence of 16-bit instructions, it might be a good idea to have a
> > limited subset of 16-bit instructions have additional info included to tell
> > the processor to exit 16-bit mode for just 1 instruction, then reenter
> > 16-bit mode without needing to use an enter-16-bit-mode instruction.
> 
> it's... doable.  the instruction groups with a spare bit are:
> 
> * FP
> * Logical
> * Arithmetic
> * System sc/rfid

What are system instructions doing in 16-bit mode?!! they should be 32-bit only, they are very uncommon (with the possible exception of m[ft]ctr, m[ft]lr and maybe m[ft]cr).

> LD i wanted to reserve the 2 spare bits for width, update and other modes.

it will require some stats, but I think update can be left for 32-bit only.

Some other instructions that should be 32-bit mode only are the less commonly used CR instructions, only a few of them are commonly used. You can probably cram a addi instruction in there instead, those are very common, and it can double as mv.

Comment 5 cand 2020-11-15 07:27:16 GMT

Why have a "stay 16-bit" bit, instead of one instruction for "exit 16-bit mode"? Having 11 vs 10 bits available. Then the overhead of using 16-bit mode is enter + exit instruction, the compiler can calculate if there are at least two 16-bit instructions and so it's worth it.

Comment 6 cand 2020-11-15 07:28:59 GMT

Indeed, if the enter instruction is 16-bit itself, at the cost of more context-sensitiveness, the enter instruction could double as the exit instruction.

Comment 7 Luke Kenneth Casson Leighton 2020-11-15 12:05:20 GMT

(In reply to Jacob Lifshay from comment #4)

> > i am currently just arbitrarily experimenting, there's no backup studies
> > to show statistical preference for one combination over another, here
> 
> Yeah, we need to at least check instruction statistics on some integer code.

i planned ahead, here, and asked on openpower-hdl-cores at the beginning
of the year.  someone very kindly responded.
> > * System sc/rfid
> 
> What are system instructions doing in 16-bit mode?!!

i don't know!  they're in VLE! :)  and as they are "obscure bitfields"
that take up things i can't think of how to use with registers i went,
"mmm let's see".  if however there's a way to encode a register in that
space then yes.  if dropping both sc and rfid then a 1-reg operation
could use that and the rest of bits "8 9 a" to indicate which reg.

> they should be 32-bit
> only, they are very uncommon (with the possible exception of m[ft]ctr,
> m[ft]lr

for branches etc.  hence why i added m[ft]tar as well.

> and maybe m[ft]cr).

m[ft]cr if used for predication will become much more common.
 
> > LD i wanted to reserve the 2 spare bits for width, update and other modes.
> 
> it will require some stats, but I think update can be left for 32-bit only.

ok.

> 
> Some other instructions that should be 32-bit mode only are the less
> commonly used CR instructions, only a few of them are commonly used. 

except if CRs are used for predication, in which case the ones that transfer
them in and out of integers, as well as do some basic operations, become
a much higher priority.

> You can
> probably cram a addi instruction in there instead, those are very common,
> and it can double as mv.

    | 0 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    |     |  RT   | | 010 | RB    | RA!=0 | 0 | 1 | add
    |     | offs  | | 010 | RA    | 0 0 0 | 0 | 1 | addi

very challenging / limited.

* maybe bit 1 can be used too, maybe bit 0
* it will only be active in 16-bit mode not 10-bit mode
* with no room for RT it can't act as a mv because it is either
  "set RA=offs" or it is "add offs to RA".

potentially, 1 bit of "8 9 a" could be used to indicate which of those
is to be done (at the expense of sacrificing the range of RA)

one other option (which i am not hugely keen on) is to - haha - copy
the v3.1B "prefix" mode concept and apply 16-bit immediate prefixes
to the following instruction.  given that this makes it a 32-bit
instruction it's like, um, "then use the 32-bit standard PowerISA variant,
duh" :)

i think this (subconsciously) has been why i haven't added any immediate
ops in at all, with the exception of branch.  i'd really like to try
to jam even a limited variant of branch-conditional in there, however
the encoding will be... complicated.

Comment 8 Luke Kenneth Casson Leighton 2020-11-15 12:24:08 GMT

(In reply to cand from comment #5)
> Why have a "stay 16-bit" bit, instead of one instruction for "exit 16-bit
> mode"?

because doing so punishes (penalises) very short sequences where
16-bit ops are regularly interleaved with 32-bit ones.

> Having 11 vs 10 bits available. Then the overhead of using 16-bit
> mode is enter + exit instruction, the compiler can calculate if there are at
> least two 16-bit instructions and so it's worth it.

one of those will be the "exit 16-bit mode" instruction.  it would need to
be at least 3 16-bit instructions to be worth it.

add r1, r1, r2
mulli r1, r2, 5
or r2, r2, r3
sub. r5, r5, r2
bcctr cr0

as 32-bit instructions these come out to 10 hwords.

with 1 bit saying "enter/exit":

* add -> 16bit (10bit mode with exit bit)
* mulli -> 32bit
* or -> 16bit (10bit mode)
* sub -> 16bit (16bit mode with exit bit)
* bcctr -> 32bit

total hwords: 7 - a 30% saving

with 1 *instruction* saying enter/exit:

* add -> 32bit (can't get out of 16-bit mode so no point going in)
* mulli -> 32bit
* or -> 16bit (11bit mode)
* sub -> 16bit (16bit mode)
         plus another 16bit "exit 16 bit" instruction
* bcctr -> 32bit

total hwords: 9 - only a 10% saving


so the overhead is: 1/16 bits encoding bonus in exchange for a 1-instruction
penalty, my feeling is that it's not a worthwhile price.

however!

what if we had alternative 16-bit encoding "paging" modes?  where some
(potentially entirely new) encodings were swapped in/out?

if those pages could be specified even at the 10-bit level that's where
an overhead of the use of 16 bits becomes worth it, because the chances
of needing context-relevant opcodes (heavy FP computation, or heavy
Video computation) become much higher.

what do you think?

Comment 9 Luke Kenneth Casson Leighton 2020-11-15 15:43:08 GMT

(In reply to Luke Kenneth Casson Leighton from comment #8)

> if those pages could be specified even at the 10-bit level that's where
> an overhead of the use of 16 bits becomes worth it, because the chances
> of needing context-relevant opcodes (heavy FP computation, or heavy
> Video computation) become much higher.

e.g. use the 10-bit mode to swap out the entirety of FP and replace them with YUV2RGB and DCT etc. etc. opcodes, even those needed just to hyper-optimise an inner loop.

(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > hm hm fmr explicitly requires 2 registers (src, dest).  in 10-bit mode
> > there's
> > so little space that the only possible candidate for dropping and replacing
> > with fmr. would be fmul.
> 
> Will require some thought.

in 16-bit mode there's a spare pair of bit (0-1).  one of those - and i stress
that this can only be done in 16-bit mode not 10-bit mode - can be used to
select alternative meanings.

* bit 1=0 - opcode is fmul.  bit 1=0 - opcode is fdiv
* bit 1=0 - opcode is fabs.  bit 1=0 - opcode is fmr

(In reply to Jacob Lifshay from comment #2)

> An additional idea, if we have to use an additional 16-bit instruction to
> enter 16-bit mode: it's common to have just 1 32-bit (or wider) instruction
> in a sequence of 16-bit instructions, it might be a good idea to have a
> limited subset of 16-bit instructions have additional info included to tell
> the processor to exit 16-bit mode for just 1 instruction, then reenter
> 16-bit mode without needing to use an enter-16-bit-mode instruction.

in 16-bit mode, bit 15 plus bit 0 can be used to indicate this change-over.
however fascinatingly there are still some combinations left:

* 0:f = 0b01: stay in 16-bit mode
* 0:f = 0b00: leave 16-bit mode permanently
* 0:f = 0b10: leave 16-bit mode for one cycle
* 0:f = 0b11: free to use for something completely different.

perhaps that could be used to redefine the [entire] opcode space to
"immediate" variants?  addi, addis, mulli, ori etc.

or, another alternative: add one extra bit to all registers, switching
to a high bank (r8-r15)

my feeling is that immediate-variants of a few select instructions
(including branch-conditional) would have a higher bang-per-buck

the only thing to watch out for here: this is getting pretty complex.

Comment 10 cand 2020-11-15 17:39:13 GMT

I feel paging would be too inflexible, unless it was configurable at runtime. And that is both plain crazy, and interesting given LTO. Gcc could link-time-optimize the whole program, figure out it uses less than 1024 instructions, and create a custom page loaded at startup.

Comment 11 Luke Kenneth Casson Leighton 2020-11-15 19:15:13 GMT

(In reply to cand from comment #10)
> I feel paging would be too inflexible, unless it was configurable at
> runtime. And that is both plain crazy, and interesting given LTO. Gcc could
> link-time-optimize the whole program, figure out it uses less than 1024
> instructions, and create a custom page loaded at startup.

that was, from what i understand, partly the intent of transmeta.

the compromise is instead to analyse certain computationally-intensive
workloads.  yet with different tasks if we do _not_ have banks optimised
for specific tasks and workloads, *all* tasks and workloads are penalised
as a result.

that in turn loses us the opportunity to reduce cache size or usage,
which in turn pushes up power consumption.

Western Digital showed that even a small change of adding the most
commonly-used instructions that were used for *their* workload (HDD/SSD
workload), they got a huge extra compression ratio in program size.

that was only 4-5 instructions that resulted in something like a massive
10% program size reduction over and above what RVC already provides.

Comment 12 Jacob Lifshay 2020-11-15 19:53:52 GMT

(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #4)
> 
> > > i am currently just arbitrarily experimenting, there's no backup studies
> > > to show statistical preference for one combination over another, here
> > 
> > Yeah, we need to at least check instruction statistics on some integer code.
> 
> i planned ahead, here, and asked on openpower-hdl-cores at the beginning
> of the year.  someone very kindly responded.
> > > * System sc/rfid
> > 
> > What are system instructions doing in 16-bit mode?!!
> 
> i don't know!  they're in VLE! :)

I haven't looked at the spec for VLE, but if it's like ARM's Thumb1 ISA where 32-bit instructions aren't supported, then you do need sc. We are planning on an ISA more like RISC-V's C extension where all it does is add shorter forms of already existing instructions and the longer instructions are still supported, so we don't have to support the less commonly used instructions, unlike Thumb1. 

>  and as they are "obscure bitfields"
> that take up things i can't think of how to use with registers i went,
> "mmm let's see".  if however there's a way to encode a register in that
> space then yes.  if dropping both sc and rfid then a 1-reg operation
> could use that and the rest of bits "8 9 a" to indicate which reg.
> 
> > they should be 32-bit
> > only, they are very uncommon (with the possible exception of m[ft]ctr,
> > m[ft]lr
> 
> for branches etc.  hence why i added m[ft]tar as well.

Since tar is only usable from privileged mode, it will make m[ft]tar and btar into very uncommon instructions, since compilers most likely won't ever use them unless you explicitly ask for it using either intrinsics or assembly.

> 
> > and maybe m[ft]cr).
> 
> m[ft]cr if used for predication will become much more common.

Even if we use CR regs for predication, I think using instructions that have only 1 bit per element in integer regs instead of the 4 bits that mfcr uses will be much more commonly used. Similarly for mtcr.

> > > LD i wanted to reserve the 2 spare bits for width, update and other modes.
> > 
> > it will require some stats, but I think update can be left for 32-bit only.
> 
> ok.
> 
> > 
> > Some other instructions that should be 32-bit mode only are the less
> > commonly used CR instructions, only a few of them are commonly used. 
> 
> except if CRs are used for predication, in which case the ones that transfer
> them in and out of integers, as well as do some basic operations, become
> a much higher priority.

Hence why you support just and, or, andc, and maybe xor, since you rarely need most of the others, even if using them for predication. if you need a less common op, use a 32-bit instruction, they're not that expensive.
> 
> > You can
> > probably cram a addi instruction in there instead, those are very common,
> > and it can double as mv.
> 
>     | 0 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
>     |     |  RT   | | 010 | RB    | RA!=0 | 0 | 1 | add
>     |     | offs  | | 010 | RA    | 0 0 0 | 0 | 1 | addi

Works better if you have nearly all 16/10-bit instructions use one register for both source and destination (like x86 and RVC) instead of using an extra 3 bits to encode the third register.

addi is common enough that you'd want to dedicate some extra encoding space to it.

Some stats:
generated using (on ppc64le):
objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
    s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
    sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
    s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
    sort -n | less

...
    25 ldu r1,1(r1)
...
    207 stdu r1,1(r1)
    225 std r0,1(r1)
    284 mflr r0
    285 add r1,r1,r1
    304 lbz r1,1(r1)
    326 mtlr r0
    327 ld r0,1(r1)
    343 blr
    406 lwa r1,1(r1)
    466 extsw r1,r1
    649 stw r1,1(r1)
    691 lwz r1,1(r1)
    705 cmpdi r1,1
    791 cmpwi r1,1
    794 addis r1,r1,1
   1474 std r1,1(r1)
   1846 li r1,1
   2031 mr r1,r1
   2473 addi r1,r1,1
   3012 nop
   3028 ld r1,1(r1)

Comment 13 Luke Kenneth Casson Leighton 2020-11-15 20:22:15 GMT

(In reply to Jacob Lifshay from comment #12)
> (In reply to Luke Kenneth Casson Leighton from comment #7)
> > (In reply to Jacob Lifshay from comment #4)
> > 
> > > > i am currently just arbitrarily experimenting, there's no backup studies
> > > > to show statistical preference for one combination over another, here
> > > 
> > > Yeah, we need to at least check instruction statistics on some integer code.
> > 
> > i planned ahead, here, and asked on openpower-hdl-cores at the beginning
> > of the year.  someone very kindly responded.
> > > > * System sc/rfid
> > > 
> > > What are system instructions doing in 16-bit mode?!!
> > 
> > i don't know!  they're in VLE! :)
> 
> I haven't looked at the spec for VLE, but if it's like ARM's Thumb1 ISA
> where 32-bit instructions aren't supported, then you do need sc.

i read it again last night and it seems it does sit "on top" of what was v2.06 BE 32 bit ISA at the time.

ie it is the PowerISA direct equivalent of RVC.  the only weirdness is that some entirely new 32 bit encodings (extensions of 16 bit to give slightly longer immediates) were added.


> We are
> planning on an ISA more like RISC-V's C extension where all it does is add
> shorter forms of already existing instructions 

yes.  the only extra one would be bank-switching.


> > for branches etc.  hence why i added m[ft]tar as well.
> 
> Since tar is only usable from privileged mode, it will make m[ft]tar and
> btar into very uncommon instructions, 

really? ha! then out they go.  it may be possible to drop in bc after all.  have to check.


> > m[ft]cr if used for predication will become much more common.
> 
> Even if we use CR regs for predication, I think using instructions that have
> only 1 bit per element in integer regs instead of the 4 bits that mfcr uses
> will be much more commonly used. Similarly for mtcr.

this is best discussed on the predication sv issue, and we can come back here after.  context/reminder: PowerISA is fundamentally designed to make data-dependent decisions based around CRs.
 

> Hence why you support just and, or, andc, and maybe xor, since you rarely
> need most of the others, even if using them for predication. if you need a
> less common op, use a 32-bit instruction, they're not that expensive.

again, let's transfer this to the sv predication issue.

> > 
> > > You can
> > > probably cram a addi instruction in there instead, those are very common,
> > > and it can double as mv.
> > 
> >     | 0 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
> >     |     |  RT   | | 010 | RB    | RA!=0 | 0 | 1 | add
> >     |     | offs  | | 010 | RA    | 0 0 0 | 0 | 1 | addi
> 
> Works better if you have nearly all 16/10-bit instructions use one register
> for both source and destination (like x86 and RVC) instead of using an extra
> 3 bits to encode the third register.

yes.  this is in the 10bit mode.  i couldn't think what to do with the 16 bits so went "err let's add RT".

now i am just realising the 3 bits could be used for size or rounding modes, particularly for FP and *maybe* LD. and cmp it looks useful.

> addi is common enough that you'd want to dedicate some extra encoding space
> to it.

i added some for 16bit mode.
 
> Some stats:
> generated using (on ppc64le):
> objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
>     s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
>     sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
>     s/\([ ,]\)-*[0-9]\+\([^0-9]\)

niice.

Comment 14 Luke Kenneth Casson Leighton 2020-11-16 01:01:16 GMT

whilst a case for using the spare bits 2-4 as FP mode bits (single/double, rounding) is fairly clear, for INT it is less so

a trick:

| N |   | RT!=0 | | 011 | RB    | RA!=0 | 0 | M | sub.
| N | 0 | 000   | | 011 | RB    | RA!=0 | 0 | M | cmpw
| N | 1 | 000   | | 011 | RB    | RA!=0 | 0 | M | cmpl

by testing RT nonzero as the destination this leaves the actual subtraction (which is exactly what is used for cmp) to be tested for cmp but not stored, exactly as with cmp, when RT=0b000

additionally one of the downsides of only 2 regs for operations is that it is necessary to use a lot of mv operations to substitute for the lack.

with mv being 16bit and the op itself being 16bit there is no point, you might as well use the 32bit version which can do dest,src1,src2

which means that having "dest, src1, src2" gives an advantage as it saves 16 bits.

Comment 15 Luke Kenneth Casson Leighton 2020-11-16 01:06:00 GMT

(In reply to Jacob Lifshay from comment #12)

> ...
>     25 ldu r1,1(r1)
> ...
>     207 stdu r1,1(r1)
>     225 std r0,1(r1)
>     284 mflr r0
>     285 add r1,r1,r1
>     304 lbz r1,1(r1)
>     326 mtlr r0
>     327 ld r0,1(r1)
>     343 blr
>     406 lwa r1,1(r1)
>     466 extsw r1,r1
>     649 stw r1,1(r1)
>     691 lwz r1,1(r1)
>     705 cmpdi r1,1
>     791 cmpwi r1,1
>     794 addis r1,r1,1
>    1474 std r1,1(r1)
>    1846 li r1,1
>    2031 mr r1,r1
>    2473 addi r1,r1,1
>    3012 nop
>    3028 ld r1,1(r1)

this is really handy.

ld immediate clearly at the top.  li is a compound pseudo op, also at the top.  mv, addi/s, make sense.  extsw not so much.

cmp immediate i wasn't expecting.  will add that.

 lots involving link register and blr, clearly that is for function calls.

Comment 16 Jacob Lifshay 2020-11-16 01:19:10 GMT

(In reply to Luke Kenneth Casson Leighton from comment #14)
> whilst a case for using the spare bits 2-4 as FP mode bits (single/double,
> rounding) is fairly clear, for INT it is less so

I'd argue for a bit for f32/f64 and a bit for fp->int trunc vs. round, but otherwise rounding mode should always be dynamic and not encoded -- use 32-bit if you need something else, since that is what's used in >99% of code.

The left over bits can be used for encoding other more important instructions.

> a trick:
> 
> | N |   | RT!=0 | | 011 | RB    | RA!=0 | 0 | M | sub.
> | N | 0 | 000   | | 011 | RB    | RA!=0 | 0 | M | cmpw
> | N | 1 | 000   | | 011 | RB    | RA!=0 | 0 | M | cmpl
> 
> by testing RT nonzero as the destination this leaves the actual subtraction
> (which is exactly what is used for cmp) to be tested for cmp but not stored,
> exactly as with cmp, when RT=0b000
> 
> additionally one of the downsides of only 2 regs for operations is that it
> is necessary to use a lot of mv operations to substitute for the lack.

compilers can get the number of necessary mv instructions waay down, they have lots of experience from x86.

> with mv being 16bit and the op itself being 16bit there is no point, you
> might as well use the 32bit version which can do dest,src1,src2
> 
> which means that having "dest, src1, src2" gives an advantage as it saves 16
> bits.

except that using those bits instead to encode more immediate bits for ld, addi, st most likely gives more advantage.

Comment 17 Jacob Lifshay 2020-11-16 01:25:58 GMT

(In reply to Luke Kenneth Casson Leighton from comment #15)
> this is really handy.

thx

> ld immediate clearly at the top.  li is a compound pseudo op, also at the
> top.  mv, addi/s, make sense.  extsw not so much.

extsw is required for passing a signed 32-bit int according to the Power ABI, most instructions don't sign-extend their result (they often leave the upper 32-bits undefined) so extsw is required.

Comment 18 Luke Kenneth Casson Leighton 2020-11-16 01:47:41 GMT

(In reply to Jacob Lifshay from comment #16)
> (In reply to Luke Kenneth Casson Leighton from comment #14)
> > whilst a case for using the spare bits 2-4 as FP mode bits (single/double,
> > rounding) is fairly clear, for INT it is less so
> 
> I'd argue for a bit for f32/f64 and a bit for fp->int trunc vs. round, but
> otherwise rounding mode should always be dynamic and not encoded -- use
> 32-bit if you need something else, since that is what's used in >99% of code.

ok do you want to have a go at reencoding the FP section? i have not studied it nearly as much as the INT v3.0B so you would likely do a better job than me anyway.

plus it's 130am here now

> The left over bits can be used for encoding other more important
> instructions.

agreed.


> compilers can get the number of necessary mv instructions waay down, they
> have lots of experience from x86.

mm... mumble but not entirely convinced
 
> > with mv being 16bit and the op itself being 16bit there is no point, you
> > might as well use the 32bit version which can do dest,src1,src2
> > 
> > which means that having "dest, src1, src2" gives an advantage as it saves 16
> > bits.
> 
> except that using those bits instead to encode more immediate bits for ld,
> addi, st most likely gives more advantage.

ah i mean in the arithmetic section (the trick of cmp applies/applied) which has no immediates.

if you mean right at the top, the ldi, then at the cost of all ldi/sti being this:

   ldi r5, r5(20) # or something like it

then yes it would be possible to get another 3 bits on the immediates... for ldi/sti

or use one bit to indicate lwi/stwi

another way to extend the range of immediates: align them.  ldi must be 8 byte aligned, lwi must be word aligned.
not unreasonable.

however if you meant in the brownfield arithmetic encoding use bits 2-4 to extend ldi/sti nooo the encoding is already complex.

Comment 19 Luke Kenneth Casson Leighton 2020-11-16 01:48:59 GMT

(In reply to Jacob Lifshay from comment #17)

> extsw is required for passing a signed 32-bit int according to the Power
> ABI, most instructions don't sign-extend their result (they often leave the
> upper 32-bits undefined) so extsw is required.

yuk.  ok that explains it.  i added extsb for no good reason.

Comment 20 Jacob Lifshay 2020-11-16 03:24:08 GMT

(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Jacob Lifshay from comment #17)
> 
> > extsw is required for passing a signed 32-bit int according to the Power
> > ABI, most instructions don't sign-extend their result (they often leave the
> > upper 32-bits undefined) so extsw is required.
> 
> yuk.  ok that explains it.  i added extsb for no good reason.

signed 8 and 16-bit integers have the same issue. unsigned integers also require zero-extension. maybe we should add all the sign/zero extension instructions?

See: https://gcc.godbolt.org/z/9n48Gh

Comment 21 Jacob Lifshay 2020-11-16 03:34:00 GMT

(In reply to Luke Kenneth Casson Leighton from comment #18)
> ok do you want to have a go at reencoding the FP section? i have not studied
> it nearly as much as the INT v3.0B so you would likely do a better job than
> me anyway.

Sure, though I'll probably put that off till tomorrow.

> > except that using those bits instead to encode more immediate bits for ld,
> > addi, st most likely gives more advantage.
> 
> ah i mean in the arithmetic section (the trick of cmp applies/applied) which
> has no immediates.
> 
> if you mean right at the top, the ldi, then at the cost of all ldi/sti being
> this:

I mean we could have the particularly common instructions take up 2 opcode slots, using 1 of the opcode bits for an immediate bit.

We could then use the free space from the other instruction as a second opcode field to allow encoding the instruction moved to make room for the two opcode slots.

Or something like that.

If we need to, we could have load instructions have 1 more bit of immediate than store instructions, since load instructions are more common.

> another way to extend the range of immediates: align them.  ldi must be 8
> byte aligned, lwi must be word aligned.
> not unreasonable.

Exactly what RVC does :)

Comment 22 Luke Kenneth Casson Leighton 2020-11-16 14:21:10 GMT

(In reply to Jacob Lifshay from comment #21)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > ok do you want to have a go at reencoding the FP section? i have not studied
> > it nearly as much as the INT v3.0B so you would likely do a better job than
> > me anyway.
> 
> Sure, though I'll probably put that off till tomorrow.

no problem at all.  some gotchas:

* i've re-orged the tables so that bits 567.8 are now major.minor
* major 0b011 is designated "sub" across both Arith and FP
  - 0b011.0 is arith sub./neg./cmp (producing CR0 unconditionally)
  - 0b011.1 is FP fsub./fneg.      (producing CR1 unconditionally)
* the 0b110 major opcode is entirely dedicated to FP

> > if you mean right at the top, the ldi, then at the cost of all ldi/sti being
> > this:
> 
> I mean we could have the particularly common instructions take up 2 opcode
> slots, using 1 of the opcode bits for an immediate bit.

space is so ridiculously tight that several compromises are taken.  2
opcode slots means "dropping mul or neg" for example
 
> We could then use the free space from the other instruction as a second
> opcode field to allow encoding the instruction moved to make room for the
> two opcode slots.
> 
> Or something like that.

i see where that's going: where they'd have RT=RA (add RT,RT,RA format)

if it wasn't for the 10-bit => 16-bit thing i'd say "great", straight
away.  however the way i see it is that if trying to jam immediates
into 10-bit the pressure on the major opcode space is so great that
i prioritised non-immediates with the exception of branch (and even
there, nop and illeg are encroaching on that)

then, only the 16-bit immediate modes kick in when bits 0 and f are 0b11
which... looking at the latest version bit 8 (minor opcode) i *think*
might aslso be free to allocate to offset... it is!

oo exciting there's a free major immediate-qualified opcode, ooo!
any suggestions?  extend addi? add sti / fsti...

> If we need to, we could have load instructions have 1 more bit of immediate
> than store instructions, since load instructions are more common.

... ah yeah, ok, i added sti / fstwi back in

> > another way to extend the range of immediates: align them.  ldi must be 8
> > byte aligned, lwi must be word aligned.
> > not unreasonable.
> 
> Exactly what RVC does :)

*sotto voice* where i got the idea from

Comment 23 Luke Kenneth Casson Leighton 2020-11-16 15:18:41 GMT

| 0 | 1  | 2 3 4 | | 567.8 | 9ab  | c d e | f |
| 1 | o2 |  RT   | | 010.0 | RB|0 | offs  | 1 | addi.
| 1 | o2 |  RT   | | 010.1 | RB|0 | offs  | 1 | addis.

these i'm reluctant to go entirely immediate in bits 2-4 because when
RB=0 it encodes "li RT, #imm" even though the immediate is only in
the range 0-15 for addi and -8 to 7 for addis.

(actually a better encoding there would be just addis, why has nobody
noticed a massive overlap between addis and addi before in OpenPOWER
ISA v3.0B????)

for each it miiight be ok to use 1 more bit (range -16 to +15) by reducing
RT down to 2 bits (regs r0-r3)

Comment 24 Luke Kenneth Casson Leighton 2020-11-16 17:08:40 GMT

oo i just had an idea.  in the immediates (aadi, addis, cmpwi, cmpdi) allow RT in the 16 bit mode *but*: when RT==RA shift the immediate by RT.

with RA/RT being 3 bit this will allow access to hword/word/dword-aligned constants and more.  the encoding is similar to FP, in that when RT==RA it becomes an exponent.

interestingly with RT==RA==0 covering all other cases of RT==0, RA!=0, we have a brownfield encoding opportunity.

i am reluctant to suggest using POSIT encoding here sigh.

Comment 25 Luke Kenneth Casson Leighton 2020-11-17 21:52:46 GMT

oooo i just had an awesome idea.  one of the problems we have with trying to use an existing scalar ISA for GPU is: 32 bit opcodes just don't fit.  zero space for swizzles etc.

i just realised that after compression, SV-C48 or better SV-C64 leaves enough space to add 12 bits for full swizzle selection.

how did that work again? a mask is needed to say which 4 from a vec4 are to be selected (0110 says YZ to be used out of XYZW) then 2-2-2-2 are the indices to specify the order?

that's 8 per reg though.  which means 4+8+8 bits needed to specify mask plus src1-shuffle plus src2-shuffle.

however at least 12 bits would apply to a mv.

or, is it normal to apply the same swizzle to src1 and src2?

Comment 26 cand 2020-11-18 07:47:06 GMT

Same swizzle for both srcs would be a gather operation from similar sources, aka the majority of cases won't be that (unless you count "no swizzle" as a swizzle, no swizzle being the most common case). Usually you swizzle because things are packed, maybe this texture is A8 while the other is RG16.

Comment 27 cand 2020-11-18 07:50:12 GMT

Why would you need 8 bits to specify the order of four elements though? 4! = 24, which is representable in 5 bits. This does need a lookup table in hw, but surely those are fast and cheap at that size.

Comment 28 Jacob Lifshay 2020-11-18 10:17:23 GMT

(In reply to cand from comment #27)
> Why would you need 8 bits to specify the order of four elements though? 4! =
> 24, which is representable in 5 bits. This does need a lookup table in hw,
> but surely those are fast and cheap at that size.

because it's not just specifying order, you can have swizzles that duplicate some input elements and skip other elements.

Comment 29 Jacob Lifshay 2020-11-18 10:26:02 GMT

(In reply to Luke Kenneth Casson Leighton from comment #23)
> | 0 | 1  | 2 3 4 | | 567.8 | 9ab  | c d e | f |
> | 1 | o2 |  RT   | | 010.0 | RB|0 | offs  | 1 | addi.
> | 1 | o2 |  RT   | | 010.1 | RB|0 | offs  | 1 | addis.
> 
> these i'm reluctant to go entirely immediate in bits 2-4 because when
> RB=0 it encodes "li RT, #imm" even though the immediate is only in
> the range 0-15 for addi and -8 to 7 for addis.

I'd recommend not supporting addis for 16-bit mode since most of the time addis is part of a addi addis sequence to add a 32-bit constant (usually some variable/function's address). That constant would almost always not fit in whatever reduced-range immediate you have available in 16-bit mode.

> (actually a better encoding there would be just addis, why has nobody
> noticed a massive overlap between addis and addi before in OpenPOWER
> ISA v3.0B????)

addis adds bits right above the bits you can specify with standard 32-bit addi. together they allow you to add an arbitrary 32-bit constant.

This is kinda like lui/addi on RISC-V.

Comment 30 cand 2020-11-18 10:32:35 GMT

> because it's not just specifying order, you can have swizzles that duplicate 
> some input elements and skip other elements.

I know. That's what the 4 bits are for. If the 4 bits happen to be 0110, aka the used values are yz, then you use the lookup table for yz's orderings. The size of the lookup table may not be exactly 24 for cases where less than four elements are present.

Comment 31 Luke Kenneth Casson Leighton 2020-11-18 11:02:26 GMT

(In reply to Jacob Lifshay from comment #28)
> (In reply to cand from comment #27)
> > Why would you need 8 bits to specify the order of four elements though? 4! =
> > 24, which is representable in 5 bits. This does need a lookup table in hw,
> > but surely those are fast and cheap at that size.
> 
> because it's not just specifying order, you can have swizzles that duplicate
> some input elements and skip other elements.

mask: 0bXYZW  select: vec4[0] 0bXX vec4[1] 0bYY vec4[2] 0bZZ vec4[3] 0bWW

total of 12 bits... *per vec4* src1/src2/dest (unfortunately)

the mask is actually predicate bits, applied to sub-elements (SUBVL),
hypothetically we could add this capability to SV, thus only needing
8 bits for selecting from the vec4, but normally it's
encoded into a constant/immediate/field in the instruction.

if only a permutation was needed the encoding could be done in less bits
but it is reasonable to have e.g. a vec4 select src1=all-X (4 copies of X) applied all to a vec4 src2=XYZW

Comment 32 Luke Kenneth Casson Leighton 2020-11-18 11:07:44 GMT

(In reply to Jacob Lifshay from comment #29)
> (In reply to Luke Kenneth Casson Leighton from comment #23)
> > | 0 | 1  | 2 3 4 | | 567.8 | 9ab  | c d e | f |
> > | 1 | o2 |  RT   | | 010.0 | RB|0 | offs  | 1 | addi.
> > | 1 | o2 |  RT   | | 010.1 | RB|0 | offs  | 1 | addis.
> > 
> > these i'm reluctant to go entirely immediate in bits 2-4 because when
> > RB=0 it encodes "li RT, #imm" even though the immediate is only in
> > the range 0-15 for addi and -8 to 7 for addis.
> 
> I'd recommend not supporting addis for 16-bit mode since most of the time
> addis is part of a addi addis sequence to add a 32-bit constant (usually
> some variable/function's address). That constant would almost always not fit
> in whatever reduced-range immediate you have available in 16-bit mode.

i added a note that the idea is, a 2nd opcode is dedicated to a reduced-shift
range *variant* of addis, which maps into *addi*.  what do you think?


> > (actually a better encoding there would be just addis, why has nobody
> > noticed a massive overlap between addis and addi before in OpenPOWER
> > ISA v3.0B????)
> 
> addis adds bits right above the bits you can specify with standard 32-bit
> addi. together they allow you to add an arbitrary 32-bit constant.
> 
> This is kinda like lui/addi on RISC-V.

yeah there's a sequence for getting 64-bit immediates without using LD
in PowerISA, it involves *six* instructions.

https://github.com/antonblanchard/microwatt/blob/master/hello_world/head.S#L34

/* Load an immediate 64-bit value into a register */
#define LOAD_IMM64(r, e)			\
	lis     r,(e)@highest;			\
	ori     r,r,(e)@higher;			\
	rldicr  r,r, 32, 31;			\
	oris    r,r, (e)@h;			\
	ori     r,r, (e)@l;

this is one of the reasons i'd like to add that Data-Pointer concept
(the above would be one instruction using DP)

Comment 33 Luke Kenneth Casson Leighton 2020-11-18 12:06:43 GMT

(In reply to Jacob Lifshay from comment #21)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > ok do you want to have a go at reencoding the FP section? i have not studied
> > it nearly as much as the INT v3.0B so you would likely do a better job than
> > me anyway.
> 
> Sure, though I'll probably put that off till tomorrow.

 
btw there's a whole boatload of 16bit Cmaj.m=0b001.1 encodings free where bits 4, a and b are unused and cde still available for a single reg.

that could easily be allocated to int conversion... *if* the unusual step is taken of setting FRT==RA (or RT==FRA).  otherwise use only 4 bits for regs

| 0123 | 4 | | 567.8 | 9 ab | c de | f |
| 0010 | X | | 001.1 | 0 RA | Y RT | M |
| 0011 | X | | 001.1 | 0 RA | Y RT | M |

2x by using bit3. 2x by using bit 4 (X)
2x by using bit c (Y)

that gives 8 opcodes to do UINT/INT/FP
conversion.

Comment 34 cand 2020-11-18 13:25:52 GMT

I'm still missing why permutations would be ruled out. Even a pure permutation array covering all possibilities would be 4^4 = 256 values = 8 bits. It can represent XXXX, XYZW, all.

Comment 35 Luke Kenneth Casson Leighton 2020-11-18 13:50:30 GMT

(In reply to cand from comment #34)
> I'm still missing why permutations would be ruled out. Even a pure
> permutation array covering all possibilities would be 4^4 = 256 values = 8
> bits. It can represent XXXX, XYZW, all.

a permutation is defined mathematically as mutually exclusive moving, no copying.  4-factorial i.e 4x3x2x1.  that is the definition of a permutation and you only need 4 bits to specify a 4-permutation (12 options).

* X goes in 1 of 4 places
* that leaves 3 to choose from for Y
* that leaves 2 for Z
* W goes in the remaining slot.

YZWX. WYXZ.

except... we actually need repetitions.

XXZY. ZWWW.

that is not called a "permutation".  it is 4 separate and distinct indexed *selections* and for that we need 8 bits.

it would be really nice to only need 4 bits!

as it is we can probably get away with using MV.SWIZZLE by having the swizzle be part of prefix and place that on a standard C.mv op.

Comment 36 cand 2020-11-18 14:43:31 GMT

I guess we're talking past each other. I'm saying using a lookup table lets you save bits. Instead of 4+8 per vec4, you have 8.

Comment 37 Luke Kenneth Casson Leighton 2020-11-18 18:07:01 GMT

just checking even the basic default examole square() with godbolt.org shows that everything is covered, with the exceotion that reg 31 appears to be used by convention as a Stack Pointer (SP).

to cover that possibility it would be good to remap Compressed RA=0b111 to RA=0b11111 at least for LD/ST, and also to make sure that the commonly used regs from ppc64 ABI are smack in the reduced range.

Comment 38 Luke Kenneth Casson Leighton 2020-11-18 23:17:19 GMT

jacob spotted during the meeting today that it is r1 that is used as the SP by convention, not r31.

we therefore need at the minimum the option to select 2 regs for RA/RT on LD/ST imm.

we also want to maximise imm length and given that the current sradi/srawi is *already* using sub-sub-encoding taking up bits 1-4 it actually doesn't help with LD/ST as much as it would to instead steal space from e.g. cmpdi/cmpwi.

if we stole space from cmpi we would get 2x or 4x the amount of benefit than if we tried to steal from sradi/srawi because those share a sub-sub-encoding where the cmpis have one sub-encoding *each*.

i will mull this over however please do feel free to try some shuffles yourselves.

do watch out for the priority on the number of occurrences of each of the imm instructions, from the analysis of /bin/bash that jacob did

edit/update reminder

    466 extsw r1,r1
    649 stw r1,1(r1)
    691 lwz r1,1(r1)
    705 cmpdi r1,1
    791 cmpwi r1,1
    794 addis r1,r1,1
   1474 std r1,1(r1)
   1846 li r1,1
   2031 mr r1,r1
   2473 addi r1,r1,1
   3012 nop
   3028 ld r1,1(r1)

one of these will be to SP however cmpdi/cmpwi still have really quite high counts.  nothing we can really do about addis, we are not going to do the same trick as VLE (new immediate encodings).  just use 32bit addis.

Comment 39 Jacob Lifshay 2020-11-19 02:37:53 GMT

note that my analysis on bash doesn't cover branches since I don't canonicalize branch targets.

Comment 40 Luke Kenneth Casson Leighton 2020-11-19 03:09:20 GMT

(In reply to Jacob Lifshay from comment #39)
> note that my analysis on bash doesn't cover branches since I don't
> canonicalize branch targets.

i would say "understood" on that but honestly the words "canonicalise" and "branch targets" are a blank to me :)

could you elaborate on what is involved?

Comment 41 Luke Kenneth Casson Leighton 2020-11-19 03:17:10 GMT

i had a go at creating some ld/st r1 imm opcodes.  the imm range is reduced by 1 bit so as to fit the 4x combination of ld/st + d/w into only the 1 Cmaj opcode.

i took out addis and crammed cmpw/d into one Cmaj.m to get the full Cmaj to do so.

i have no idea what impact, statistically,  that will have on C.cmp, it would be necessary to see some of the values of immediates used.

however, coming back to ld/st r1: the immediate range is 0-127 even with 1 bit shared, which for doubles (ldi, stdi) is x8 so 0 to (1024-8) and for words (lwi, stwi) is x4 so 0 to (512-4)

which given it's primarily intended for stack usage seems pretty damn enormous if you ask me :)

Comment 42 Jacob Lifshay 2020-11-19 03:31:19 GMT

(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #39)
> > note that my analysis on bash doesn't cover branches since I don't
> > canonicalize branch targets.
> 
> i would say "understood" on that but honestly the words "canonicalise" and
> "branch targets" are a blank to me :)
> 
> could you elaborate on what is involved?

basically, in the disassembly from objdump, branches look approximately like this:
bl 27F44 @foo
bl 45A10 @bar

canonicalization just means to convert them all to the exact same canonical text so `sort` will correctly group them together and count them as multiple occurrences of the same instruction instead of as all different instructions:
bl @addr
bl @addr

branch targets here just means the @addr text.

Comment 43 Jacob Lifshay 2020-11-19 03:44:48 GMT

(In reply to Luke Kenneth Casson Leighton from comment #41)
> i had a go at creating some ld/st r1 imm opcodes.  the imm range is reduced
> by 1 bit so as to fit the 4x combination of ld/st + d/w into only the 1 Cmaj
> opcode.
> 
> i took out addis and crammed cmpw/d into one Cmaj.m to get the full Cmaj to
> do so.
> 
> i have no idea what impact, statistically,  that will have on C.cmp, it
> would be necessary to see some of the values of immediates used.
> 
> however, coming back to ld/st r1: the immediate range is 0-127 even with 1
> bit shared, which for doubles (ldi, stdi) is x8 so 0 to (1024-8) and for
> words (lwi, stwi) is x4 so 0 to (512-4)
> 
> which given it's primarily intended for stack usage seems pretty damn
> enormous if you ask me :)

for stack ld/st, we want the immediates to have a range down to -256 bytes but no farther, since leaf functions all store their local variables below the stack pointer, since iirc the ABI has a 256 byte range (the red zone) where signal/exception handlers aren't allowed to modify those bytes since they are dedicated to the currently executing function.

I think we should just have the decoder hardware subtract 256 from the immediate after shifting.

so:
ld r10, 0x10(r1)
would be encoded using immediate (0x10 + 256) >> 3 == 0x22
and
ld r10, -0x58(r1)
would have immediate (-0x58 + 256) >> 3 == 0x15

Comment 44 Luke Kenneth Casson Leighton 2020-11-19 03:53:12 GMT

(In reply to Jacob Lifshay from comment #43)


> ld r10, 0x10(r1)
> would be encoded using immediate (0x10 + 256) >> 3 == 0x22
> and
> ld r10, -0x58(r1)
> would have immediate (-0x58 + 256) >> 3 == 0x15

took me a while to realise you inverted the calculation, there.  the same startpoint would work for lw/stw except >>2

Comment 45 Luke Kenneth Casson Leighton 2020-11-19 04:01:23 GMT

(In reply to Jacob Lifshay from comment #42)

> basically, in the disassembly from objdump, branches look approximately like
> this:
> bl 27F44 @foo
> bl 45A10 @bar
> 
> canonicalization just means to convert them all to the exact same canonical
> text

ah ok.  there may be an option to objdump that helps with that.  set the machine target to raw.  i was playing with objdump 6 months ago, i know it can be done.

Comment 46 Luke Kenneth Casson Leighton 2020-11-21 01:18:25 GMT

https://godbolt.org/z/enTPes

hmmm this short example seems to be using an awful lot of r1 *and* r31 LD/ST

Comment 47 Luke Kenneth Casson Leighton 2020-11-21 10:33:30 GMT

discussion moved to bugtracker
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001229.html

On 11/21/20, Alexandre Oliva <oliva@gnu.org> wrote:
> On Nov 20, 2020, Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote:
>
>> jumping straight into practical matters: we're in the middle of
>> getting SimpleV redone on top of OpenPOWER, and a first step there is
>> Compressed Instructions.
>
> So, if I understand what I read in bug 238, we are looking into
> introducing an extension to PowerISA for code compactness, a little like
> Thumb vs ARM,

more accurately, like RISCV RVC where there is only one instruction that does not map directly to 32bit (the mv instruction)

> with 16-bit instructions rather than 32-bit ones, so that
> fragments of code that fit certain constraints, such as using only a
> subset of the register file, sufficiently-narrow immediate operands and
> offsets, and a limited set of operations, can be represented in such
> shorter instructions, so as to save instruction cache, memory, bus
> traffic, etc.

correct.  also this stops massive ISA opcode proliferation.  SIMD as a concept is an O(N^6) opcode proliferation (why do you think x86 now has.. what... 128 bit instruction length, now? 256?)

> I get the idea that the transition between 32-bit instructions and
> 16-bit instructions is to be dynamic, rather than based on the
> instruction encoding.  This raises various concerns to me, from a
> toolchain engineer perspective.

excellent.

> One is how to mark fragments of code so that the tooling can tell
> whether to emit or decode 16-bit or 32-bit instructions.  Say, how is
> the disassembler supposed to tell whether it's looking at a 32-bit
> instruction or a pair of 16-bit instructions?

by looking at the context starting from the function entrypoint, which conforms to standard ABIs and therefore starts only and exclusively from standard v3.0B glibc ABIs.

if that does not exist it will, just like LE/BE, be passed in via parameters.

> Just to give an example of what I'd like to avoid, the SH port has
> floating-point instructions that become single- or double-precision ones
> depending on a bit dynamically set in a control register.

sounds fantastic.  saves vast ISA proliferation.  delighted to hear there is precedent.

do you have a link to the SH port? i have no idea what it is.

>  Just by
> looking at the instruction, there's no way to tell whether it's single-
> or double-precision.

yep.  given that we are retrofitting an existing ISA with "context" in a massive way (bits that are part of the instruction decode but are in a Status Control Register) tooling... well.. has to suck it up and deal with it.

the enntiiire concept we are doing is predicated on "hidden state" (context). Vector Length. predication. element-width overrides.  bank switching.  absolutely everything.

that said: given that this is not going to cross, interfere with or alter ABIs, every function is *required*, clearly, to start from the known standard v3.0B glibc ABI, and every function call must reset back to that known standard ABI, at exit.

therefore every function is *required* to have the exact context you are concerned might be missing.

now, if this SH port did not do that, then of course they are going to run into trouble.

> Are sections to be marked as 32- or 16-bit code, so that transitions
> between modes has to also jump from one section to the other?

the bitmarking and context is intended and designed to avoid precisely that.

the reason is that jumps between sections poisons the entire purpose of having a reduced code size.  one C instruction followed by 6 instructions to LD a 64 bit immediate followed by a 32 bit jump cannot in any way be considered a reduction, can it?

> Are labels to be marked, so that e.g. odd addresses are encoded in the
> more compact mode? 

if that's what it takes for IR generators to communicate a desire to align a label on a specific 16 bit boundary.

however i would suggest simply looking at what RVC does and going with that.

> Can calls and return insns transition between modes?

this would require the definition of or the modification of an ABI that is different from the standard ABI therefore the answer is a resounding and emphatic no.

> Can code at the same address ever be executed in both modes? 

this whilst hypothetically possible would fall into the "foot shooting" category as it would be statistically fantastically improbable.

the only instructions specifically designed with this in mind are nop and illegal.

basically attempts to try this should be emphatically and strongly discouraged and we are in no way going to try to design the ISA to be dual-mode (except for illegal instruction, which is all zeros, so as to cause illegal instruction traps as quickly as possible if a program goes wrong)

> E.g., when
> it comes to dynamic linkers, could there possibly be two GOT and PLT
> entries for the same address, one for each mode, when labels refer to
> the same address except for the mode?

you may be referring to the multi-compilation mode being championed by the RISCV community (not repeat not the same thing as multilib)

multiple recompilations of the exact same function with multiple different flags and ISA capabilities are embedded into the exact same object file and the dynamic linker is expected, at runtime, to pick and fixup to the correct **version** of that function.

these multiple versions all conform to the exact same ABI.

> Another concern is on tooling.  Though a compiler might not have too
> much trouble to figure out that a chunk of code fits the constraints
> that enable the use of the compact mode, that's not quite as useful or
> involved as having the compiler try to make the code fit the
> constraints.  Allocating registers for the constrained register profile,
> reordering code so as to move unfit operations out of the fragment, or
> falling back to alternate operations, scheduling instructions according
> to the execution properties of each mode...  These don't seem to fit
> very well in the current compilation model.

RISCV deals with it, therefore we expect the required code and techniques to exist and to be adaptable to a Compressed OpenPOWER ISA.

> I don't wish to come across as too negative, but it makes me wonder if
> the potential savings this feature will enable won't just go to waste
> because the tools won't be able to take advantage of them.  I wonder if
> it wouldn't make more sense to save this idea for a future development,
> rather than in the critical path for the very first product.  

we have to start somewhere.  if we never start no innovation takes place.

> I can see,
> however, how much of a breakthrough it can be, and how compelling it can
> make the processor, if the potential is realized.

indeed.  and given that RISCV has done (is doing) exactly the same thing, they have teams of experts being paid huge sums of money in the order of around a million dollars a year estimated to sort this out.

we simply have to wait for the code to hit mainline gcc and llvm and learn from it.

also given the compelling code and power reduction i expect some uptake from IBM. this may take a while and require at least proof of concept, demonstrable precedent, and so on.

Comment 48 Luke Kenneth Casson Leighton 2020-11-21 10:53:46 GMT

On 11/21/20, Cole Poirier <colepoirier@gmail.com> wrote:

> idea is to do the opposite of ARM vs thumb. The op codes in the compressed
> ISA extension will simply be duplicate versions of ones that exist in the
> 3.0B version of the OpenPowerISA,

correct

> and that the 16 bit instructions will be
> in the same stream and addresses as the 32 bit standard ones.

correct.  up to this point it is exactly as in RVC from RISCV.

therefore, up to this point, the exact same tooling techniques apply.

> There will be
> one instruction to indicate entrance into 16 bit instruction mode, one to
> exit back to normal 32 bit mode, one to execute *only* the next instruction
> in 32 but mode then return to 16 bit mode

this is the point at which things deviate from how it works with ARM and RVC.

as explained in the introduction section RVC is specifically designed from the groubd up with RVC in mind.  3 2bit tags are used in the LSBs as 16bit identifiers; the 4th (0b11) is a Huffman escape sequence to indicate 32 bit mode.

this technique pressurises the 32 bit mide by reducing available Major opcode Space but that is another story.

to achieve the same thing for OpenPOWER we would need to take up something mad like 24 Major Opcodes (anything and everything with 0b00/01/10 in the bottom 2 LSBs).

clearly that is flat-out unacceptable therefore the alternative is to use "hidden context".

hidden context i.e. state aka "namespaces" aka ISAMUX aka ISANS aka escape-sequencing basically has bits that come from a Special Purpose Register (aka CSR) actually go directly into the instruction decoder as if it was actually part of that instruction, because, well, in reality and in actual fact, it is.

Comment 49 Luke Kenneth Casson Leighton 2020-11-21 11:03:50 GMT

(In reply to Luke Kenneth Casson Leighton from comment #47)

> On 11/21/20, Alexandre Oliva <oliva@gnu.org> wrote:

> > One is how to mark fragments of code so that the tooling can tell
> > whether to emit or decode 16-bit or 32-bit instructions.  Say, how is
> > the disassembler supposed to tell whether it's looking at a 32-bit
> > instruction or a pair of 16-bit instructions?

we have identified no less than *three* places where "hidden state" completely changes the meaning of instructions in OpenPOWER ISA, already.

what we are doing is really no different although it is not so common to change state during a function call.

these are (so far identified):

* LE/BE mode
* 32/64 mode
* FP Accuracy mode

whilst the two first ones tend to be on a per-program basis, the last one definitely is not.  this is specifically intended to be used to get FP hardware to "speed up" by returning a less-accurate answer when accuracy is not critical to the application.

clearly a given function can in NO WAY be left in this mode so it is set, the instruction called, and the relevant FPSPR bit reset.

that gcc, llvm and binutils have no idea that this FP mode even exists is not IBM's problem.

given that the precedent exists we are exploiting that, admittedly to an extent that is unprecedented in the history of computing.

Comment 50 Luke Kenneth Casson Leighton 2020-11-21 11:32:14 GMT

https://news.ycombinator.com/item?id=14518758

Itanium failed because it was a VLIW microarchitecture.  you cannot deploy OoO techiques on VLIW or i would hazard a guess that anyone trying would run away screaming.

On 11/21/20, Lauri Kasanen <cand@gmx.com> wrote:

> This reminded me that there is predecent in LLVM. Some folks used
> LLVM's optimizers on LZ-style compression. It's a hard, NP-class
> problem on how to get the minimum compressed size - should I use this
> dict entry but a worse one later, or maybe try another combination.

and, even getting it right you have the pressure of how much time it takes.

all that, whilst it is an issue, it is one that over time can be taken care of.  unlike VLIW which is a known-failed path there are people working on and solving this path.

Comment 51 Jacob Lifshay 2020-11-22 01:22:40 GMT

(In reply to Luke Kenneth Casson Leighton from comment #46)
> https://godbolt.org/z/enTPes
> 
> hmmm this short example seems to be using an awful lot of r1 *and* r31 LD/ST

That's because you forgot to enable optimizations -- we *should not* be optimizing compressed instructions for gcc's unoptimized output because that's not the code you'll see when people care about going fast and/or being small.

Comment 52 Jacob Lifshay 2020-11-22 02:07:41 GMT

(In reply to Jacob Lifshay from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #46)
> > https://godbolt.org/z/enTPes
> > 
> > hmmm this short example seems to be using an awful lot of r1 *and* r31 LD/ST
> 
> That's because you forgot to enable optimizations -- we *should not* be
> optimizing compressed instructions for gcc's unoptimized output because
> that's not the code you'll see when people care about going fast and/or
> being small.

Some additional details: For PowerPC64, most functions will have a statically-sized stack frame, which allows using offsets from r1 for all stack frame memory access. The only code that uses r31 is code that has a variable-size stack frame (or where the compiler just doesn't know -- like the unoptimized code linked above). Therefore, I think we shouldn't have special compressed instructions just for the frame pointer, but I do think we should have the frame pointer included in the registers usable by compressed instructions.

Comment 53 Jacob Lifshay 2020-11-22 02:08:57 GMT

(In reply to Jacob Lifshay from comment #52)
> Some additional details: For PowerPC64, most functions will have a
> statically-sized stack frame, which allows using offsets from r1 for all
> stack frame memory access. The only code that uses r31 is code that has a
> variable-size stack frame (or where the compiler just doesn't know -- like
> the unoptimized code linked above). Therefore, I think we shouldn't have
> special compressed instructions just for the frame pointer, but I do think
> we should have the frame pointer included in the registers usable by
> compressed instructions.

r31 can be used as a temporary register as well, so it's not like we're wasting a compressed register slot.

Comment 54 Luke Kenneth Casson Leighton 2020-11-22 03:11:37 GMT

(In reply to Jacob Lifshay from comment #53)

> r31 can be used as a temporary register as well, so it's not like we're
> wasting a compressed register slot.

got it.  all good points esp. about optimised code being a priority. alexandre tracked down the reg priority allocation from gcc.

Comment 55 Luke Kenneth Casson Leighton 2020-11-22 19:06:45 GMT

additional discussion at:
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001267.html

Comment 56 Luke Kenneth Casson Leighton 2020-11-22 19:41:12 GMT

[moved from bug #529] aoliva wrote:

I can't help the feeling that we're wasting precious encoding bits to remain in 16-bit mode.  We could easily double the number of accessible registers using them, and having an encoding similar to that of nop to switch out.

I'm also thinking that, if we use 6 bits just to enter 16-bit mode, and the remaining 10 bits, though theoretically useful, will most often go to waste, and we were willing to spend 2 bits to remain in it, we might as well use one-byte nops (suitably encoded) to switch between modes, and have 16-bit instructions using the whole 16 bits.  We could then have 16-bit instructions starting at odd addresses, and 32-bit ones at even addresses, sort of like ARM/Thumb, SH5media/SHcompact, etc, but for real.

(I realize this would make every other instruction cross a word boundary; we don't seem to have ruled out doing so for 32-bit insns, so I assume we have figured out how to save part of one word used for one insn, to use along with part of the next word that holds the other half of the insn)

We could have an even more extensive opcode selection by giving up 3-register encodings, and follow the practice of other compact encodings of using a single register as input and output operand.  Most (*) 3-operands insns x := y op z can be turned into x := y ; x op= z, which is no worse and probably better than switching to 32-bit mode, and we could have an extend-next pseudo-insn to supply an extra operand and/or extra immediate/offset bits to the subsequent insn.

(*) the exception being x <- y op x, when op is not commutative

With the even/odd address distinction, we could have code labels, even function entry points, at odd addresses.  Relative branches in 32-bit mode wouldn't be able to reach them, but for those, we could just insert a nop across the label:

  <32-bit conditional branch to L>
  <32-bit insn>
  < 8-bit switch to 16-bit mode>
  <16-bit insn>
  < 8-bit first half of 16-bit nop>
L:
  < 8-bit second half of 16-bit nop, AKA 8-bit switch to 16-bit mode>
  <16-bit insn>
  <16-bit insn>
  <32-bit 16-bit extend-next + 16-bit extended insn>
  ...

Should I develop these possibilities further?

Comment 57 Luke Kenneth Casson Leighton 2020-11-22 20:34:34 GMT

(In reply to Luke Kenneth Casson Leighton from comment #56)
> [moved from bug #529] aoliva wrote:
> 
> I can't help the feeling that we're wasting precious encoding bits to remain
> in 16-bit mode.  We could easily double the number of accessible registers
> using them, and having an encoding similar to that of nop to switch out.

this was discussed in comment #8

> I'm also thinking that, if we use 6 bits just to enter 16-bit mode, and the
> remaining 10 bits, though theoretically useful, will most often go to waste,
> and we were willing to spend 2 bits to remain in it, we might as well use
> one-byte nops (suitably encoded) to switch between modes, and have 16-bit
> instructions using the whole 16 bits.  We could then have 16-bit
> instructions starting at odd addresses, and 32-bit ones at even addresses,
> sort of like ARM/Thumb, SH5media/SHcompact, etc, but for real.

this would require complex scheduling and sizing of instructions to get the perfect alignment.

of course the imperfect version would simply throw in nops.

if we knew in advance that instruction streams were to remain in 16bit mode consistently for considerable durations i would be much more in favour of schemes that had to rely on nops to transition between modes.

> (I realize this would make every other instruction cross a word boundary; we
> don't seem to have ruled out doing so for 32-bit insns,

yep not a problem as long as 32 bit instruction branches are thought through
(v3.0B requires ignoring the bottom 2 bits of the PC)

> so I assume we have
> figured out how to save part of one word used for one insn, to use along
> with part of the next word that holds the other half of the insn)

yep, just a queue data structure that can hold at least 48 bits.

> We could have an even more extensive opcode selection by giving up
> 3-register encodings, 

maybe.  bear in mind the idea is to encode the most common instructions.  this is where we really need that statistical analysis.

> and follow the practice of other compact encodings of
> using a single register as input and output operand. 

this was discussed in comment #18

> Most (*) 3-operands
> insns x := y op z can be turned into x := y ; x op= z, which is no worse and
> probably better than switching to 32-bit mode,

except: look at the amount of space used to do so.  it's still 32 bit, isn't it?

which means, it's "par" (no advantage eitther way)

and that puts a 16-bit instruction which has 2 src 1 dest at a significant advantage.

bear in mind: i recognise that on the transition from v3.0B to 16bit mode it's impossible to fit 3 operands, hence all 10bit encodings are 1 src 1 dest.

> and we could have an
> extend-next pseudo-insn to supply an extra operand and/or extra
> immediate/offset bits to the subsequent insn.

i very much do not want to go down this particular route, although it is very tempting to do so.

the problem is that it is basically the design route taken by CISC ISAs.  the extend sequence never ends once the door is opened up, and the encoding FSM becomes a nightmare (particularly given that we have SV Prefixing to also add)

i would greatly prefer that whatever we cannot fit into 16 bit we simply drop back into v3.0B compatibility.

remember that proposals for ISA Extensions have to be reviewed by the OpenPOWER Foundation.  the probability of acceptance is a product of two factors:

* complexity
* benefit

if the complexity is high then the benefit had better be waay higher.

> (*) the exception being x <- y op x, when op is not commutative
> 
> With the even/odd address distinction, we could have code labels, even
> function entry points, at odd addresses.  Relative branches in 32-bit mode
> wouldn't be able to reach them, but for those, we could just insert a nop
> across the label:
> 
>   <32-bit conditional branch to L>
>   <32-bit insn>
>   < 8-bit switch to 16-bit mode>
>   <16-bit insn>
>   < 8-bit first half of 16-bit nop>
> L:
>   < 8-bit second half of 16-bit nop, AKA 8-bit switch to 16-bit mode>
>   <16-bit insn>
>   <16-bit insn>
>   <32-bit 16-bit extend-next + 16-bit extended insn>
>   ...

ahh... ahh... this is novel.  effectively it's using the bottom 2 bits (actually, bit 1) to indicate the 16/32 bit mode.

that needs some exploration.

> Should I develop these possibilities further?

yes definitely

Comment 58 Luke Kenneth Casson Leighton 2020-11-22 20:36:51 GMT

(again moved from bug #529)

Uhh, sorry, the comment above was really meant for bug 238.  I'll copy it there.  There were elements of the proposal to use odd addresses that would be relevant here, but that I haven't elaborated on.

Specifically, we could make that work for both BE and LE: in 32-bit mode, the most significant byte (regardless of code endianness) is the one that might hold the 8-bit nop-ish insn to switch to 16-bit mode, since that's where the EXT bits are.

As for 16-bit mode, the 4n+1 address would be one of the two bytes of the insn full-contained in the word, and 4n+2 would be the other, regardless of endianness.

At 4n+3 addresses, we'd have to go for the least significant byte of the word at 4n for the first half of the 16-bit insn, again the one holding the opcode, so that we can tell whether it's a mode-switching nop or we are to fetch the second half from the next word.

If we're using misaligned 32-bit addresses, again we ought to use the LSB half of the lower-address word as the MSB of the insn, to recognize the opcode (possibly an 8-bit nop), and the MSB half of the higher-address word for the remaining bits of the 32-bit insn split across two words.

Comment 59 Luke Kenneth Casson Leighton 2020-11-22 20:46:06 GMT

(In reply to Luke Kenneth Casson Leighton from comment #58)

> might hold the 8-bit nop-ish insn to switch to 16-bit mode, since that's
> where the EXT bits are.

ahh.. ahh... ah ha! this is a great idea.  allow 16bit instructions to sit on *byte* boundaries, flip into 16bit mode with an 8 bit indicator, being the byte containing the v3.0B Major Opcode.

what's really nice about this is: actually you only need the one v3.0B Major Opcode.

also if there are any circumstances where only 8bit encodings would suffice these can be used to switch modes (within the 16 bit space).

i like it.  lots of possibilities to think through, here.

Comment 60 Alexandre Oliva 2020-11-22 22:59:04 GMT

(In reply to Luke Kenneth Casson Leighton from comment #57)

>> I can't help the feeling that we're wasting precious encoding bits to remain
>> in 16-bit mode.  We could easily double the number of accessible registers
>> using them, and having an encoding similar to that of nop to switch out.

> this was discussed in comment #8

It was, but from the perspective of "let's try to get some of the most
common opcodes duplicated in this compressed form", which reminds me of
Thumb v1.  What I'm thinking is of a slightly different perspective, of
making the compressed form so extensive that you'd hardly ever have a
need to switch back to 32-bit mode, which is more like later versions of
Thumb.

Bear in mind that Thumb nowadays is a very complete instruction set.
Why not aim for that?


>> I'm also thinking that, if we use 6 bits just to enter 16-bit mode, and the
>> remaining 10 bits, though theoretically useful, will most often go to waste,
>> and we were willing to spend 2 bits to remain in it, we might as well use
>> one-byte nops (suitably encoded) to switch between modes, and have 16-bit
>> instructions using the whole 16 bits.  We could then have 16-bit
>> instructions starting at odd addresses, and 32-bit ones at even addresses,
>> sort of like ARM/Thumb, SH5media/SHcompact, etc, but for real.

> this would require complex scheduling and sizing of instructions to get the
> perfect alignment.

Err, I think we're miscommunicating.

There's no need for any special alignment.  The odd- and even- addresses
trivially follow from using a 1-byte switch insn.

> if we knew in advance that instruction streams were to remain in 16bit mode
> consistently for considerable durations i would be much more in favour of
> schemes that had to rely on nops to transition between modes.

That's what I'm aiming for with my suggestions.

>> We could have an even more extensive opcode selection by giving up
>> 3-register encodings, 

> maybe.  bear in mind the idea is to encode the most common instructions.  this
> is where we really need that statistical analysis.

>> and follow the practice of other compact encodings of
>> using a single register as input and output operand. 

> this was discussed in comment #18

Yeah, but dismissed without data.

Consider that 22% of the instructions that take 3 registers as operands
in the ppc-gcc binary I'm using for tests actually use the output
register as one of the input operands, without any effort by the
compiler to make them so:

$ ./objdump -d ppc-gcc | grep -c ' r[0-9]*,r[0-9]*,r[0-9]*$' # x <- y op z
5731
$ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),\1,r[0-9]*$'  # x <- x op z
673
$ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),r[0-9]*,\1$'  # x <- y op x
630

We (well, I :-) can easily reconfigure the compiler to prefer such a
form, without ruling out 3-different-register alternatives, and see how
far that gets us, but my intuition is that, with such a change, and
access to the full register file (as enabled by using 2 rather than 3
operands), we could then get very long sequences compressed of
instructions.

>> Most (*) 3-operands
>> insns x := y op z can be turned into x := y ; x op= z, which is no worse and
>> probably better than switching to 32-bit mode,

> except: look at the amount of space used to do so.  it's still 32 bit, isn't
> it?

Point was, even in the (to be exceptional) case of NOT getting an insn
that could be encoded as a single compressed insn from the compiler, you
could still remain in compressed mode.

And by making the switching out of compressed mode rare enough, we'd
strengthen the case for using the 2/16 bits for something more useful
than switching modes: the switch out would be an exception, thus a
special opcode rather than 1/8th of every insn.

And then, with the 1-byte mode switching, if we wanted to switch for 1
or n 32-bit insns, that costs us the space of one 16-bit insn: 8 bits
before the first 32-bit insn, and another 8 bits after the last one to
switch back.

> bear in mind: i recognise that on the transition from v3.0B to 16bit mode it's
> impossible to fit 3 operands, hence all 10bit encodings are 1 src 1 dest.

To me, the 10-bit encodings come across as complexity I'd much rather
get rid of.

>> and we could have an
>> extend-next pseudo-insn to supply an extra operand and/or extra
>> immediate/offset bits to the subsequent insn.

> i very much do not want to go down this particular route, although it is very
> tempting to do so.

Think of it as a 32-bit insn representation if you must ;-)

Thumb does that.  We're looking at 48-bit instructions that are hardly
different from that.

> if the complexity is high then the benefit had better be waay higher.

I don't really feel that an extend-next opcode that supplies an extra
operand and/or an extended immediate range for the subsequent insn that
would otherwise take it, respectively, from a repeat operand, or from a
sign-extension of a narrow immediate range is such a huge increase in
complexity, and I believe the benefits could be huge in enabling us to
remain in compressed mode for much longer.

> ahh... ahh... this is novel.  effectively it's using the bottom 2 bits
> (actually, bit 1) to indicate the 16/32 bit mode.

*nod*, just like Thumb vs ARM and SHcompact vs SHmedia.

Except they did not actually use misaligned insns crossing word
boundaries.  Presumably that would be a problem for them, but for us it
looks like it isn't, so we might as well take advantage of it ;-)

Comment 61 Luke Kenneth Casson Leighton 2020-11-22 23:00:04 GMT

Comment 62 Luke Kenneth Casson Leighton 2020-11-22 23:19:27 GMT

alexandre i have such a massive headache i'm not really able to concentrate.  brief comments below, can you start a 2nd section in the wiki page, "compressed 2nd version"

(In reply to Alexandre Oliva from comment #60)

> To me, the 10-bit encodings come across as complexity I'd much rather
> get rid of.

the entire SimpleV concept is based around context tagging.  it's not going away.

trying to drop this one tool in an arsenal of tools all based around the same state-based concept would be... fruitless.

i.e. to try to get 10bit to not exist because it requires state, when the entirety of SimpleV requires state, so that objdump will have an easier time, this is not going to happen because for objdump to support disassembly of SimpleV *will* require the state tracking i mentioned.

if however, disregarding state as being a "disadvantage", there are *better schemes* than the 10bit one, i'm happy to hear about them.


> I don't really feel that an extend-next opcode that supplies an extra
> operand and/or an extended immediate range for the subsequent insn that

SV Prefix (which you've not been introduced to yet) is already adding 2 bits per register.

Comment 63 Alexandre Oliva 2020-11-22 23:35:43 GMT

Sorry about your headache, I hope you get better soon.

I sense miscommunication here.  Regarding the attempts to use the 2 bits of the mode-switching byte, along with another byte to try to encode some useful instruction, is what I consider unnecessary complexity.

It has absolutely ZERO to do with the concerns of separate state needed to disassemble.  Indeed, the use of 8-bit switch-mode insns that I have proposed does not affect in any way the difficulties a disassembler might face when pointed at a word holding a single 32-bit insn, two halves of 32-bit insns, 2 16-bit insns, one 16-bit insn and two halves of other 16-bit insns, or any such thing.

Please don't conflate unrelated issues and, especially, don't reject ideas based on assumptions as to their motivations, especially if the assumptions are unfounded and, as it turns out, incorrect.

Comment 64 Jacob Lifshay 2020-11-23 00:19:13 GMT

I quite like the idea for encoding compressed instructions by having their address be odd, as long as we retain the ability to be completely backward-compatible with Power ISA v3.1 (including their 64-bit instructions) in both BE and LE modes.

IIRC the opcode byte is the third byte in LE mode, so, we'd need to conceptually decode bytes from aligned 32-bit words in instruction memory in this order:

...
word N: 3rd, 4th, 1st, 2nd
word N+1: 7th, 8th, 4th, 5th
word N+2: 11th, 12th, 9th, 10th
...

where all instructions are made from a sequence of 1 or more bytes

Comment 65 Luke Kenneth Casson Leighton 2020-11-23 00:28:15 GMT

(In reply to Alexandre Oliva from comment #63)

> Please don't conflate unrelated issues and, especially, don't reject ideas
> based on assumptions as to their motivations, especially if the assumptions
> are unfounded and, as it turns out, incorrect.

severe pain is stopping me from being able to concentrate and absorb.  can you make a start on an encoding in ikiwiki, start at the top level with how the FSM would decode (identify) modes, how 16/32 sequence together, cut/paste the ASCII layouts if it helps.

Comment 66 Luke Kenneth Casson Leighton 2020-11-23 07:39:47 GMT

(In reply to Jacob Lifshay from comment #64)
> I quite like the idea for encoding compressed instructions by having their
> address be odd, 

it's novel, i like that.  still need to think it through.  i'd like to see a table, 4-bytes per row, showing rough examoles of how this would work.  (expansion to bitlevel can be done after).

my first observation was that, forgetting the actual byte placement, ignoring the relative positioning/FSM and counting the "effectiveness" (bit usage), oliva's proposal is effectively:

* move everything along by one byte
  (in any given 16bit contiguous sequence)
* drop the 10bit encoding
  (and then further drop modeswitching)

in other words:

* 3 bits at the end of the EXTNNN Major
  opcode are currently unallocated:
  (can they be used at all?)
* the separation of those 3 bits plus
  the 8 of the alignment nop *used* to
  be the 10-bit encoding opportunity.

with those now split from each other that opportunity is gone.  is it worth the tradeoff of the LSB being the 16/32 mode identifier? this needs evaluating carefully.

key questions:

* can the 32 bit v3.0B instructions be
  identified even on 16bit boundaries?
  (4 byte per row tables will help
   illustrate if it is)
* is the loss of 10/11 bits statistically
  worth it? i.e. how many nops are there
  likely to be?

> as long as we retain the ability to be completely
> backward-compatible with Power ISA v3.1 (including their 64-bit
> instructions) in both BE and LE modes.

EXT001 is unfortunately one of the top candidates for use by Compressed (because it is right next to EXT000 which we need to encode all-zero illegal encoding, and it saves gates).

however this is not actually a problem because the Compressed and v3.1B Prefixing are mutually exclusive.  this just leaves ensuring that return to v3.1 Mode is "clean" (when the OPF EULA is updated to allow anyone to implement commercial v3.1 ASICs that is)

remember also that the dual half-word reordering is effectively identical to quad bytelevel reordering.  one is applied to memory prior to decoding, the other to the 32bit opcode.

essentially there is zero functional difference between them: it's just that one of them is termed "BE instruction encoding" and has an unfair stigma "BE is the loser therefore avoid" attached to it.

bottom line: if the 8 bit idea pans out but causes merry hell with the dual half-word memory reordering, we can always, just as is done with VLE Encoding, mandate that BE instruction encoding is required [note: *data* BE encoding *separate* and not included in that].

Comment 67 Jacob Lifshay 2020-11-23 07:46:25 GMT

(In reply to Luke Kenneth Casson Leighton from comment #66)
> bottom line: if the 8 bit idea pans out but causes merry hell with the dual
> half-word memory reordering, we can always, just as is done with VLE
> Encoding, mandate that BE instruction encoding is required [note: *data* BE
> encoding *separate* and not included in that].

Except that BE instruction encoding is incompatible with the PowerPC64LE ABI which requires that instruction encoding is set to LE on call/return/unwind.

Comment 68 Luke Kenneth Casson Leighton 2020-11-23 08:23:48 GMT

(In reply to Jacob Lifshay from comment #67)

> Except that BE instruction encoding is incompatible with the PowerPC64LE ABI
> which requires that instruction encoding is set to LE on call/return/unwind.

tck tck... thinks... does the dual halfword memory swapping similarly interfere with ppc64le ABI? i would expect the answer to that to be yes.

the *instruction* encoding is required to be LE: what about data?

remember we are expecting that all code set to known-good state (mode) on entry/exit (at the ABI points).

i would reaaally prefer that we not have to define a new ABI, it means recompiling entire OS Distros.

Comment 69 Jacob Lifshay 2020-11-23 08:46:09 GMT

(In reply to Luke Kenneth Casson Leighton from comment #68)
> (In reply to Jacob Lifshay from comment #67)
> 
> > Except that BE instruction encoding is incompatible with the PowerPC64LE ABI
> > which requires that instruction encoding is set to LE on call/return/unwind.
> 
> tck tck... thinks... does the dual halfword memory swapping similarly
> interfere with ppc64le ABI? i would expect the answer to that to be yes.

No it does not interfere, because it is changing both the order that bytes are conceptually read from memory as well as swapping the instruction encoding so that the combination matches the original unmodified 32-bit LE encoding.

> the *instruction* encoding is required to be LE: what about data?

that too, that's the whole point of the ppc64le ABI: the data is in LE format to match the de-facto industry standard (because of x86 and Arm being LE).

> remember we are expecting that all code set to known-good state (mode) on
> entry/exit (at the ABI points).

yup!

> i would reaaally prefer that we not have to define a new ABI, it means
> recompiling entire OS Distros.

I completely agree, hence why I'm pushing against just having instructions be in BE mode.

Comment 70 Luke Kenneth Casson Leighton 2020-11-23 12:42:17 GMT

(In reply to Jacob Lifshay from comment #69)
> (In reply to Luke Kenneth Casson Leighton from comment #68)
> > tck tck... thinks... does the dual halfword memory swapping similarly
> > interfere with ppc64le ABI? i would expect the answer to that to be yes.
> 
> No it does not interfere, because it is changing both the order that bytes
> are conceptually read from memory as well as swapping the instruction
> encoding so that the combination matches the original unmodified 32-bit LE
> encoding.

that's an important missing detail not made clear in the bugreport!
can you make it clearer in bug #529 with some examples? (then we add
them to the wiki page once it's completely clear)

Comment 71 Luke Kenneth Casson Leighton 2020-11-23 17:25:26 GMT

(In reply to Luke Kenneth Casson Leighton from comment #65)

> you make a start on an encoding in ikiwiki, start at the top level with how
> the FSM would decode (identify) modes, how 16/32 sequence together,
> cut/paste the ASCII layouts if it helps.

i made a start in the wiki page, can you confirm it looks reasonable?

| byte 0 | byte 1 | byte 2 | byte 3 |
| v3.0B standard 32 bit instruction |
| EXT000 | 16 bit          | 16...  |
| .. bit | 8nop   | v3.0b stand...  |
| .. ard 32 bit   | EXT000 | 16...  |
| .. bit | 16 bit          | 8nop   |
| v3.0B standard 32 bit instruction |

if it does, the next phases would be:

* make a preliminary assessment of branch in/out viability
* confirm FSM encoding (is LSB of PC really enough?)
* guestimate opcode and register allocation (without necessarily doing a full encoding)
* write throwaway python program that estimates compression ratio from objdump raw parsing
* finally do full opcode allocation
* rerun objdump compression ratio estimates

Comment 72 Luke Kenneth Casson Leighton 2020-11-23 20:24:56 GMT

ok coming back to this...

(In reply to Alexandre Oliva from comment #60)
> (In reply to Luke Kenneth Casson Leighton from comment #57)
> 
> >> I can't help the feeling that we're wasting precious encoding bits to remain
> >> in 16-bit mode.  We could easily double the number of accessible registers
> >> using them, and having an encoding similar to that of nop to switch out.
> 
> > this was discussed in comment #8
> 
> It was, but from the perspective of "let's try to get some of the most
> common opcodes duplicated in this compressed form", which reminds me of
> Thumb v1.  What I'm thinking is of a slightly different perspective, of
> making the compressed form so extensive that you'd hardly ever have a
> need to switch back to 32-bit mode, which is more like later versions of
> Thumb.
> 
> Bear in mind that Thumb nowadays is a very complete instruction set.
> Why not aim for that?

simple answer: time (as in: we're under time pressure)  additional answer: it becomes CISC, and is harder to justify to the OPF ISA Working Group.

consider these options:

* a complex (CISC-like) decoder that duplicates the entirety of the OpenPOWER v3.0B ISA
* a greatly aimplified (RISC-like) decoder that rides off the back of an existing v3.0B ISA

the former is a huge amount of work (both to design and implement) being effectively an embedding of an entirely new ISA within OpenPOWER

the latter is far less work, can be implemented as a no-state "mapping table" (every 16bit opcode maps directly and cleanly to a 32bit equivalent with very few gates) and requires near-trivial initial changes to gcc and binutils.

yet both provide high compression ratio.

which, by the assessment criteria that Paul Mackerras kindly informed us would be used by the OpenPOWER ISA WG, would stand the higher chance of being accepted?

the idea here is, like RISCV RVC, to go after the "low hanging fruit" with the highest bang-per-buck.  minisise effort and changes whilst maximising benefit.

a new (full) embedded CISC-like encoding maximises benefit but also maximises changes and complexity, as well as costs us time (and NLnet funding) which we really don't have.

 

> > if we knew in advance that instruction streams were to remain in 16bit mode
> > consistently for considerable durations i would be much more in favour of
> > schemes that had to rely on nops to transition between modes.
> 
> That's what I'm aiming for with my suggestions.

i do see where it's going.  the idea is, stay as long as possible in 16bit mode so that 32bit v3.0B is not needed.

unfortunately, if we had started this a year ago it might have been viable within the budget and time to explore.

rather than go the ARM thumb route i feel our efforts are better spent going the RISCV RVC route, which is based around letting 16 and 32 bit opcodes interleave even down to single-digit quantities.

if however it can be shown clearly that the fixed overhead of the 10 bits being effectively wasted is still a better compression ratio i will be more than happy with it.


> >> and follow the practice of other compact encodings of
> >> using a single register as input and output operand. 
> 
> > this was discussed in comment #18
> 
> Yeah, but dismissed without data.

really delighted you were able to get some.
 
> Consider that 22% of the instructions that take 3 registers as operands
> in the ppc-gcc binary I'm using for tests actually use the output
> register as one of the input operands, without any effort by the
> compiler to make them so:
> 
> $ ./objdump -d ppc-gcc | grep -c ' r[0-9]*,r[0-9]*,r[0-9]*$' # x <- y op z
> 5731
> $ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),\1,r[0-9]*$'  # x <- x op z
> 673
> $ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),r[0-9]*,\1$'  # x <- y op x
> 630

rrrigh.  ok.  so the next most critical question(s) is/are:

* of the x y op z type, how many of these do and do not fit into 16-bit compressed format?
* of the x x op z and x y op x format, likewise?

my concern is this: it's only 20%.  therefore even if 100% of those are candidates for 16bit compression, the maximum achievable compression ratio is:

   (0.8 * 4bytes + 0.2 * 2bytes) / 4bytes

whereas if we have a full 3-op 16bit compression system and by some fantastically weird fluke, 100% of those are compressible, it is:

   1.0 * 2bytes / 4bytes

i.e. 50%.

so we definitely need a further breakdown, because, ironically, without further data, intuition is in favour of 3-operand!

> We (well, I :-) can easily reconfigure the compiler to prefer such a
> form, without ruling out 3-different-register alternatives, and see how
> far that gets us,

ahh now *that* is valuable.   you're talking about actually changing ppc64 gcc to get it to emit a greater proportion of 2-op instructions on standard v3.0B Power ISA?

then if those proportions tip to say 20% 3op 80% 2op, the analysis can be done again.

the only thing is: if the 32bit recompilation increases executable size we have to do the comparison against the *unmodified* version of gcc.

if however it actually decreases executable size then apart from laughing so hard we forget to submit an upstream patch, the smaller executable size is our comparison.

:)


> >> Most (*) 3-operands
> >> insns x := y op z can be turned into x := y ; x op= z, which is no worse and
> >> probably better than switching to 32-bit mode,
> 
> > except: look at the amount of space used to do so.  it's still 32 bit, isn't
> > it?
> 
> Point was, even in the (to be exceptional) case of NOT getting an insn
> that could be encoded as a single compressed insn from the compiler, you
> could still remain in compressed mode.

... and this is what we need to check as a tradeoff.  got it.

> And by making the switching out of compressed mode rare enough, we'd
> strengthen the case for using the 2/16 bits for something more useful
> than switching modes: the switch out would be an exception, thus a
> special opcode rather than 1/8th of every insn.

this does mean some significant gcc and llvm changes to the compilation strategies.  we had better be damn sure it's a good idea before committing resources to it.


> >> and we could have an
> >> extend-next pseudo-insn to supply an extra operand and/or extra
> >> immediate/offset bits to the subsequent insn.
> 
> > i very much do not want to go down this particular route, although it is very
> > tempting to do so.
> 
> Think of it as a 32-bit insn representation if you must ;-)
> 
> Thumb does that.  We're looking at 48-bit instructions that are hardly
> different from that.

i know.  alarm bells are ringing at how much wotk is involved given everything else that needs doing, and that it is the "CISC" design route, for which we would have some serious pushback from the OPF ISA WG over.

remember, this has to be acceptable to IBM for inclusion in a future POWER11 or POWER12 core.

> I don't really feel that an extend-next opcode that supplies an extra
> operand and/or an extended immediate range for the subsequent insn that
> would otherwise take it, respectively, from a repeat operand, or from a
> sign-extension of a narrow immediate range is such a huge increase in
> complexity, and I believe the benefits could be huge in enabling us to
> remain in compressed mode for much longer.

vs being able to flip between the two without a fixed overhead penalty (at least, not a massive one), this is the crux of the debate.

prefixes on top of prefixes make me very nervous as far as the decoder hardware is concerned.

the FSM for the 10/16/32 is really quite straightforward.  entering the 10bit mode is simple: is 32bit Major Opcode == 0b000000.  after that, only 2 bits are involved in the decision-making.

what you are proposing is far more complex, involving the use of a "next-marker" plus adding a Special Purpose Register to store the state information (the immediate that hasn't yet been applied) and/or macro-op fusion analysis.

it involves analysing far more bits to ascertain the length, plus involves basically replacing the entirety of v3.0B because the incentive is there to do so.

it's a huge amount of work and it'S CISC and it goes strongly against everything that is good about RISC microarchitectures.

we intend to do a multi-issue core and that means that the identification of which instruction is which needs to be kept below a threshold of complexity.

CISC encoding is *really* challenging to do multi-issue and we have yet to include SV Prefixing (Vectorisation).

without having full implementation details at the hardware level available to show you, i really, really do not want to take us down the CISC encoding route.

if the extra bits for the regs can fit into a *uniform* 16 bit encoding, that is the RISC paradigm and it is infinitely preferable to the escape-sequence-based CISC one.

Comment 73 Alexandre Oliva 2020-11-23 21:32:14 GMT

re: comment 71

the diagram represents exactly what I propose, thanks.

the only catch is that I'm not sure about the byte order in BE and LE encodings (I read somewhere that it's not quite as obvious as the terms suggest), so instead of bytes 0 through 3, I'd have labeled them MmlL, M standing for most significant, L for least significant.

Comment 74 Alexandre Oliva 2020-11-23 22:10:49 GMT

re: comment 72

going 2-operand to have access to the entire register file doesn't change the nature of the direct mapping at all. when I speak of expanding the scope of the compressed isa into a fuller isa, I'm not talking of inventing new insns not present in 3.0B, nor borrowing insns from Thumb proper, I'm talking of getting more coverage of 3.0B insns.

e.g. 8 registers would to too little even if r0, r1 and r2 weren't reserved to begin with. it's a bit like going back to 32-bit x86, but with plenty of registers there, just not as usable because we're wasting representation bits with something else, so they require switching to uncompressed mode which we'd rather not do.

as for multi-issue, having encoding bits in every insn to tell how the very next one is to be interpreted doesn't help much with that. mode transitions had better be the exception rather than the rule.

OTOH, I can hardly tell the complexity difference between one bit that says "return to compressed mode after the next 32-bit insn" and one that says "use this register as the first input operand instead of the output one in the next insn". that's just a selector in the mapper, that would otherwise merely duplicate the output register.

it's one bit that affects decoding of a subsequent instruction one way or another. but by avoiding this escaping, we instead force the more frequent use of another escaping mechanism, namely switching to uncompressed mode. it's just that with this one, we ensure that *every* occurrence of a 3-operand insn (among the selected opcodes) can be represented in 2-operand compressed mode, even those that can't be made up for with a register-copy insn.

also consider that, unless I'm missing something about the ppc encoding, using the full 16 bits and 2 5-bit operands makes the mapping much easier: 6 bits for the EXT opcode, plus the 2 operands makes 16; in many cases, the mapping could just directly copy 11 bits directly, without any mapping whatsoever; the intelligence would have to go into where/how to copy the remaining bits, and in whether to duplicate the first operand into one of the input fields.

is that not much much simpler, efficient and likely to be accepted than not just one, but two new encodings? (namely 10- and 14-bit AKA 16-bit?)

Comment 75 Luke Kenneth Casson Leighton 2020-11-23 22:19:26 GMT

(In reply to Alexandre Oliva from comment #73)
> re: comment 71
> 
> the diagram represents exactly what I propose, thanks.

i like diagrams. actually, i typically only grasp complex concepts after seeing them expressed in 2 different ways.

anyway.  next on the TODO list, i afded that to the same wiki page (further down)

> the only catch is that I'm not sure about the byte order in BE and LE
> encodings (I read somewhere that it's not quite as obvious as the terms
> suggest), so instead of bytes 0 through 3, I'd have labeled them MmlL, M
> standing for most significant, L for least significant.

ngggh welcome to POWER :) the industry-standard term you are looking for is, i believe, called "MSB0".  there's a wikipedia page which describes it.

Comment 76 Luke Kenneth Casson Leighton 2020-11-23 23:18:17 GMT

(In reply to Alexandre Oliva from comment #74)

>   when I speak of expanding the
> scope of the compressed isa into a fuller isa, I'm not talking of inventing
> new insns not present in 3.0B, nor borrowing insns from Thumb proper, I'm
> talking of getting more coverage of 3.0B insns.

yes.  sorry for not making it clear that i understood that this was the context.

VLE i gather from looking at the way it does implement something similar to what you propose (extend 16bit opcodes by another 16 bits which contain an immediate) *might* i stress might not map onto v3.0B (or in VLE's case because it was designed 10+ years ago v2.06B) opcodes.
 
> e.g. 8 registers would to too little even if r0, r1 and r2 weren't reserved
> to begin with. 

it may surprise you to learn that RISCV RVC is highly effective, achieving something like a 25% reduction.  also, hilariously, someone did an experimental 16bit re-encoding called RV16 (or something like it) that they demonstrated achieved a whopping 40% reduction.

> it's a bit like going back to 32-bit x86, but with plenty of
> registers there, just not as usable because we're wasting representation
> bits with something else, so they require switching to uncompressed mode
> which we'd rather not do.

you mean, "the proposed alternative encoding (v2) has as its premise the avoidance of switching as a driving design goal".

> as for multi-issue, having encoding bits in every insn to tell how the very
> next one is to be interpreted doesn't help much with that.  

if they are very simple and do not involve the HDL equivalent of "deep packet inspection" then yes, they do.

the moment that this inspection becomes a complex layered FSM, with multiple unavoidable gate dependencies, this automatically and inherently means that the top clock speed is limited.

example: to achieve a 5 ghz clock rate you must have no more than around 16 gates in any given "combinatorial cascade" before capturing partial results in "latches" that are then passed on to the next "stage" in the "pipeline".

in other words, the 10/16/32bit FSM, being only involving 2 bits, *not* having to go further in and analyse any more bits, can indeed identify "packets" very easily, precisely because it is only 2 bits.

> mode transitions
> had better be the exception rather than the rule.

you'll need to trust the 4 years i've spent doing HDL development, here:

* 2-bit FSM: fine for multi-issue
* op-next chained CISC encoding: not fine

:)


> OTOH, I can hardly tell the complexity difference between one bit that says
> "return to compressed mode after the next 32-bit insn" and one that says
> "use this register as the first input operand instead of the output one in
> the next insn".  

the first one is a pre-analyser that needs know absolutely nothing about the rest of the bits of the instruction.  the remaining bits, having been identified, can be passed to secondary (parallel) processing analysis units.

that secondary processing branches out, based on information it received from the 1st phase, "this is 10 bit" or 16 or 32.

the other one *cannot do that*, it has to do "deep packet inspection", analysing far more bits before being able to "hand off" to other processing analysis units.




> it's just that with this one, we ensure that *every* occurrence of a
> 3-operand insn (among the selected opcodes) can be represented in 2-operand
> compressed mode, even those that can't be made up for with a register-copy
> insn.

if only the PowerISA were that simple.

we have some pipelines in the LibreSOC codebase with as high as **SEVEN** incoming and **FIVE** outgoing registers.

not all of those are active at the same time, however it's pretty close.

please, i love the idea however i also have a better "feel" for how much combined design, HDL and toolchain work is involved in the CISC escape-sequence approach, and i really, really do not think it is a good idea to commit the available resources from NLnet to it.


> also consider that, unless I'm missing something about the ppc encoding,
> using the full 16 bits and 2 5-bit operands makes the mapping much easier: 6
> bits for the EXT opcode, plus the 2 operands makes 16; 

the point is, here: this is effectively shuffling the v3.0B encoding space around, just to reach 32bit parity with an existing long-established encoding (v3.0B).

the time taken would be enormous. remember: OpenPOWER has 300+ integer instructions, 300+ FP ones and a staggering 700+ SIMD ones

then there are at least *SIX* separate and distinct register files (!), INT FP CR STATE MSR XER and other SPRs

i am trying to give you some idea, without going into too much detail, of the implications time-wise of the "simple-sounding" idea of re-encoding the entirety of the OpenPOWER ISA.

it is a massive rabbithole and timesink that would easily justify its own EUR 50,000 NLnet Grant, and could take most of a year to complete.

whereas a subset encoding we can just about justify by selecting the top 10-15% instructions, demonstrate that this gives us a 25% (whatever) compression ratio, implement it, declare the NLnet Grant milestone "complete" and be in a position to apply for another one.



> in many cases, the
> mapping could just directly copy 11 bits directly, without any mapping
> whatsoever; the intelligence would have to go into where/how to copy the
> remaining bits, and in whether to duplicate the first operand into one of
> the input fields.
> 
> is that not much much simpler, efficient and likely to be accepted than not
> just one, but two new encodings?  (namely 10- and 14-bit AKA 16-bit?)

that's what i would like to determine... but *not* by re-encoding the entirety of the OpenPOWER ISA, and definitely not by using CISC-style variable-length encodings...

...*unless* those variable-length encodings are using as an absolute maximum one maybe 2 *uniform* bits to identify.

* acceptable:

     if instr[0:2] == 0b00 then
          length = 32

   gate chain depth here is around 2

* also acceptable:

     if FSM.mode==X & instr[0]==0b1 then
          FSM.next.mode = Y
          length = 10
     elif FSM.mode==Y ....

   gate chain depth is also around 2

* not acceptable:

     if instr[0:4] == 0b00110 then
         if something else then
              if something else from
                 somewhere else:
                     length = 32

   gate chain here is very high and will
   jeapordise chances of high performance
   multi-issue.

Comment 77 Jacob Lifshay 2020-11-24 00:07:51 GMT

While going through the PowerISA v3.1 spec, I noticed that we can't really use Primary Opcode 0 for compressed instructions, since Power reserves Primary Opcode 0 with Extended Opcode 256 to mean "Service Processor Attention", an instruction implemented by at least POWER8 and POWER9. This means that processors would have to read a 32-bit instruction to determine if the instruction was that instruction or not, so we can't redefine it to be a 16-bit instruction.

If we end up needing a whole primary opcode, we should use Primary Opcode 5 instead, since it doesn't have any already-used encodings.

Comment 78 Alexandre Oliva 2020-11-24 00:40:33 GMT

re: comment 76

let me straighten out and separate some ideas so that they're not conflated:

1. use 2-operand encoding in compressed mode, and modify the compiler to give preference to such insns

2. simplify the mapping by using the same 6-bit opcode encoding that uncompressed mode uses (even if we don't map every opcode; just make those that we do the same)

3. use 1-byte mode-switching instructions

3 implies we'd have to look at all opcode bits to tell how to interpret the next byte in the stream, i.e., whether it's the second half of a 16-bit insn, or the opcode byte of a 32-bit insn (which might be a switch-back, but I disgress)

I don't see how that's worse than having an opcode that sets one bit for the operand mapper of the next insn, and saves some 8, 9 or 10 bits for the decoder/mapper to use, presumably as operands, when decoding the next insn.

if you think of this pair as a single insn you'd like to process in a single cycle, yeah, I can see how that would put pressure in the decoder and in the pipeline, and make it very CISCish

but if you conceive of it as a mostly-separate instruction, then it's not that bad.  taking an extra cycle to process the pair is not so bad compared with the alternatives:

- issue an extra move before (par)

- switch to 32-bit mode, perform the operation, switch back (lose)

that said, looking at the insns available through primary opcode encoding, and how many resort to extended encoding, I come out leaning towards agreeing that using that direct mapping only might give us too little, and that we might have to switch back to uncompressed mode far more often than I was hoping for

Comment 79 Alexandre Oliva 2020-11-24 02:15:59 GMT

I've just completed my first skimming over the entire 3.0 ISA.

The major mistake that I realized I made, and that I'm about to correct, is that I wrote in comment 58, that Luke kindly moved here from bug 529, that the primary opcode was in the MS byte, but it's actually in the LS byte.

So the diagram you made, Luke, presumably based on that incorrect information, should have bytes 0 to 3 mapped to LlmM rather than MmlL.

Comment 80 Alexandre Oliva 2020-11-24 02:50:12 GMT

re: comment 78

Though I discussed the topic, I missed explicitly stating it as a mostly orthogonal proposal:

4. have an insn kind that provides additional operands for the subsequent insn

Comment 81 Jacob Lifshay 2020-11-24 05:50:58 GMT

(In reply to Alexandre Oliva from comment #79)
> I've just completed my first skimming over the entire 3.0 ISA.
> 
> The major mistake that I realized I made, and that I'm about to correct, is
> that I wrote in comment 58, that Luke kindly moved here from bug 529, that
> the primary opcode was in the MS byte, but it's actually in the LS byte.

The Power ISA has the Primary Opcode field in bits 0 through 5 -- the MSB end. The Power ISA confusingly labels bits starting from bit 0 at the MSB end through to bit 63 at the LSB end (assuming 64 bits).

Most of Libre-SOC's Python/nmigen code instead uses the more usual bit numbering of 0 at the LSB end up through 63 at the MSB end.

One other confusing difference is that the PowerISA spec uses inclusive ranges, yet nmigen/Python/Rust/C++/... all use half-open ranges by default:
low <= element < high

Comment 82 Luke Kenneth Casson Leighton 2020-11-24 07:15:07 GMT

(In reply to Jacob Lifshay from comment #81)

> One other confusing difference is that the PowerISA spec uses inclusive
> ranges, yet nmigen/Python/Rust/C++/... all use half-open ranges by default:
> low <= element < high

and if you want people to witness a mushroom cloud coming out the top of your head: conversion of microwatt's vhdl code has inclusive hi downto lo but bitconcatenation is expressed in the inverse order to python but numerical constants are not.

https://www.youtube.com/watch?v=UIKGV2cTgqA

Comment 83 Luke Kenneth Casson Leighton 2020-11-24 18:36:58 GMT

(In reply to Jacob Lifshay from comment #77)
> While going through the PowerISA v3.1 spec, I noticed that we can't really
> use Primary Opcode 0 for compressed instructions, since Power reserves
> Primary Opcode 0 with Extended Opcode 256 to mean "Service Processor
> Attention", an instruction implemented by at least POWER8 and POWER9. This
> means that processors would have to read a 32-bit instruction to determine
> if the instruction was that instruction or not, so we can't redefine it to
> be a 16-bit instruction.

we can... by providing it as one of the 16-bit opcodes.  then when C mode
is activated the command is still available.

ahh this is the "attn" instruction.

> If we end up needing a whole primary opcode, we should use Primary Opcode 5
> instead, since it doesn't have any already-used encodings.

Compressed really does need to be at the very least on EXT000 because the encoding of "illegal instruction" needs to be all-zeros.  it would be... anomalous / concerning for a Compressed ISA to not have 0x0000 as "illegal".

Comment 84 Luke Kenneth Casson Leighton 2020-11-24 19:08:33 GMT

(In reply to Alexandre Oliva from comment #80)
> re: comment 78
> 
> Though I discussed the topic, I missed explicitly stating it as a mostly
> orthogonal proposal:
> 
> 4. have an insn kind that provides additional operands for the subsequent
> insn

(i'll come back to the other ones, which are good / worthwhile)

this one is so detrimental to the efficiency of a hardware implementation, that the detriment is sufficient to eliminate it entirely.

the absolute top priority job of an instruction hardware decoder is:

     identify the instruction length.

this job above all others is the absolute, without a shadow of doubt, the
absolute top highest priority above and beyond all and any other.

in a standard 32-bit RISC instruction set such as OpenPOWER v3.0B this is
an absolute brain-dead task:

     word0 word1 word2 word3 word4 ....

maps to:

     instr0 instr1 instr2 instr3 instr4 ...

however now let us introduce OpenPOWER v3.1B which includes 64-bit instructions.  let us also presuppose that we want a 4-way issue.

that means that we can either have at the minimum or the maxium the following:

* all 32-bit opcodes
* all 64-bit prefixed opcodes

that means we need to have a buffer of up to 8x 32-bit words in memory.

now let us have a series of "analysers" which tell us, for each instruction, the answer to the following question:

    does word[N] have a Major Opcode EXT001?  (this is the "prefix" identifier)

note: we haven't actually determined if it's *actually* an instruction.  the only one we KNOW is an instruction is:

    word[0]

so, we have an array of 8 words:

    word[0] [1] [2] [3] [4] [5] [6] [7]

and we have an array of 8 bits which tell us "bits 0..5 equals 0b000001"

     P0 P1 P2 P3 P4 P5 P6 P7

now let us suppose that we have those set to the following values:

     0 0 0 0 0 0 0 0

what does that tell us?  it means, ta-daaa, the 1st 4 words in our buffer are *NOT* 64-bit prefixed instructions.

therefore, in this case, we can take the 1st 4 words, shove them into instruction-processing-units 0, 1, 2 and 3 respectively, along with their lengths: 32-32-32-32

now let us suppose that we have the following values for P0-P7:

     1 0 1 0 1 0 1 0

what that tell us?  it means, ta-daaaa, this is 4x 64-bit prefixed instructions.
in fact, anything according to this pattern is *also* 4x 64-bit prefixed:

     1 X 1 X 1 X 1 X

because the "data" in the 2nd word of each 64-bit instruction could contain the pattern "0b000001" in bits 32-37

therefore, anything matching that pattern "1x1x1x1x" we shove straight into the 4 buffers, along with length-identification pattern:

    64-64-64-64

you can see where that's going.  you basically have a big "Switch" statement covering *all* possible patterns, producing all the permutations:

    32-32-32-32
    64-32-32-32
    32-64-32-32
    64-64-64-64

and additionally the *INDEX*, on a per-target-buffer-basis, of where the 32/64 packet is to be routed to, that target being the multi-issue "instruction" decoder, indexed 0,1,2 or 3.

note:

    AT NO TIME WAS THERE ANY FURTHER INSPECTION OF THE PACKETS OTHER THAN
    "IS BITS 0 THRU 5 EQUAL to 0b000001"

now let's f*** with this utterly simple brain-dead algorithm.  let's throw in some variable-length encoding.  let's add the proposed 16-32 encoding.

let's say that if the 16-bit opcode (in let's say bits 0 thru 3) are equal to 0b1111, that this indicates "escape sequence to decode as a 32-bit instruction"

our detection pattern now comprises:

   * an array of 16x identifiers of the 1st 6 bits of every
     16-bit chunk to the pattern 0b00000
   * an additional array of 16x identifiers of the 1st 4 bits of every
     16-bit chunk to the pattern 0b1111

followed by a MASSIVELY complex permutation network of Switch/Case statements,
potentially several thousand in size, that is at least two orders of magnitude
larger than our simple brain-dead identification algorithm in terms of gate count.

if we try to do this as a chained detection system (a for-loop in hardware),
the gate chain will be SO LONG that our maximum clock rate even in 7nm would probably be limited to somewhere around the 800mhz mark.

so i have to reiterate: it is CRIICALLY important that we DO NOT do a CISC
encoding that involves "deep packet inspection".

the FSM that i came up with for the 10/16-bit encoding is a **POST** length-analysis algorithm that is isolated to that particular local instruction, that does *NOT* in any way interfere with the detection of the packet "chunk" length.

it so happens that there *is* chaining between each of the packets (after the independent length-based post-processing) however each 10/16-bit FSM, because it is only 2 bits, is a really, REALLY simple decoding that involves 1 gate per "level" of dependence.

1 gate delay between each is perfectly acceptable, and an optimising HDL compiler would likely come up with a better solution that turns the analysis of the chain-of-FSMs into only 3 maybe even as low as 2 gates of delay... for the *entire* 4-way multi-issue identification of 4 10/16-bit instructions.

...

i just realised, through writing this out (which is always good) that the above isn't strictly true.  we do actually need 1 bit to tell us that the next instruction is back to "32 bit mode".

however - and this is the kicker - *it's one bit*.  it's not "4 bits"
or "2-3 permutations of 4 possible lots of bits"

that means, darn-it, we may have to actually prototype this in HDL (very
simple HDL) to see what it actually looks like.

Comment 85 Luke Kenneth Casson Leighton 2020-11-24 19:35:19 GMT

(In reply to Luke Kenneth Casson Leighton from comment #84)

> i just realised, through writing this out (which is always good) that the
> above isn't strictly true.  we do actually need 1 bit to tell us that the
> next instruction is back to "32 bit mode".

i also just realised that in proposal v2 encoding, the 8bit nop is likewise the FSM  "return to 32bit mode" indicator that has a knock-on (chain) effect on subsequent instructions.

however - i stress and emphasise: adding the 16+16-immed/reg encoding turns what would otherwise be a dead-straight "is there an 8bit nop pattern" detection into a far more complex system that's already borderline.

sigh i wish OpenPOWER had been designed with 16bit in mind right from the start.

Comment 86 Luke Kenneth Casson Leighton 2020-11-25 14:38:57 GMT

(In reply to Alexandre Oliva from comment #78)
> re: comment 76
> 
> let me straighten out and separate some ideas so that they're not conflated:
> 
> 1. use 2-operand encoding in compressed mode, and modify the compiler to
> give preference to such insns

yep i like this one.

> 2. simplify the mapping by using the same 6-bit opcode encoding that
> uncompressed mode uses (even if we don't map every opcode; just make those
> that we do the same)
> 
> 3. use 1-byte mode-switching instructions
> 
> 3 implies we'd have to look at all opcode bits to tell how to interpret the
> next byte in the stream, i.e., whether it's the second half of a 16-bit
> insn, or the opcode byte of a 32-bit insn (which might be a switch-back, but
> I disgress)

with jacob kindly going through the analysis of a multi-issue length/mode
identifier, i realised just now that the proposed 8-bit alignment concept
has a fundamental flaw: extracting the identified instructions out of the
shift register which contains the raw instruction stream will require *8-bit*
indexing.

for a 4-way multi-issue, which will have up to 4x8=32 bytes of data in
the "raw" shift register, that would imply an absolutely insane 8x32-way 
8-bit parallel multiplexer to get any one of the 32 bytes into their respective targetted 4x 64-bit parallel 2nd-stage decoder buffers.

that's 8192 MUXes which are.. what... 5 gates each?  that's a whopping 40,000 gates which, to give you some idea of scale, alexander, is 3x times larger than a 64-bit multiplier.

even a 16-bit encoding (aligned at 16 bits) is going to be 4x16-way 16-bit
parallel multiplexing however with the reduction in both dimensions that's an O(N^2) drop (down to "only" 10k gates) which is "just about borderline acceptable".

bottom line i think we eliminate the 8-bit-aligned concept on the basis that it's simply far too costly, much as i liked the idea of using the LSB of the PC as an encoding-identifier.

that leaves just simply aligning to 2-byte (hword) boundaries... which of course leaves no space for a marker unless it's a 16-bit encoding... and after a few more rounds of logical deductive reasoning we're back at the 10/16-bit encoding scheme.

Comment 87 Jacob Lifshay 2020-11-25 22:36:01 GMT

(In reply to Luke Kenneth Casson Leighton from comment #86)
> with jacob kindly going through the analysis of a multi-issue length/mode
> identifier, i realised just now that the proposed 8-bit alignment concept
> has a fundamental flaw: extracting the identified instructions out of the
> shift register which contains the raw instruction stream will require *8-bit*
> indexing.
> 
> for a 4-way multi-issue, which will have up to 4x8=32 bytes of data in
> the "raw" shift register, that would imply an absolutely insane 8x32-way 
> 8-bit parallel multiplexer to get any one of the 32 bytes into their
> respective targetted 4x 64-bit parallel 2nd-stage decoder buffers.

Do we need to decode 4 64-bit instructions per cycle? can we just have it decode up to 4 instructions or up to 16 bytes of instructions, which ever is smaller per cycle?

> that's 8192 MUXes which are.. what... 5 gates each?  that's a whopping
> 40,000 gates which, to give you some idea of scale, alexander, is 3x times
> larger than a 64-bit multiplier.
> 
> even a 16-bit encoding (aligned at 16 bits) is going to be 4x16-way 16-bit
> parallel multiplexing however with the reduction in both dimensions that's
> an O(N^2) drop (down to "only" 10k gates) which is "just about borderline
> acceptable".

Another way we could structure the decoder is: rather than using a giant shifter/aligner to compact the decoded instructions into a contiguous list, instead, we could just have overlapping decoders that start at each 16-bit offset and we just ignore the decoded instructions for those decoders that didn't start at the right offset. This totally avoids the need to have the aligning matrix at the cost of having much wider issue, which might be worth it.

Comment 88 Luke Kenneth Casson Leighton 2020-11-25 23:52:51 GMT

(In reply to Jacob Lifshay from comment #87)

> > for a 4-way multi-issue, which will have up to 4x8=32 bytes of data in
> > the "raw" shift register, that would imply an absolutely insane 8x32-way 
> > 8-bit parallel multiplexer to get any one of the 32 bytes into their
> > respective targetted 4x 64-bit parallel 2nd-stage decoder buffers.
> 
> Do we need to decode 4 64-bit instructions per cycle?

mmm good question.  let's think it through

* SV-P64 - the probability is quite high that it will be chucking out
  a large batch of (multi-issue, OoO) operations all on its own.  i.e.
  VL=4 or VL=8

* still SV-P64 - however if those are VL=8 and elwidth=8-bit, actually
  it'll fit into a single 64-bit SIMD pipeline.

  therefore, answer: yes, we do want to be able to decode multiple SV-P64
  instructions, otherwise there's an unfair penalty to SV-P64 that will
  discourage its use.

* v3.1B 64-bit prefix - these are "just like 32-bit except happen to have
  larger immediates" - and consequently are one-only

  therefore, answer: yes, really, there is a strong incentive.

* SV-C64-swizzled (this is the concatenation of:
  - same prefix as SV-P48 plus
  - 16-bit swizzle data plus
  - any arbitrary (appropriate) 16-Compressed instruction

  here it can be VL-looped (because the SV-P48 11-bit prefix can mark
  things as VL-Vectorised) but in cases where that doesn't happen it's
  just "yet another scalar instruction"

  therefore, answer: yes, again, scalar SV-C64-swizzled has an incentive
  to be multi-issue

in all cases i'd conclude that there's a strong incentive to allow multi-issue execution of 64-bit encoded instructions.

>  can we just have it
> decode up to 4 instructions or up to 16 bytes of instructions, which ever is
> smaller per cycle?

16 bytes being able to fit 2x 64-bit, 4x 32-bit or 8x 16-bit, yeah that would keep the gate-count down to "a little less insane" and would be much more manageable.

the only thing being, we'd have to make sure to keep an eye on the IPC (instructions per clock), to make sure it's acceptable.  if certain encodings get penalised, that's a definite "minus" because it will strongly discourage their uptake / usage.

> > that's 8192 MUXes which are.. what... 5 gates each?  that's a whopping
> > 40,000 gates which, to give you some idea of scale, alexander, is 3x times
> > larger than a 64-bit multiplier.
> > 
> > even a 16-bit encoding (aligned at 16 bits) is going to be 4x16-way 16-bit
> > parallel multiplexing however with the reduction in both dimensions that's
> > an O(N^2) drop (down to "only" 10k gates) which is "just about borderline
> > acceptable".
> 
> Another way we could structure the decoder is: rather than using a giant
> shifter/aligner to compact the decoded instructions into a contiguous list,
> instead, we could just have overlapping decoders that start at each 16-bit
> offset and we just ignore the decoded instructions for those decoders that
> didn't start at the right offset. This totally avoids the need to have the
> aligning matrix at the cost of having much wider issue, which might be worth
> it.

i like the principle: sadly though, the PowerDecoder2 for OpenPOWER opcodes is so insanely large (4k gates and that's just for the 150-or-so integer operations) that it's not really feasible.

i'd have to double-check given that it's now subdivided into "Subset Decoders" but i'd be really *really* surprised if it was small enough to be multipliable.

OpenPOWER is not like MIPS or RISC-V, the encoding switch-case cascade is unfortunately absolutely massive.

plus, the algorithm for which instructions go into which actual "buckets" still has yet to apply SimpleV!  we have *another* phase to add into the middle, in between the length/mode decoder and the "multi-issue buckets" which is the VL for-loop!

and that for-loop, remember, needs to be subdivided by (64/elwidth) to actually determine the SIMD quantity and the corresponding auto-predication-mask (which deals with cases where VL=5 or VL=3).

once the SIMD-if-i-cation is carried out *finally* actual "element" instructions can be dropped into the "real" multi-issue buckets.

Comment 89 Alexandre Oliva 2020-11-29 04:16:33 GMT

I had not noticed before writing comment 16 in bug 532, but the 16-bit immediate compressed encoding is very very similar, in function and possibilities, to the extend-next-insn opcode I'd suggested before.

It could very well be used to significantly extend the range of registers accessible in compressed mode, whether or not we go for 2-reg encoding in 16-bit insns, which is just what I'd hoped to accomplish with an extend-next opcode.

This bit can even be used to make room for some extended opcodes if there are any compressed primary opcodes that have no use for additional operands, e.g. nop or unary operations, and even to reserve room for future extensions.

Comment 90 Alexandre Oliva 2020-11-29 04:33:46 GMT

Here's a potential conflict, and another opportunity for an escape encoding for future extensions, that just occurred to me while getting my brains fried :-) trying to figure out how to decide the best way to compress each insn as it goes by, with zero look-ahead:

if we are in 16-bit mode and encode an insn that switches to 32-bit for one insn, and then we find a 10-bit insn, we could:

- rule that out, since it could have been encoded as a 16-bit insn (every 10-bit one can, right?)

- accept it, but require the 10-bit insn to state the next insn is to be in 16-bit mode, so as to agree with the 1-insn switch from the previous 16-bit insn

- accept it, and specify which of the settings prevails


Another occurrence we may want to ponder whether to allow is a pair of adjacent 10-bit insns, the first of them saying the next insn is a 32-bit one.  Though a 10-bit insn can be encountered whenever a 32-bit insn is expected, tt makes very little sense, if my assumption is correct that every 10-bit insn can also be encoded as 16-bit.  We might get some simplifications (and opportunities for future extensions) by ruling such pairs (or long sequences) out.

Comment 91 Jacob Lifshay 2020-11-29 04:45:58 GMT

(In reply to Alexandre Oliva from comment #90)
> Another occurrence we may want to ponder whether to allow is a pair of
> adjacent 10-bit insns, the first of them saying the next insn is a 32-bit
> one.  Though a 10-bit insn can be encountered whenever a 32-bit insn is
> expected, tt makes very little sense, if my assumption is correct that every
> 10-bit insn can also be encoded as 16-bit.  We might get some
> simplifications (and opportunities for future extensions) by ruling such
> pairs (or long sequences) out.

I'd argue that two 10-bit instructions in a row should be legal, since there could be a branch to the second 10-bit instruction like the following:

h.add r5, r3
loop:
h.add r5, r4
...
b loop

If we didn't allow 2 10-bit instructions in a row, we'd have to change 1 of them to 32-bit or add a nop or something -- wasting space.

Comment 92 Alexandre Oliva 2020-11-29 06:10:47 GMT

> that's 8192 MUXes which are.. what... 5 gates each?  that's a whopping 40,000 gates which, to give you some idea of scale, alexander, is 3x times larger than a 64-bit multiplier.

fortunately, that's *way* exaggerated.

that's because (i) the shifters for the earlier instructions only have to look at initial bytes, and (ii) the shifters for the latter instructions only have to look at the few actual possibilities that can be reached by prior-insn combinations

Say we're looking at a buffer of 32 bytes, and instructions can be at most 8-bytes long.  Can one of the 4 instructions we're trying to process start at byte 25, or any later byte?  Certainly not.  The latest potential starting position is byte 24, and that's only reachable at all if the opcodes at 0, 8 and 16 encode 64-bit instructions.

Odd bytes after 17 don't even have to be looked at.  byte 15 doesn't either.

And then, the even/odd distinction enables us to further reduce that ballpark, because it reduces potential encoding ambiguities between 16- and 32-bit opcodes, which I believe reduces the number of gates.  IOW, tt's not so obvious to me that it actually makes things worse.

But yeah, it's a big muxer, with 257 inputs (the initial execution mode matters) and 256 outputs (assuming compressed insns are mapped to their uncompressed versions by this very muxer, otherwise add a mode bit to each insn for the remapper to know whether it needs further decoding).

Alteranatively to 257 inputs, we might keep compressed-mode insns always at odd bytes in the input to the muxer, by keeping a mode-switching nop as the first byte in the input buffer as long as we're in compressed mode.  The muxer could very well manage to ignore it entirely, instead of filling in the first decoded slot with a nop.  Indeed, we could configure the muxer to skip however many mode-switching nops as it makes sense, though each additional one adds some gates, unlike (ISTM) the first one.

All that said, I haven't done Karnaugh map minimizations for hardware purposes in a long time, and never such massive ones, so my intuitions may be way off.  I'd have to work it out more thoroughly to trust them ;-)

Comment 93 Luke Kenneth Casson Leighton 2020-11-29 13:24:16 GMT

(In reply to Alexandre Oliva from comment #89)
> I had not noticed before writing comment 16 in bug 532, but the 16-bit
> immediate compressed encoding is very very similar, in function and
> possibilities, to the extend-next-insn opcode I'd suggested before.

what you propose sounds perfectly reasonable for single-issue designs of
10-20 years ago.

i really, really do not want this proposal to go down the route of
adding yet more complexity at the 16-bit level, re-expanding length
based on deep-dive multi-bit selections:

* first level EXTNNN selection analysis
* second level FSM selects 16-bit
* third level selects bit0=1,bit15=15 to indicate compressed.immediate mode
* FOURTH level BACK to 32-bit (or greater) based on multi-bit brownfield
  encoding patterns?

no.  this is too much.

the complexity is too great for high performance multi-issue designs
to withstand.

length-mode decoding really must be kept to the absolute bare minimum, with
no "bleed-back" between 1st pipeline stage decode and 2nd stage.

SV is already going to require a *THREE* stage instruction decode.

Comment 94 Luke Kenneth Casson Leighton 2020-11-29 13:30:16 GMT

(In reply to Alexandre Oliva from comment #92)

> All that said, I haven't done Karnaugh map minimizations for hardware
> purposes in a long time, and never such massive ones, so my intuitions may
> be way off.  I'd have to work it out more thoroughly to trust them ;-)

:)

whatever it is, it's an order of magnitude too large.  8-bit shifting
is inherently going to be 4x larger than anything that's involving 16-bit
shifting.  with the amount of in and outs, inherently it's going to be
too much.

this is the *decoder* phase we're referring to, here.  partial-decoded
instructions should normally be stored in 1R1W SRAMs with clean, clear subdivisions: we're looking at 4R4W (which is insanely costly).

bottom line, 8-bit aligment of instructions is eliminated from consideration
(the 8-bit marker and using PC[0] as state).

Comment 95 Luke Kenneth Casson Leighton 2020-11-29 13:43:14 GMT

btw keep it coming, alexandre - the ideas you come up with may contain insights that others missed.

Comment 96 Alexandre Oliva 2020-11-29 22:07:41 GMT

I'm surprised/confused by some proposed encodings in https://libre-soc.org/openpower/sv/16_bit_compressed/ that affect my attempt at implementing an accurate compression estimator.

- is it really the case that an operation as common as 'mr' can only be encoded in compressed mode using an 16-bit + 16-bit immediate encoding?  (I gather it is, because all of the Immediate opcodes have the N and M bits set to 1)

- or are encodings that don't have N or M explicitly named as such ones that use those bits for other purposes?  i.e., the immediate opcodes, despite having those bits both set, are 16-bit insns without an extra 16-bit immediate, and also without an opportunity to switch modes for the next insn?

The logic I built into the estimator went by the instruction formats in 2.1.1 and general directions in 2.1, but as I get to the proposed opcodes, things don't quite seem to fit.  Help?

Comment 97 Alexandre Oliva 2020-11-29 22:16:00 GMT

> what you propose sounds perfectly reasonable for single-issue designs of
10-20 years ago.

Err, what you appear to be objecting now is to the 16-bit N=M=1 encoding for a 16-bit immediate, which is not something I propose.

Considering that the monster alignment muxer has to deal with it (if we adopt that sort of encoding), but it's not responsible for the translation from 16-bit to 32-bit instructions, I don't see that the alignment muxer would care at all about what's in the 16-bit extension.  Therefore, the remapper could very well use those bits for purposes other than just an extension of the varying-width immediates already present in several of the immediate opcodes.

So, are you saying we can't afford to have the 16+16-bit immediate encodings with N=M=1, or that there's some other reason, unrelated to the alignment muxer, to avoid altogether using compressed-mode 16+16-bit instructions that use the extension for purposes other than just extension of immediate operands?

Comment 98 Jacob Lifshay 2020-11-29 22:41:34 GMT

(In reply to Alexandre Oliva from comment #96)
> I'm surprised/confused by some proposed encodings in
> https://libre-soc.org/openpower/sv/16_bit_compressed/ that affect my attempt
> at implementing an accurate compression estimator.
> 
> - is it really the case that an operation as common as 'mr' can only be
> encoded in compressed mode using an 16-bit + 16-bit immediate encoding?  (I
> gather it is, because all of the Immediate opcodes have the N and M bits set
> to 1)

The proposed instructions there are an initial guess, IIRC we plan to rearrange/modify the supported compressed instructions to actually follow the observed instruction statistics. as part of that, `mr` is highly likely to be included in the final instruction set.

Comment 99 Luke Kenneth Casson Leighton 2020-11-30 01:01:50 GMT

(In reply to Alexandre Oliva from comment #97)
> > what you propose sounds perfectly reasonable for single-issue designs of
> 10-20 years ago.
> 
> Err, what you appear to be objecting now is to the 16-bit N=M=1 encoding for
> a 16-bit immediate, which is not something I propose.

ah then that is a misunderstanding.  what i understood you to have said is, at the length-mode decode phase, embedded somewhere in the FSM:

   if N=M=1 # 16bit.imm mode
       # but wait... there is more
       # complexity
       if bits[A:B] == pattern1  or
            pattern2, 3, 4, 5 or 10
          # arbitrary suite of encodings
          # changes length and mode
          length = 32
          mode = compressed.something
       elif
       elif
       else
          length = 16
          mode = 16.imm

this level of complexity is just not going to fly.

however the N=M=1 mode (without additiobal complexity which is deginitely not acceptable) because it is just 2 bits is on the borderline of acceptable.



> Considering that the monster alignment muxer has to deal with it (if we
> adopt that sort of encoding), but it's not responsible for the translation
> from 16-bit to 32-bit instructions, I don't see that the alignment muxer
> would care at all about what's in the 16-bit extension.  

if as above pseudocode there exists a fourth level of decoding affecting the determination of length and mode that changes the FSM entirely to one where a simple hierarchy is no longer possible, merging the two phases (see below) this is unacceptable.


> Therefore, the
> remapper could very well use those bits for purposes other than just an
> extension of the varying-width immediates already present in several of the
> immediate opcodes.

.... "varying width immediates"... ah i think i know what you are referring to.  you are referring to the fact that some subencodings of MN=1 use different length portions of the 16 bits to specify an immediate?

i thought that you had said, "some subencodings would affect the FSM of the INSTRUCTION length, back 2 stages ago in the pipelined instruction decoder".

right.

mentally you need to separate out the two following distinct phases:

    phase 1: raw instruction length/mode detection

    phase 2: operand, register and imm decoding.
             (these are different per
              length per mode obviously
              and also have further
              subdecoding)

phase 1 because it is the raw data stream has to be as drastically simple as possible, especially and particularly because of a multi issue decode context.

once that barrier is passed, and multiple INDEPENDENT instructions are passed to parallel processing units, given that those are independent and have local access to all state they need, there is far less pressure.

if it takes 4,000 gates to decode a given instruction (because yes, OpenPOWER really is that complex, that *is* how many gates it takes) then so be it but bear in mind this is **AFTER** length/mode has been identified.

so.

back at the 16.imm (MN=1), having different subencodings select different bits to fill out an immediate? this is pretty normal for OpenPOWER.  have a look for example at the shift instructions section, in v3.0B PDF.  compare it to addi and you see the fields (immediates) are not only completely different lengths, they're in completely different bit allocations.

this is *normal*.

however - and i stress once again - this is AFTER the length/mode has been determined to be 16bit.imm

to have some of the MN=1 unused sub encodings go backwards in time and set a 32bit instruction length or 48 bit or N bit: this we are *not* doing.


> So, are you saying we can't afford to have the 16+16-bit immediate encodings
> with N=M=1, 

incorrect.

Comment 100 Luke Kenneth Casson Leighton 2020-11-30 01:06:42 GMT

(In reply to Jacob Lifshay from comment #98)

> The proposed instructions there are an initial guess, IIRC we plan to
> rearrange/modify the supported compressed instructions to actually follow
> the observed instruction statistics. as part of that, `mr` is highly likely
> to be included in the final instruction set.

i couldn't get it into 10bit without compromising 10bit.

in theory... *sigh* one of the operations (further down the priority such as nor) could be taken off the list in 10bit mode:

    if Cmajor.minor = ABC:
        if 10bit:
             OP = MR
        else:
             OP = NOR

which is...  bleugh to say the least but the idea is, as jacob said, to prioritise by way of statistical analysis.

Comment 101 Jacob Lifshay 2020-11-30 01:13:06 GMT

one other thing:
some assembly-level instructions are really pseudo-ops for more general instructions (e.g. `not r3, r5` is really `nor r3, r5, r5`), those mappings can be different for 10-bit vs. 16-bit vs. 32-bit.

Comment 102 Luke Kenneth Casson Leighton 2020-11-30 01:18:28 GMT

(In reply to Alexandre Oliva from comment #96)
> I'm surprised/confused by some proposed encodings in
> https://libre-soc.org/openpower/sv/16_bit_compressed/ that affect my attempt
> at implementing an accurate compression estimator.
> 
> - is it really the case that an operation as common as 'mr' can only be
> encoded in compressed mode using an 16-bit + 16-bit immediate encoding? 

this sentence is ambiguous

it could be interpreted as "a 32 bit compressed instruction where the 1st 16 bits is opcodes and the 2nd 16 bits is an immediate".  [this type of encoding does actually exist in VLE and i am dead set against us including it]

there is no 16bit + 16bit immediate encoding.  there is the 1st level length+mode identifier FSM.

unlike VLE, no 16 bit instruction indicates that "actually its length is 32 bit and that after the 16 bit instruction opcodes there is a further 16 bits of immediate"

there is however only one way to get to "full" 16bit mode and that is to go first via the 10bit mode.

do you mean, "is the only way to gain access to a 'mr' operation to make the transition through 10bit compressed into 16bit compressed?"

if so the answer is yes, because the space is so tight in 10bit i couldn't think how to do it at the time.

Comment 103 Luke Kenneth Casson Leighton 2020-11-30 01:24:28 GMT

(In reply to Jacob Lifshay from comment #101)
> one other thing:
> some assembly-level instructions are really pseudo-ops for more general
> instructions (e.g. `not r3, r5` is really `nor r3, r5, r5`), those mappings
> can be different for 10-bit vs. 16-bit vs. 32-bit.

there are also several ways to encode 'mr' in v3.0B.   addi rt, ra, 0 is just one.

space in 10/16 bit modes is so tight that we do not wish to waste it with two ways to achieve the same result.

Comment 104 Luke Kenneth Casson Leighton 2020-11-30 01:48:10 GMT

i made a note that 10bit nor is actually mr. 16bit is still nor.

Comment 105 Alexandre Oliva 2020-11-30 01:48:36 GMT

ok, one or both of us are making some incorrect assumptions.  here are the facts that come across as inconsistent to me:

1. https://libre-soc.org/openpower/sv/16_bit_compressed/ says, in 2.1: The current "top" idea for 0b11 is to use it for a new encoding format of predominantly "immediates-based" 16-bit instructions (branch-conditional, addi, mulli etc.)
[...]
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

2. the same page says, in 2.1.1:
| 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm


3. and then, in 2.1.2: Immediate Opcodes only available in 16-bit mode, only available when M=1 and N=1 and when Cmaj.min is not 0b001.1.
| 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
[...]
| 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi

4. in 2.1.9: Condition Register
| 0 1 1 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
| 1 0 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crand
i.e., there is M, but N seems to be repurposed as part of the extended cr opcode, so N=M=1 actually encodes a 16-bit insn without a 16-bit immediate


From that, I conclude that either:

a. addi, in compressed mode, has M=N=1 and therefore is necessarily a 16-bit+16-bit-imm insn, i.e., 16-bit insn formatted as above, followed by a 16-bit immediate, for a total 32 bits, and the CR encoding is an aberration, or

b. addi and other immediates, in compressed mode, repurpose M and N as part of the opcode, sort of, just like CR insns repurpose N alone, and they are actually just a 16-bit insn formatted as above, without a 16-bit extension, and without a possibility of one

now, in either case, we can't just look at M and N to tell the length of an insn while in compressed mode.  we have to also look at the opcode to tell whether those bits actually stand for the regular meanings of M and N, or part of the fixed bit pattern of certain compressed insns.

but then, from your adamant statement that those two bits ought to be enough to tell the compressed insn length, and even that's barely acceptable, I conclude it can't be so, and my head explodes ;-)

Comment 106 Alexandre Oliva 2020-11-30 02:02:15 GMT

the other head-exploding apparent (?) contradiction is that, if we're planning to have compressed insns that could have N=M=1 to gain a 16-bit extension, or other values for N and M to guide the decoding of subsequent insn(s), why, insns that may take N=M=1, must these extra 16 bits be used exclusively to encode (extensions of) immediates, and absolutely never to encode register fields, or extensions of register fields?

the complexity of using 32 bits for a compressed insn is already dealt with before we even look at those extra 16 bits, they're passed-through to the second-stage decoder to be remapped into a 32-bit instruction.

why can the bit-shuffling in this decoder/remapper even deal with immediates of different lenghts, depending on the opcode encoded in the first half-word, and it can even map registers according to a mapping table, but it can't possibly deal with additional bits for register fields, or an additional register field, even one that could be copied straight to the corresponding uncompressed insn, without any mapping whatsoever, if that's the format (input and output of the remapper) for the selected opcode?

In both cases, we're going to be rearranging bits from a pair of half-words to the opcode-dictated 32-bit instruction format.  And yet you seem to react with a knee-jerk rejection to one, and a grand-fathering of the other, while they seem entirely equivalent to me.  Can you elaborate on the reasons for the differences?

Comment 107 Alexandre Oliva 2020-11-30 02:14:39 GMT

I think a simpler way to phrase the question in comment 106 is: why would the remapper (decompressor?) care whether the bits coming from the 16-bit "immediate" half-word are shuffled into parts of the decompressed insn that the decoder proper will interpret as part of an immediate field, rather than a register field, or even an extended opcode field?

Comment 108 Luke Kenneth Casson Leighton 2020-11-30 03:19:04 GMT

(In reply to Alexandre Oliva from comment #105)

> 1. https://libre-soc.org/openpower/sv/16_bit_compressed/ says, in 2.1: The
> current "top" idea for 0b11 is to use it for a new encoding format of
> predominantly "immediates-based" 16-bit instructions (branch-conditional,
> addi, mulli etc.)

where the length is, and always has been, including the bits of the immediate, 16 bits total and 16 bits only.


> From that, I conclude that either:
> 
> a. addi, in compressed mode, has M=N=1 and therefore is necessarily a
> 16-bit+16-bit-imm insn, i.e., 16-bit insn formatted as above, followed by a
> 16-bit immediate, for a total 32 bits, 

if this was the case, i would have illustrated it with a clear diagram, showing a total of 32 bits.

also i would have put, "this is a total of 32 bits"

also i would have said, "this is identical to the 32 bit VLE encoding".

none of those things are said, therefore logically we conclude that (a) is erroneous.


the reasons as twofold:

1) if you are going to have a 32 bit encoding that includes 16 bit immediate, you might as well use...  the standard v3.0B mode

2) the complexity of mixing phase 1 and phase two is far too great.


even reason (1) is enough to permanently and irrevocable eliminate 16+16 encoding from all consideration.



> b. addi and other immediates, in compressed mode, repurpose M and N as part
> of the opcode, sort of, just like CR insns repurpose N alone, and they are
> actually just a 16-bit insn formatted as above, without a 16-bit extension,
> and without a possibility of one

this is correct.
 
> now, in either case, we can't just look at M and N to tell the length of an
> insn while in compressed mode. 

this is incorrect.  the length is 16bit, period.

now, it happens to be the case that at rhe start of the length/mode FSM, when the 16 bits contains an EXTNNN you IGNORE the 1st 5 bits (the remainder being a 10bit encoding) but that does not change the fact that it is still 16 bits.
 

> we have to also look at the opcode to tell
> whether those bits actually stand for the regular meanings of M and N, or
> part of the fixed bit pattern of certain compressed insns.

no, you look at the 2 bits as part of the length/mode FSM.

once you have the mode, the meaning of M and N switches to one of 3 sub-modes, these are:

* 10bit mode (ignore EXTNNN bits)
* 16bit mode
* 16bit.immed mode

again: repeat again.  repeat again: *all of these are exactly 16 bits*.

repeat again: all are 16 bits.

not 16+16.

not 10.

not 32

16 and 16 only.


> but then, from your adamant statement that those two bits ought to be enough
> to tell the compressed insn length,

when in Compressed mode, the length is 16 bits.  period.

decoding of those 16 bits requires the FSM 

roughly as follows:

prev=std.32 # start of FSM
insn = insnbitstream[0..15] # ONLY 16 bits
                            # NEVER 32

if prev == std.32:
   if insn[EXTBITS] == Compressed
        next = C.10bit
if prev == C.10bit:
   if insn[15] == 1:
        next = C.16bit
   else
        next = std.32
if prev == C.16bit:
    if insn[0] == 1 and insn[15] == 1
        DECODE 16 BITS AS C.16.immediate
        CRITICAL NOTE, LENGTH IS 16 BITS
        NOT REPEAT NOT 32
    elif insn[15] == 0
        next = std.32.oneonly
    elif insn[0] == 1 and C.Maj == 0b010
        DECODE 16 BITS AS ALTERNATE
        STUFF
        LENGRH IS STILL 16 BIT

current = next

something like that.  it is late so i have not checked the page, some of those may be ==0 not =1 for bit 0 / 15.

the **ONLY** time that the length is not 16 bit in Compressed mode is when the **NEXT** instruction is a Standard v3.0B encoding.

i.e. when LEAVING the Compressed mode entirely.

again.

to repeat.

length is ONLY ever 16 bit in Compressed mode.

length is NEVER 32 bit in Compressed mode.

Comment 109 Alexandre Oliva 2020-11-30 04:47:45 GMT

/me blinks, blinded by the light that just hit him

yeah, now it all makes sense. I found it so hard to believe that the 16-bit immediate insns were limited to such narrow immediate fields that I managed to twist 16bit/imm into 16+16. I wanted it so bad that I made it so. *blush* wow, what a mess I have made! apologies for being so dense, and for the noise.

I (think I) get it now. compressed mode is 16-bit to enter (called 10-bit, because 6 bits are used by the opcode that tells us it's a 16-bit insn we're lookng at, rather than the usual 32-bit), and 16-bit only while in compressed mode. now, some of the 16-bit insns may only switch back to 32-bit mode permanently, while others can also switch for one insn (N is not repurposed), and some can't switch out at all (M is repurposed)

this means the logic in the compression estimator I wrote on Saturday is all wrong. dang!

ok, I went through it all again, under this new understanding, and though it's almost as if I was looking at a fully rewritten document, it all makes a new sense to me ;-)

new questions pop up:

- what's the use of N, when M is not there, as in the "16-bit mode only" blocks of encodings under 2.1.7 Logical and 2.1.8 Floating Point insns?

- "SV Prefix over-rides help provide alternative bitwidths for LD/ST" -> I had previously assumed this was an instruction prefix, a little bit like the extend-next-insn opcode I'd suggested before. Now, I assume it's not. how does that work, then? I suppose I may have to know it, to estimate correctly the compression of these instructions.

Comment 110 Alexandre Oliva 2020-11-30 05:16:17 GMT

some more thinking about the massive alignment muxer, and opcode formats

we use 6 bits, the primary opcode, to tell whether we're looking at a regular 32-bit insn, or a 10-bit insn.

since any presumed 32-bit insn may turn out to be one of these 10-bit insns, we have to be prepared to have another 32-bit (or 10-bit) insn at the next half-word, or a 16-bit insn, depending on yet another bit

IOW, the boolean function that tells us whether the next insn starts at the next half-word, or at the next word, takes current mode, the 6 opcode bits, one of which overlaps with the N bit, and the M bit, that doesn't.

wouldn't it make sense to move the M bit so that it overlaps with the primary opcode, so as to (potentially?) simplify the muxer's alignment decision, by reducing the number of bits it has to look at?

indeed, this would enable us to drop entirely the requirement for a pair of opcodes that differ only in the least-significant bit of the primary opcode.  we could take any pair of primary opcodes that differed by a single bit, and use that single bit as the M bit.

looking further into minimizing the inputs that the muxer has to take into account...  what if 16-bit insns had Cmaj.m placed in the remaining primary opcode bits (i.e., the ones not used for M and N)?  then, the muxer wouldn't have to look at different bits to tell the length of the present insn

we'd go from a function of (mode, 0..8 (primary opcode and Cmaj.m), M), 11 bits, to (mode, 0..5), 7 bits.

it's not obvious to me that it would actually provide us with any function minimization opportunities, or just require more hardware buffers for the bits that would then get more uses.  what does your intuition tells you?

Nicely, changing these would only require mechanical changes to the current encodings.

Comment 111 Jacob Lifshay 2020-11-30 05:19:45 GMT

(In reply to Alexandre Oliva from comment #109)
> - "SV Prefix over-rides help provide alternative bitwidths for LD/ST" -> I
> had previously assumed this was an instruction prefix, a little bit like the
> extend-next-insn opcode I'd suggested before.

It is, actually, though we haven't yet figured out if/how it will work with anything other than 32-bit operations as the base instruction. We did decide 64-bit base instructions won't get SVPrefix versions, since those are mostly useful for scalar only and we don't want to handle 80/96-bit instructions. 64-bit is the biggest instruction we'll support.

>  Now, I assume it's not.  how
> does that work, then?  I suppose I may have to know it, to estimate
> correctly the compression of these instructions.

SVPrefix is supposed to provide the element width for prefixed instructions since most instructions don't support 8/16-bit operations (no 8-bit add). Since load/store instructions *do* provide all of the required element widths (at least for integer load/store instructions, I don't think Power has f16 load/store, though I'm likely wrong), those bits in the prefix can be reused to encode things load/store specific such as a strided load. Alternatively we could require those bits anyway to simplify the decoder.

Currently, there are no defined instructions for compressed 16-bit instructions using SVPrefix, so that won't yet need to be accounted for. Even when we do want to start defining smaller-than-64-bit SVPrefix instructions, we don't yet have any compiled code that uses SVPrefix (since the compiler support is not yet implemented), so we'd be making an educated guess.

Comment 112 Cole Poirier 2020-11-30 05:28:24 GMT

(In reply to Alexandre Oliva from comment #109)
> /me blinks, blinded by the light that just hit him
> 
> yeah, now it all makes sense.  I found it so hard to believe that the 16-bit
> immediate insns were limited to such narrow immediate fields that I managed
> to twist 16bit/imm into 16+16.  I wanted it so bad that I made it so. 
> *blush*  wow, what a mess I have made!  apologies for being so dense, and
> for the noise.
> 
> I (think I) get it now.  compressed mode is 16-bit to enter (called 10-bit,
> because 6 bits are used by the opcode that tells us it's a 16-bit insn we're
> lookng at, rather than the usual 32-bit), and 16-bit only while in
> compressed mode.  now, some of the 16-bit insns may only switch back to
> 32-bit mode permanently, while others can also switch for one insn (N is not
> repurposed), and some can't switch out at all (M is repurposed)

For what it’s worth, I just reviewEd the 16b compressed encoding page and I can definitely see how Alexandre came to the operating paradigm i.e. set of axioms interpreted from the wiki page, that he was operating under until this comment. I find the current diagrams and explanations on the page very confusing, especially because the discussion of C16 does also discuss SV prefix at points. Noticing that this will be the 110th comment on this bug report, may I suggest we open a new bug report entitled “Editing and reorganizing  16 bit compressed encoding proposal for greater clarity”? I think all of the necessary information is there, however, it can be presented in with a small amount of reorganization and editing to be immediately smacks-you-in-the-forehead obvious, instead of rather difficult, easy to misunderstand/misinterpret, and confusing.

Comment 113 Alexandre Oliva 2020-11-30 08:21:50 GMT

actually, this thinking further of the detection of 10-bit insns, mode-switching or not, led me to an interesting realization

our reasoning seems to be significantly affected from a perception distortion out of the notion that one mode has 32-bit insns, while the other has 16-bit insns

well, that's not true.  an instruction appearing while in uncompressed mode requires us to look at 6 of its bits to tell whether the next instruction is at the next half-word, or at the next word, and at another bit on top of them (unless we move M as I suggested) to tell how to look at (i.e., what mode it's in)

now let's change the perspective a bit.  instead of starting from the premise that uncompressed insns are 32-bit and compressed ones are 16-bit, let's start from a premise that is just as invalid, namely, that insns in either mode take 16-bits

in uncompressed mode, we look at the primary opcode to tell whether it really is a 16-bit (or 10-bit, as we call them) instruction, or whether it extends over 32, or even 64 bits!

but in compressed mode, we can't have even one bit or opcode to extend to tell the insn isn't just 16-bits long, but rather 32, because...  somehow the reasons that make it acceptable in one mode don't apply to the other, even if we had to look at the very same 6 bits plus mode?!?

conversely, if looking at all those bits is so bad that it's unacceptable in compressed mode, maybe it's just as bad in uncompressed mode and we should find another way to go about it, one that doesn't take up two of the few major opcodes.

consider, for inspiration, something that won't quite work because of the 64-bit instructions:

- 32-bit words may contain either a single 32-bit insn, or a pair of 16-bit insns

- there's one 32-bit opcode, primary or extended, that encompasses:

-- 6+ bits to make it recognizable

-- 16-bits for a 16-bit insn

-- up to 10 bits that guide the decoding of *statically* subsequent insns

- there's also one 16-bit opcode that encodes:

-- 4+ bits to make it recognizable

-- up to 12 bits to guide the decoding of statically subsequent insns

these bits are similar in purpose to our current M and N, but they're packed into an insn like the above, so that the decoder can readily tell, even ahead of time, where insns are, and how to decode them

two possibilities of encoding of these bits readily occur to me:

- 1 bit per word, telling whether the word holds a 32-bit insn or a pair of 16-bit insns.  all 32-bit insns remain aligned.

- 1 bit per upcoming insn, telling whether it's a 16- or 32-bit insn

whether such bits, when present in an insn, queue up after bits that are already there, or reset the queue and start affecting the next insn, would have to be worked out.  there are upsides and downsides both ways: one favors keeping a long queue so that alignments for more insns per cycle can be precomputed; the other favors locality.  The latter, combined with a rule that only the last insn in the queue may contain additional decoding bits.  If the queue runs out, we return to traditional mode; ditto when we take a branch.

this enables us to use the full 16 bits of compressed insns, even in the insn that starts pre-encoded-length mode (though that one still takes up 32 bits), since the M and N bits of two consecutive insns are compressed into a single bit or two elsewhere.

with 1 bit per word, the most compact encoding starts a sequence with a 32-bit insn that carries 10 bits of pre-length of its own, plus 12-bits of pre-length encoded in the 16-bit insn embedded in it.  then, it may go for 44 16-bit insns before the queue runs out.

the sequence can be extended indefinitely with one pre-length compressed insn at every 24 16-bit insns.  thus a compressed ratio limit compressed : uncompressed is (24 * 2) : (23 * 4) = 52,17%.  in this scenario, each 16-bit compressed insn payload takes up 16.6956 bits (24 insns to encode 23) in total, an overhead of 4.34%.

Contrast with our attempt 1, that, in the same limit regime, uses 16 bits per 14-bit payload, an overhead of 14.28%

a 32-bit insn that fits a 16-bit encoding is a break-even point, and if there's any pair of 16-bit insns within the 11 subsequent insns, it is advantageous to go into pre-length encoding, and it remains advantageous if there are any other such pairs every 13 insns

contrast with the attempt 1: a mode-switch is profitable only if it fits in 10-bits; if it fits only in 16-bit mode, it's break-even, and advantageous only if there are other insns fitting 16-bit mode every other insn


the 1-bit-per-insn (vs -per-word) pre-length encoding is not quite as compact, but it does away with the need for pairing up 16-bit insns, even if with pre-length encoding ones, at the expense of misaligned 32-bit insns.  we can go for 22 compressed insns before the initial max 22 bits run out.

the sequence can be extended indefinitely with one pre-length compressed insn at every 12 16-bit insns.  thus a lower bound for the compression ratio is (12 * 2) : (11 * 4) = 54,54%.  in this scenario, each 16-bit compressed insn payload takes up 17.4545 bits (12 insns to encode 11) in total, an overhead of 9.09%.

a 32-bit insn that fits a 16-bit encoding is a break-even point, and if there's any 16-bit insn within the 10 subsequent insns, it is advantageous to go into pre-length encoding, and it remains advantageous if there are any other such insns every 12 insns

now, I said early on that these encoding schemes wouldn't work if we need to support 64-bit insns, but they actually do, they just don't make the alignment quite as simple, since any of the insns marked as traditionally-encoded might actually take up 2 words, instead of just one.  it defeats part of the purpose of the pre-length encoding, but maybe not so much as to render it useless

thoughts?

Comment 114 Jacob Lifshay 2020-11-30 08:33:17 GMT

(In reply to Alexandre Oliva from comment #113)
> thoughts?

How will 48-bit instructions fit into that scheme? It seems to me as though we need to support align-2 16/32/48/64-bit instructions instead of requiring align-4 for 32/64-bit in order for 48-bit instructions to be better than full 64-bit instructions.

Comment 115 Jacob Lifshay 2020-11-30 08:49:48 GMT

(In reply to Alexandre Oliva from comment #113)
> whether such bits, when present in an insn, queue up after bits that are
> already there, or reset the queue and start affecting the next insn, would
> have to be worked out.  there are upsides and downsides both ways: one
> favors keeping a long queue so that alignments for more insns per cycle can
> be precomputed; the other favors locality.  The latter, combined with a rule
> that only the last insn in the queue may contain additional decoding bits. 
> If the queue runs out, we return to traditional mode; ditto when we take a
> branch.

If we do need a separate instruction for filling up the 16-bit queue, I'd argue for it to have absolutely no other effect (no setting status flags, etc.), that way we can completely delete it at the decode phase, allowing us to issue more instructions per cycle. This doesn't mean we can't have all 16-bit instructions have a bit reserved for adding an entry to the 16-bit queue, since those are otherwise full instructions that aren't a waste of issue space.

Comment 116 Alexandre Oliva 2020-11-30 08:57:23 GMT

great question.  I've heard about them, but I don't know where they fit in.  I thought they would fit in yet another execution mode, so I was not concerned about them.

some possibilities:

- 1 bit per insn, telling the encoding mode, not necessarily the length.  not quite as useful for the monster muxer, but enough to enable the compact mode-switching possibilities

- 2 bits of pre-length encoding for each insn, so the 4 (known to me?) lengths can be represented

- use enough bits to cover equivalent lengths, even if not all bits cover insns.  for 48-bit, we'd need 1 bit for 16-bit and one for 32-bit.  which one comes first depends on the mode in which the 48-bit insn is encoded, but 48-bit insns would consume two bits from the queue.  ditto for 64-bit insns, that would encode 2 bits for 32-bit insns.  the muxer would have to peek at insns, but maybe not quite as much as it would have to without the helper bits.

Comment 117 Jacob Lifshay 2020-11-30 09:18:16 GMT

(In reply to Alexandre Oliva from comment #116)
> great question.  I've heard about them, but I don't know where they fit in. 
> I thought they would fit in yet another execution mode, so I was not
> concerned about them.
> 
> some possibilities:
> 
> - 1 bit per insn, telling the encoding mode, not necessarily the length. 
> not quite as useful for the monster muxer, but enough to enable the compact
> mode-switching possibilities

This is the approach I'd take, since the primary opcode already is used in PowerISA v3.1 to determine if an instruction is 32/64-bit. Add 48-bit to that, supporting 32/48/64-bit instructions in the PowerISA v3.1 compatible encoding (I named it Standard Mode), we only need the alternate encoding (I named it Compressed Mode) for 16-bit instructions.

> - 2 bits of pre-length encoding for each insn, so the 4 (known to me?)
> lengths can be represented

4 different lengths is correct afaik: 16/32/48/64-bits
> 
> - use enough bits to cover equivalent lengths, even if not all bits cover
> insns.  for 48-bit, we'd need 1 bit for 16-bit and one for 32-bit.  which
> one comes first depends on the mode in which the 48-bit insn is encoded, but
> 48-bit insns would consume two bits from the queue.  ditto for 64-bit insns,
> that would encode 2 bits for 32-bit insns.  the muxer would have to peek at
> insns, but maybe not quite as much as it would have to without the helper
> bits.

Comment 118 Luke Kenneth Casson Leighton 2020-11-30 11:46:29 GMT

(In reply to Jacob Lifshay from comment #111)

> >  Now, I assume it's not.  how
> > does that work, then?  I suppose I may have to know it, to estimate
> > correctly the compression of these instructions.
> 
> SVPrefix is supposed to 

does.  SVPrefix *does* provide an elwidth override.

> provide the element width for prefixed instructions
> since most instructions don't support 8/16-bit operations (no 8-bit add).
> Since load/store instructions *do* provide all of the required element
> widths (at least for integer load/store instructions, 

unfortunately - and this is why it's in the Compressed spec - there's simply
not enough room to have Compressed halfword and bytelevel LD/ST.

consequently i made the note, "if you want 8/16-bit LD/ST, use SV Prefix elwidth overrides".


> I don't think Power
> has f16 load/store, though I'm likely wrong), 

haven't seen it

> Currently, there are no defined instructions for compressed 16-bit
> instructions using SVPrefix,

that's incorrect... or... it is and it isn't.

SVPrefix is *specifically* designed so that it applies *uniformly* to *any* instruction with no "interference" between the two (i.e. it is a 1st-phase length/mode decode).

therefore, to say "there are no defined instructions for compressed 16-bit instructions using SVPrefix" is misleading.

*all* Compressed instructions may have SVPrefix applied to them... *WITHOUT* any "specific definition being applied".

that's the entire point of SVPrefix.  therefore, by *defining* a new Compressed instruction, **AUTOMATICALLY** it **IS** the case that a SV-Compressed version exists and **IS** defined... [without us actually explicitly defining one]

caveats to that:

1) certain instructions (trap, branch, nop) you simply can't vectorise.  ever.  adding an SVPrefix to "trap" makes absolutely no sense and is simply not possible.

2) in the past - multiple times, jacob and i point out every time that it needs to be a last resort - you have envisaged that certain brownfield encodings of SVPrefix when applied to certain encodings should be allocated *completely* different meanings.  either

   a) bits in the SVPrefix are set that change the meaning of the
      instruction (complete new decoder)
   b) bits in the *instruction* which, if prefixed, have a completely
      different meaning
   c) both of the above

  whilst we may in fact have to do that, it should be a route of last resort, because of the back-interference it causes between phase 1 (length/mode decode) and phase 2 (actual opcode decode)

3) we have not yet done the register allocation for Compressed, therefore
   it makes no sense to do the SVPrefix-Compressed register allocation

so:

in the sense (1) you were correct in that the exception (nonsensical) SVP-Compressed instructions have not been indentified

in the sense (2) you were correct in that no "last resort" completely-different-from-what-is-obvious-application of SVP to Compressed re-encodings have been defined... but only because if you had tried to suggest any i would have raised a massive red flag / alarm bell

in the sense (3) you are correct because we haven't got that far (register allocation)

however in the general sense, giving the impression that SVP-Compressed "does not exist because it requires every single opcode to be listed as defined", this is definitely 100% categorically false.

*every* Compressed instruction inherently and automatically and implicitly *is* SVPrefix-able [except nonsensical caveats such as branch] because that's how SVPrefix works.

Comment 119 Luke Kenneth Casson Leighton 2020-11-30 11:52:32 GMT

(In reply to Jacob Lifshay from comment #117)
> (In reply to Alexandre Oliva from comment #116)

> > - 1 bit per insn, telling the encoding mode, not necessarily the length. 
> > not quite as useful for the monster muxer, but enough to enable the compact
> > mode-switching possibilities
> 
> This is the approach I'd take, since the primary opcode already 

....

jacob given the massive confusion in alexandre's mind can i recommend
*not* discussing alternative speculative Compressed encoding schemes
until it is absolutely clear for alexandre, and he's managed to implement
the objdump analyser?

bringing in alternative designs at this point when we do not yet know
if he fully understands the current one will result in taking up time.

if he fully understood the alternative speculative encoding already
then it would be a useful comparative data-point.

however the only person who fully understands that alternative speculative
encoding right now is you.

consequently you first have to document it (fully) and then explain it to everybody, make sure that everybody fully understands it...

... *and then* use it as a comparative data point to help alexandre understand the v1 proposal.

Comment 120 Luke Kenneth Casson Leighton 2020-11-30 12:31:47 GMT

(In reply to Alexandre Oliva from comment #109)
> /me blinks, blinded by the light that just hit him
> 
> yeah, now it all makes sense.  I found it so hard to believe that the 16-bit
> immediate insns were limited to such narrow immediate fields that I managed
> to twist 16bit/imm into 16+16.  I wanted it so bad that I made it so. 
> *blush*  wow, what a mess I have made!  apologies for being so dense, and
> for the noise.

no problem.

the important thing here is to appreciate that we have to do a 2-phase
analysis (1: length/mode, 2: full-decode), and that, because it is hardware,
the 2nd phase absolutely cannot go backwards in time and interfere with the 1st.

the 1st phase *has* to determine length/mode in the absolute most simplest, most basic, most gate-minimalist fashion possible.

the 2nd phase is where the length/mode must be considered, categorically, to be "fixed", and because it is fixed, you can now take longer (and use more gates), have much more complex decoding to get at fields, immediates, register numbers, sub-encodings etc. etc. etc. etc.... but all at a **FIXED** instruction width.


> I (think I) get it now.  compressed mode is 16-bit to enter (called 10-bit,
> because 6 bits are used by the opcode that tells us it's a 16-bit insn we're
> lookng at, rather than the usual 32-bit),

correct.  this is the entry-point to the FSM.

> and 16-bit only while in compressed mode. 

correct.

> now, some of the 16-bit insns may only switch back to
> 32-bit mode permanently, while others can also switch for one insn

correct.  otherwise, think about it: how could you get back to Compressed
mode? the only other way would be via the 10-bit encoding, which is of
course wasteful.

> (N is not repurposed), and some can't switch out at all (M is repurposed)

this is confusing me, i can't understand what you mean from the use of the word "repurposed".  the two bits N and M are relevant to the 16-bit FSM.  they're listed in "Opcodes exploration (attempt 1)" section.


> this means the logic in the compression estimator I wrote on Saturday is all
> wrong.  dang!

:)

> 
> ok, I went through it all again, under this new understanding, and though
> it's almost as if I was looking at a fully rewritten document, it all makes
> a new sense to me ;-)

:)

> new questions pop up:
> 
> - what's the use of N, when M is not there, as in the "16-bit mode only"
> blocks of encodings under 2.1.7 Logical and 2.1.8 Floating Point insns?

i didn't want to have to duplicate the contents of the 10-bit encodings in a completely separate table, because (a) it's a lot more work and (b) the 16-bit is an extension of 10-bit anyway.

so remember: there are 4 possibilities for N and M:

* N=0, M=0
* N=1, M=0
* N=0, M=1
* N=1, M=1

however there is also 10-bit mode and 16-bit mode, so it is actually:

             | bit 0 | bits 1-4 | bits 6-14       bit 15 | 
* 10-bit mode (EXTNNN=Compressed)  10-bit-encoding   M=0
* 10-bit mode (EXTNNN=Compressed)  10-bit-encoding   M=1
* 16-bit mode N=0 ...... ..... ...... .... .....  .. M=0
* 16-bit mode N=1 ...... ..... ...... .... .....  .. M=0
* 16-bit mode N=0 ...... ..... ...... .... .....  .. M=1
* 16-bit mode N=1 ...... ..... ...... .... .....  .. M=1

and the whole reason why the 10-bit Compressed exists is because: N (bit 0)
is *not accessible* when it's *also* used as part of the EXTNNN v3.0B Major
Opcode.


now, having explained all that: what you've missed is: take a look at
bit **ONE** in the 2 sections you refer to.

you see how insn[1] is zero in the first table, and one in the second?
this is therefore a sub-sub-encoding:

Cmaj.minor.sub

* Cmaj: bits 5-7
* minor: bit 8
* sub-sub-encoding: bit 1

but... look again at 10-bit mode.

             | bit 0 | bits 1-4 | bits 6-14       bit 15 | 
* 10-bit mode (EXTNNN=Compressed)  10-bit-encoding   M=0
* 10-bit mode (EXTNNN=Compressed)  10-bit-encoding   M=1

how can we use bit 1 to mean anything when entering the Compressed FSM?

we can't, can we?

therefore, bit 1 is ***ONLY*** relevant (useable) when we are **ALREADY** in the Compressed (16-bit) FSM Mode.

yes?

in other words, you **CANNOT** get access to fdiv, fabs, fmr. or xor, extsb etc. *UNTIL* you are in the 16-bit FSM, aka "16-bit mode".

and you get access to those instructions - only in 16-bit mode - by checking bit 1.  which you CANNOT use on entering the FSM (10-bit mode) because bit 1 is part of the EXTNNN identifier.


> - "SV Prefix over-rides help provide alternative bitwidths for LD/ST" -> I
> had previously assumed this was an instruction prefix,

it is - and it's a 1st-phase (length/mode) decoder.

> a little bit like the
> extend-next-insn opcode I'd suggested before.

it is... but it's at the 1st level (1st phase decode).  what you proposed was to massively complexify and intertwine 1st and 2nd phase decoding (intertwine length/mode decoding with actual instruction decoding) which would have required going back in time to achieve efficiently.

>  Now, I assume it's not.  

this would be an incorrect assumption.  it's called "SV Prefix" because it's a prefix.


> how does that work, then? 

i strongly recommend leaving that question for now and/or asking under the SVPrefix bugreport.  this bugreport is for Compressed, not SV-Prefix-Compressed.


> I suppose I may have to know it, to estimate
> correctly the compression of these instructions.

no.  the current context is: Compressed without SVPrefix.  Scalar (non-vectorised) Compressed.

SVPrefix is completely out of scope for the purposes of the current discussion (other than as an incidental / ancillary / future work).

let's get through Compressed first.  please completely disregard SVPrefix (Vectorisation) for now.

Comment 121 Luke Kenneth Casson Leighton 2020-11-30 12:50:45 GMT

(In reply to Alexandre Oliva from comment #110)
> some more thinking about the massive alignment muxer, and opcode formats
> 
> we use 6 bits, the primary opcode, to tell whether we're looking at a
> regular 32-bit insn, or a 10-bit insn.

actually, it's 6 bits to indicate a regular 32-bit instruction
however for entering 16-bit mode (which includes the 10bit variant)
it requires *two* Major Opcodes.  examples:

EXT000 (6 bits)
EXT001 (another 6 bits)

(please don't comment on how those are already used by v3.0B: they're given for
illustrative purposes only)

the MSBs of those 2 encodings are the same (0b00000).  the LSB is different.
therefore, EXT000 selects **ONE HALF** of the 10bit encoding space, whilst
EXT001 selects the **SECOND** half.

this is explained at the beginning, in the introduction.  search
for this text:

    The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

|0 1 2 3 4 5 6 7 8 9 a b c d e f|
|major op..0| LO Half C space   |
|major op..1| HI Half C space   |
|N N N N N|<--11 bits C space-->|



> since any presumed 32-bit insn may turn out to be one of these 10-bit insns,
> we have to be prepared to have another 32-bit (or 10-bit) insn at the next
> half-word, or a 16-bit insn, depending on yet another bit

errr.... no.  emphatically no.

the FSM is very simple, and very clear.  i outlined it in comment #108 and
it most definitely and categorically does not need "yet another bit".

* 32 bit mode instructions may ONLY be followed by 10-bit ones
* 10-bit ones may be followed ONLY by either:
  - 16-bit ones (M=1)
  - 32-bit ones (M=0)
* 16-bit ones may be followed ONLY by:
  - 16-bit ones
  - 32-bit ones

there is no need for "more bits".


> wouldn't it make sense to move the M bit so that it overlaps with the
> primary opcode, so as to (potentially?) simplify the muxer's alignment
> decision, by reducing the number of bits it has to look at?

this is a common mistake made by software engineers, who think in terms of instructions that cannot perform arbitrary bit-level manipulation.

hardware is dead simple: you just route the wire to put it into the "thing".

done.

therefore the positioning of the bits - 5, 8, 3, 23, 16 - is completely irrelevant.

bit M does not need to move position just for the convenience of software engineers.

in hardware, whether bit M is at position 5 or position 15 is utterly irrelevant: in both cases it's just "a wire".


> indeed, this would enable us to drop entirely the requirement for a pair of
> opcodes that differ only in the least-significant bit of the primary opcode.
> we could take any pair of primary opcodes that differed by a single bit, and
> use that single bit as the M bit.

i deleted the rest and am ignoring this paragraph for now until you've understood the difference between software engineering and hardware engineering.  bit-positions in hardware are simply "conventions" and the routing is done by wires.

bit-positions in *software* are relevant because 100% of practical real-world general-purposes processors are capable of efficient arbitrary bit-level manipulation.

if after that misunderstanding is cleared up there is something that is relevant please do ask.

Comment 122 Luke Kenneth Casson Leighton 2020-11-30 12:55:01 GMT

(In reply to Alexandre Oliva from comment #110)

> some more thinking about the massive alignment muxer, and opcode formats

remember: that alignment muxer is at PHASE ONE.  length/mode decoding
and length/mode decoding only.

remember always to separate that out in your head from phase two which
is where you further analyse major[.minor][.sub] fields to work out which operation is which.

[edit: and this 2nd phase is done safe in the knowledge that, at that time, the length and mode HAVE been determined and fixed, and WILL NOT change]

Comment 123 Luke Kenneth Casson Leighton 2020-11-30 12:57:52 GMT

(In reply to Jacob Lifshay from comment #114)
> (In reply to Alexandre Oliva from comment #113)
> > thoughts?
> 
> How will 48-bit instructions fit into that scheme?

this bugreport is for the discussion of Compressed 16-bit encoding.

>  It seems to me as though
> we need to support align-2 16/32/48/64-bit instructions instead of requiring
> align-4 for 32/64-bit in order for 48-bit instructions to be better than
> full 64-bit instructions.

please can you move anything not related to 16-bit Compressed encodings to
the relevant bugreport.  we need to be quite strict about this because these are complex discussions that will get extremely long otherwise, and we won't be able to find anything.

Comment 124 Luke Kenneth Casson Leighton 2020-11-30 13:39:25 GMT

(In reply to Alexandre Oliva from comment #113)
> actually, this thinking further of the detection of 10-bit insns,
> mode-switching or not, led me to an interesting realization
> 
> our reasoning seems to be significantly affected from a perception
> distortion out of the notion that one mode has 32-bit insns, while the other
> has 16-bit insns

not a distortion: a strict decision based on the amount of gates it takes.

> well, that's not true.  an instruction appearing while in uncompressed mode
> requires us to look at 6 of its bits to tell whether the next instruction is
> at the next half-word, or at the next word,

this is incorrect.  the FSM is pretty clear.  the transition *in* to the
16-bit length/mode part of the FSM uses those 6 bits.

this is the **ONLY** time when the 6 bits are read.

if you look again closely at the FSM you will find that **AT NO TIME**,
when the FSM has detected "length==16" for the current instruction, does
the examination of bits beyond those 16 occur.

this is EXACTLY AND PRECISELY why the FSM was designed the way it is designed, so that once the length/mode is known, the length/mode is KNOWN.

is there anywhere in the FSM (comment #108) that says:

   if the mode is 16 bit, actually go look at the future bits 17 thru 31
   and check to completely change the length and meaning to 32, 48 or 64 bit?

is there anywhere in the FSM that says:

   if the mode is 16 bit, actually go read bits 17 thru 31 to see if there's
   another EXTNNN?

no, they do not, do they?


> and at another bit on top of
> them (unless we move M as I suggested) 

which is based, as best i can tell, on a false misapprehension of the difference between hardware and software.


> now let's change the perspective a bit.  instead of starting from the
> premise that uncompressed insns are 32-bit and compressed ones are 16-bit,
> let's start from a premise that is just as invalid, namely, that insns in
> either mode take 16-bits

no.  please can we not do that whilst we are in the middle of trying to make sure that you fully understand the current (v1) encoding scheme?


> in uncompressed mode, we look at the primary opcode to tell whether it
> really is a 16-bit (or 10-bit, as we call them) instruction, or whether it
> extends over 32, or even 64 bits!

no, please.  please do not discuss variable-length encodings in the Compressed (16-bit) bugreport.

i don't *mind* discussion and speculation on alternative variable-length encodings... just not under this bugreport which for some reason has become far more complex than it actually is.


> but in compressed mode, we can't have even one bit or opcode to extend to
> tell the insn isn't just 16-bits long, but rather 32, because...  somehow
> the reasons that make it acceptable in one mode don't apply to the other,
> even if we had to look at the very same 6 bits plus mode?!?

this tells me that it still has not yet sunk in that the two phases (1: length/mode detection and 2: instruction decoding) are still not yet completely separated in your mind.

i keep emphasising how critically important it is that you separate those two phases.

once you have done so, and accepted it, things will become clearer for you.


> conversely, if looking at all those bits is so bad that it's unacceptable in
> compressed mode, maybe it's just as bad in uncompressed mode and we should
> find another way to go about it, one that doesn't take up two of the few
> major opcodes.

i've already identified 16 major opcodes for re-use.  there are not "few", at all.

however, again: this is not for discussion under the Compressed encoding bugreport.


below is an alternative Compressed encoding so is relevant for this bugreport, i'm happy to go through it (as best i can)

> consider, for inspiration, something that won't quite work because of the
> 64-bit instructions:
> 
> - 32-bit words may contain either a single 32-bit insn, or a pair of 16-bit
> insns
> 
> - there's one 32-bit opcode, primary or extended, that encompasses:
> 
> -- 6+ bits to make it recognizable
> 
> -- 16-bits for a 16-bit insn
> 
> -- up to 10 bits that guide the decoding of *statically* subsequent insns

right.  this was - i believe - one of the very early ideas (18 months ago) that was discarded.

the reason is because the fixed overhead of using the 10 bits to "enter" 16-bit mode turns out to be very large when you have a huge quantity of switching (mixed standard v3.0B and Compressed instructions).

consider how and whether the 10 bits could be used to efficiently encode this scenario:

    Compressed v3.0B Compressed v3.0B Compressed v3.0B Compressed

you'll almost certainly find that the space required to do so is *GREATER* than that of just leaving them as v3.0B 32-bit instructions.


> - there's also one 16-bit opcode that encodes:
> 
> -- 4+ bits to make it recognizable
> 
> -- up to 12 bits to guide the decoding of statically subsequent insns

these types of encodings were investigated (2 years ago) for something called "VBLOCK".  they're... extremely complex and took about 8 months to design.

i would greatly prefer that we tackle the simpler encodings first.


> these bits are similar in purpose to our current M and N, but they're packed
> into an insn like the above, so that the decoder can readily tell, even
> ahead of time, where insns are, and how to decode them

it's basically a form of hardware compression, and the fixed overhead not only penalises short instruction runs, it also produces a stunningly-complex FSM that requires a huge amount of thought and planning.

unfortunately we don't have time for that.  we did, 18+ months ago: we do not, now.

so unfortunately, such encoding schemes need to be eliminated from discussion right now and re-visited in 12-18 months time, once we have 100% completed NLnet Projects and can apply for additional funding.

Comment 125 Luke Kenneth Casson Leighton 2020-11-30 13:48:52 GMT

(In reply to Jacob Lifshay from comment #115)

> If we do need a separate instruction for filling up the 16-bit queue, I'd
> argue for it to have absolutely no other effect (no setting status flags,
> etc.), that way we can completely delete it at the decode phase, allowing us

these kinds of FSMs need a huge amount of time to think through.  as in: literally weeks to months.  i've been there (with VBLOCK).

can we please focus the discussion on understanding the v1 Compressed encoding.

Comment 126 Luke Kenneth Casson Leighton 2020-11-30 13:51:03 GMT

(In reply to Alexandre Oliva from comment #116)

> insns.  for 48-bit, ...

this is the 16-bit Compressed bugreport.  48-bit encodings are appropriate to discuss under the 16/32/48/64-bit encoding bugreport.

please can we keep the discussion under the Compressed bugreport to
16-bit Compressed, specifically right now the v1 encoding and making sure that it is understood.

Comment 127 Luke Kenneth Casson Leighton 2020-11-30 16:04:54 GMT

deleted: S/N too high (not relevant). see history.

Comment 128 Cole Poirier 2020-11-30 16:16:10 GMT

deleted: S/N too high (not relevant). see history.

Comment 129 Luke Kenneth Casson Leighton 2020-11-30 17:47:36 GMT

deleted: S/N too high (not relevant). see history.

Comment 130 Luke Kenneth Casson Leighton 2020-11-30 19:13:32 GMT

alexander: ok, i added the FSM to the Appendix, along with a clear and unequivocal separation between Phase 1 and Phase 2, explicitly stating precisely and clearly what their roles and responsibilities are

https://libre-soc.org/openpower/sv/16_bit_compressed/

again to reiterate, i cannot emphasise enough how critically important it is that you understand and accept the distinction between these two phases.

note in particular: ONLY phase 2 looks further at N/M to ascertain if the 16 bit instruction is the 16bit.immediate variant.

this is *NOT* relevant - at all - to Phase 1 because Phase 1's sole exclusive task, to the exclusion of all else, is to determine length+mode, and to communicate that to Phase 2, through a pipeline, in a fowarding-only fashion.

Comment 131 Jacob Lifshay 2020-11-30 19:38:24 GMT

(In reply to Luke Kenneth Casson Leighton from comment #126)
> (In reply to Alexandre Oliva from comment #116)
> 
> > insns.  for 48-bit, ...
> 
> this is the 16-bit Compressed bugreport.  48-bit encodings are appropriate
> to discuss under the 16/32/48/64-bit encoding bugreport.
> 
> please can we keep the discussion under the Compressed bugreport to
> 16-bit Compressed, specifically right now the v1 encoding and making sure
> that it is understood.

48-bit is important to 16-bit compressed because it is another instruction encoding that mostly requires 32/64-bit instructions to be align-2 rather than align-4. This is relevant because one of the proposed compressed instruction schemes would have required 32-bit instructions to be align-4.

Comment 132 Jacob Lifshay 2020-11-30 19:46:57 GMT

(In reply to Luke Kenneth Casson Leighton from comment #130)
> alexander: ok, i added the FSM to the Appendix, along with a clear and
> unequivocal separation between Phase 1 and Phase 2, explicitly stating
> precisely and clearly what their roles and responsibilities are
> 
> https://libre-soc.org/openpower/sv/16_bit_compressed/
> 
> again to reiterate, i cannot emphasise enough how critically important it is
> that you understand and accept the distinction between these two phases.
> 
> note in particular: ONLY phase 2 looks further at N/M to ascertain if the 16
> bit instruction is the 16bit.immediate variant.
> 
> this is *NOT* relevant - at all - to Phase 1 because Phase 1's sole
> exclusive task, to the exclusion of all else, is to determine length+mode,
> and to communicate that to Phase 2, through a pipeline, in a fowarding-only
> fashion.

Note that the efficient decoding example actually splits it into 3 phases (only first 2 included in linked example):
https://libre-soc.org/openpower/sv/16_bit_compressed/decoding/
phase 1:
determine instruction sizes/next encoding modes for all possible current encoding modes at each 16-bit boundary.
phase 2:
do an efficient parallel prefix-sum algorithm O(log N) to determine the encoding mode, starting offset, and instruction size of all instructions in the decode window.
phase 3:
pick the first few instructions and decode them fully.

Comment 133 Alexandre Oliva 2020-11-30 20:13:02 GMT

> the 1st phase *has* to determine length/mode in the absolute most simplest, most basic, most gate-minimalist fashion possible.

I get it.  That's what I was trying to improve on.  But, not knowing I have hardware-design training, someone jumped to the conclusion that it was for software rather than hardware engineering concerns.

> remember always to separate that out in your head from phase two which
is where you further analyse major[.minor][.sub] fields to work out which operation is which.

you and I wish it was so, but with the current design, it's far from it.

not when you think of a single insn, but when you think of the multi-issue muxer, that's quite dramatic, and that's what my observations and suggestions from last night had to do with.

consider, setting aside any 64-bit insns from 3.1, that, to decide the length of the current insn, we have to take into account the current mode (let's call it mode0), and 5 out of the 6 opcode bits (to decide whether it's 10- or 32-bit, in case mode0 is uncompressed) from the half-word at the insn address, say addr

now, to decide the mode of the next insn, say mode1, you have to look at all of these, plus M in case it's a 10-bit insn, and also a switch-back-to-compressed-after bit (*) carried over from insn -1, say SBTCA0 for after 0; whereas when mode0 is compressed, you have to look at M and N, and Cmaj.m to tell whether they have their usual meaning, which adds another 5 bits to the inputs to this decision (not 6 because N overlaps with the major opcode)

my proposal of moving M and Cmaj.m in the compressed encoding enables this decision by the muxer to be made with only 7 wires flowing in, rather than 12.  now, my hardware design training may be rusty and dated, and I may even use wrong terms because I'm translating from my native language, in which I was trained, but I'm pretty sure it still holds that fewer and shorter wires usually save in power consumption and routing complexity, even if the gates remain exactly the same

(*) I haven't seen anything to enable the inspection or preservation of this bit across an interrupt; if the interrupt triggers right after a compressed insn that has NM = 0b10, it is to return to an uncompressed mode insn, but then switch back to compressed mode for the next one.  that has to be represented somewhere that is architecturally visible, saved across context switches, and in signal stack frames.  just like the pre-length queue I proposed last night, which hit me just as I rested my head on the pillow.  lucky for me I don't have to sort that one out any more ;-)

now, to decide the length of insn 1, len1, you need mode1, and 5 bits selected from addr+2 or addr+4, depending on len0.

to feed the decision of mode2, the selection of bits from addr+2 or addr+4 has to extend over an additional 5 bits, as shown for mode1 above.  here, my proposed shift of 16-bit insns saves nearly half of the selectors

len2, similarly, takes mode2 into account, plus 5 opcode bits selected from the half-word selected from among addr+4, addr+6, and addr+8, depending on len0 and len1

finally, the decision of mode3 takes the same 5 bits selected above, plus another 5 bits from the same half-word, unless my proposed rearrangement is taken to reduce the number of selected bits

is the intent clear now?  isn't this hardware engineering reasoning sensible and sound?

Comment 134 Alexandre Oliva 2020-11-30 20:35:20 GMT

>> - what's the use of N, when M is not there, as in the "16-bit mode only"
>> blocks of encodings under 2.1.7 Logical and 2.1.8 Floating Point insns?

> i didn't want to have to duplicate the contents

err, I can't see how your answer relates with the question I intended to ask, it looks like you're answering something unrelated.  I'll rephrase.  If you had got that already, and your answer actually covers that point, I'm afraid I'll need it rephrased too.

let's look at one specific example:

| N | 1 |  RT | | 100.0 | RB  | 0 0 0 | 0 | extsb

the bit where M would usually be is fixed at 0

there's an N bit there, that can presumably be set to 0 or 1

N is a bit that has so far only been assigned a meaning in combination with M

(unlike M, that has specified meanings in the absence of an N, e.g. in 10-bit insns, and in 16-bit st)

what does N stand for, in the absence of an M in the same insn?

Comment 135 Luke Kenneth Casson Leighton 2020-11-30 20:36:23 GMT

alexandre, whilst it is beginning of day for you, cole and jacob, i have been on this issue non-stop for 10 hours straight.

i need a rest so will answer briefly and tersely on key points only, in an effort to allow you, with some thought (before more words), to understand.

(In reply to Alexandre Oliva from comment #133)

> not when you think of a single insn, but when you think of the multi-issue
> muxer, that's quite dramatic, and that's what my observations and
> suggestions from last night had to do with.

the multi-issue muxer is part of Phase 1 (actually, the connector/router between 1 and 2) which Jacob, in comment #132 just before you replied here, successfully implements in an O(log N) algorithm the "indices" that get passed to the muxer.

without going into full detail right now, which i will probably best do in video form, tomorrow, after resting, the issue that you highlight is a non-issue.

the FSM is fine: can i ask you if you could kindly give some thought, after reviewing jacob's multi-issue decoder in comment #132, "why did he say that?" and after resting somewhat i will start again tomorrow with some diagrams.

Comment 136 Luke Kenneth Casson Leighton 2020-11-30 20:45:04 GMT

(In reply to Alexandre Oliva from comment #134)

> what does N stand for, in the absence of an M in the same insn?

i am frazxled.  writing for much too long.

  brief response: read the newly-added Apprndix "Decode phase 2"

N does not exist in 10bit mode because it shares the same space as the EXTNNN.

this is why there are only 2 lines in the table on entry to the FSM (10bit mode)

| 0 | 1234 | 567  8 | 9abcde | f | explanation
| - | ---- | ------ | ------ | - | -----------
| EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
| EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit

however for 16 bit mode N *can* have meaning, giving 4 encoding possibilities


| 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
| 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
| 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

in thetables, when you see a column with N in bit 0, and M in bit 15, it means, "this is part 1 of 2nd Stage Decoder".

you will need to look at the pseudocode i wrote which tests combinations of N and M.

Comment 137 Alexandre Oliva 2020-11-30 21:25:19 GMT

luke, please get back to it when you're rested, so that *you* understand

I had not read jakob's 3-phase decoding plan when I wrote today's working out of the actual bit dependencies of len and mode computation for up to 4 insns, but none of that changes the fact that the next-insn-mode computation needs to take almost twice as many bits into account with the current encoding, vs the encoding change I proposed, or even that the current-insn-len computation needs to take 5 out of the 6 primary opcode bits into account

what that arrangement does is to avoid having to work out the dependencies in full, and avoid cascading dependencies.  it trades working out and optimizing the actual dependencies, possibly simplified by O(n) cascading, for O(2^n) parallel-precomputing gates, followed by O(log n) selection and aggregation.

given the exponential growth of the precomputation, worrying about its complexity makes even more sense than in a linear-growth case.  OTOH, the separate selection and aggregation removes a lot of the supposed complexity of handling different-sized insn formats in compressed mode.

so the question becomes, is there a way to enable extended formats in compressed mode, that would make it a lot more useful, without increasing the complexity of the exponential-growth parallel throw-away computation?

it turns out that there is.  one of the outputs of phase 1, for each half word, is a but computed from 5 out of 6 of the opcode bits.  it's used to recognize 10-bit insns.

what if the same bits expressed, in compressed insns, an extended insn?  for phase 1, this is the same computation; for phase 2, it will just change one gate that combines mode and opcode from say and to xor.

Comment 138 Jacob Lifshay 2020-11-30 21:31:29 GMT

(In reply to Alexandre Oliva from comment #137)
> what that arrangement does is to avoid having to work out the dependencies
> in full, and avoid cascading dependencies.  it trades working out and
> optimizing the actual dependencies, possibly simplified by O(n) cascading,
> for O(2^n) parallel-precomputing gates, followed by O(log n) selection and
> aggregation.

Please read the code again, you will find that the total gate count is O(N + N * log N) == O(N * log N), with O(log N) circuit depth.

Comment 139 Jacob Lifshay 2020-11-30 21:35:01 GMT

(In reply to Jacob Lifshay from comment #138)
> (In reply to Alexandre Oliva from comment #137)
> > what that arrangement does is to avoid having to work out the dependencies
> > in full, and avoid cascading dependencies.  it trades working out and
> > optimizing the actual dependencies, possibly simplified by O(n) cascading,
> > for O(2^n) parallel-precomputing gates, followed by O(log n) selection and
> > aggregation.
> 
> Please read the code again, you will find that the total gate count is O(N +
> N * log N) == O(N * log N), with O(log N) circuit depth.

phase 1: predecode O(N) gates, O(1) depth

phase 2: prefix-sum O(N * log N) gates, O(log N) depth

phase 3: not included in example but should be O(N * log N) gate count, O(log N) depth

Comment 140 Alexandre Oliva 2020-11-30 21:38:02 GMT

luke, please don't feel pressured to answer right away.  take some rest first, then read it.  we're clearly miscommunicating, and your being at the end of your day probably contributes to that.

> N does not exist in 10bit mode

irrelevant, that's a 16-bit only encoding, not available in 10-bit insns

> however for 16 bit mode N *can* have meaning, giving 4 encoding possibilities

point is N has only been assigned meanings when bit 15 holds M

this is not the case of the extsb insn I quoted:

| N | 1 |  RT | | 100.0 | RB  | 0 0 0 | 0 | extsb

see the N there at bit 0?  I do.

see M there at bit 15?  I don't.  I see that fixed at 0.

nothing was specified as to the meaning of N in an insn that doesn't have an M at bit 15.

is that zero at bit 15 supposed to be an M?  better write M instead of 0 if so, as elsewhere.

or is that fixed as zero, as a sub-opcode, but it's regardless supposed to be interpreted as an M when combined with N?  if so, better write M=0

or is that N a don't care?  if so, better write that down, instead of naming it N, because N without M is meaningless AFAICT

or does that N without M mean something else, that has so far not been specified?

Comment 141 Alexandre Oliva 2020-11-30 22:09:42 GMT

jacob,

sorry, I misspelled your name before.  I've worked closely with a jakub for some 15 years.

I think we're talking about different Ns, but I think I also made a mistake.

thinking of insns issued per cycle, I was reasoning in terms of possibilities of starting points for insn n: for each insn, the number of possibilities (nearly) doubles, I reasoned.  since the max insn length is a given, and a constant, say M, each increment in n adds at most M half-words to look at in phase 1, so it is O(n) half-words to precompute indeed

it's still a lot of throw-away computation, but it's indeed not even close to as bad as I misconcluded.  thanks for the correction.

I don't see that it invalidates the reasoning, though.

Comment 142 Jacob Lifshay 2020-11-30 22:34:42 GMT

(In reply to Alexandre Oliva from comment #141)
> jacob,
> 
> sorry, I misspelled your name before.  I've worked closely with a jakub for
> some 15 years.

I don't mind too much.

> I think we're talking about different Ns, but I think I also made a mistake.

In my statements N is the maximum number of instruction bytes that can be decoded per cycle, or alternatively, N is the maximum number of instructions that can be decoded per cycle.

Comment 143 Luke Kenneth Casson Leighton 2020-11-30 23:31:45 GMT

(In reply to Alexandre Oliva from comment #140)

feeling better, had some food.

> this is not the case of the extsb insn I quoted:
> 
> | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | 0 | extsb
> 
> see the N there at bit 0?  I do.
> 
> see M there at bit 15?  I don't.  I see that fixed at 0.

ok.  i realised with some thought tgat i think i have a way to explain it.

1)

when you see an N or an M in any given row it means, "ignore this bit completely abd utterly,  because the Phase 1 decoder algorithm has already used it (N or M) to determine the length+mode".


2)

when you see a 1 or a 0 in the N or M positions, it means, "as part of the PHASE TWO decoder process, please take these bits into account as part of the **OPCODE** identification process"

one example of such is:

* N=1 & M=1 & Cmaj.m!=0b001.1 => the opcodes are translated according to the "immediates" section.

now, i am looking at bit 15 of this

 | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | 0 | extsb

and am wondering, myself, why on earth it is set to 0 rather than M.

i will have to do a full examination of the entire encoding table to work out why, and get back to you.
sigh.

Comment 144 Alexandre Oliva 2020-11-30 23:47:59 GMT

thanks, I'll be very glad if the outcome of this question is that N without M is just a mistake in the tables.  There are occurrences of them in the 16-bit only tables under 2.1.6, 2.1.7 and 2.1.8.

If it wasn't a mistake, 2.1.1 would probably have to have extra lines in the encoding formats table covering these cases of M-less Ns, and that would make for additional complexity in both phase 1 and phase 2 of the decoder.

Comment 145 Alexandre Oliva 2020-12-01 00:01:07 GMT

looking at 4.4 now

it defines:

M: insn[0]
N: insn[15]

whereas 2.1.1 had them reversed:

| E01      | Cmaj.m | fld1     | fld2     | M | 10b
[...]
| N | immf | Cmaj.m | fld1     | fld2     | M | 16b

this makes the code snippets afterwards very confusing,
as they uses N from 10-bit insns that only have M,
and M and N with reversed codes.

Though each section defines M and N independently,
they're both individually self-consistent, but mutually confusing.
IMHO it would be desirable to make them mutually self-consistent too.

Comment 146 Luke Kenneth Casson Leighton 2020-12-01 00:10:04 GMT

(In reply to Alexandre Oliva from comment #144)
> thanks, I'll be very glad if the outcome of this question is that N without
> M is just a mistake in the tables.

yep looks that way to me.

>  There are occurrences of them in the
> 16-bit only tables under 2.1.6, 2.1.7 and 2.1.8.

yep, logical, fp, arith.  all cut/paste errors from... 8 days ago?

i've corrected them, please review?

> If it wasn't a mistake, 2.1.1 would probably have to have extra lines in the
> encoding formats table covering these cases of M-less Ns, and that would
> make for additional complexity in both phase 1 and phase 2 of the decoder.

yep, which i went to some trouble is distinctly undesirable in phase 1.

phase 2 not so bothered (ok if it gets hopelessly convoluted that's bad)

Comment 147 Alexandre Oliva 2020-12-01 00:11:35 GMT

in the pseudo code of phase 1, since we're talking about two separate insns, it would be desirable to make it explicit whether the Ms, Ns, and extc_id that appear there are taken from previ or from nexti.

I'm also missing code to handle previ.mode == v3.0B_then_16bit.  it's particularly important to specify how the conflict is resolved when nexti is a 10-bit insn: does the return-to-16-bit direction prevail over the M bit in nexti, or vice-versa, or is that conflict to be rejected?

Comment 148 Luke Kenneth Casson Leighton 2020-12-01 00:15:25 GMT

(In reply to Alexandre Oliva from comment #145)
> looking at 4.4 now
> 
> it defines:
> 
> M: insn[0]
> N: insn[15]
> 
> whereas 2.1.1 had them reversed:

good catch.  inverted and corrected in the pseudocode


> Though each section defines M and N independently,
> they're both individually self-consistent, but mutually confusing.
> IMHO it would be desirable to make them mutually self-consistent too.

not sure what you mean, can you illustrate?

(as long as, at this stage, that doesn't involve duplicating and splitting 16/10 into separate tables)

Comment 149 Alexandre Oliva 2020-12-01 00:16:49 GMT

aah, much better now.  I see M and N are no longer defined differently in 4.4,  thanks!  alas, the pseudocode still has them reversed.

Comment 150 Alexandre Oliva 2020-12-01 00:18:14 GMT

> not sure what you mean, can you illustrate?

I just meant that it would be desirable to assign them the same meanings in both subsections, as you just did (except for the pseudocode)

Comment 151 Luke Kenneth Casson Leighton 2020-12-01 00:24:55 GMT

(In reply to Alexandre Oliva from comment #147)
> in the pseudo code of phase 1, since we're talking about two separate insns,
> it would be desirable to make it explicit whether the Ms, Ns, and extc_id
> that appear there are taken from previ or from nexti.

agreed.

they are taken from the insn which is the current insn i.e. associated with nexti.

this can be made clear, i feel, by creating a variable "curi = nexti" and renaming insn to cur_insn


> I'm also missing code to handle previ.mode == v3.0B_then_16bit.  

yes.  i know i will get this wrong when not having actual code to execute with unit tests so left it for now

> it's
> particularly important to specify how the conflict is resolved when nexti is
> a 10-bit insn: does the return-to-16-bit direction prevail over the M bit in
> nexti, or vice-versa, or is that conflict to be rejected?

don't know.

one school of thought would say "raise an illegal exception", another would say "allow it", yet another would say, "use it as an escape-sequence of some kind".

i honestly have no idea which is best, except at the level of being not in favour of complicating the Phase 1 FSM.

on that alone, "allow it" wins because even an illegal exception is extra gates.

but... is that good? i don't know.  thoughts?

Comment 152 Jacob Lifshay 2020-12-01 00:25:40 GMT

(In reply to Alexandre Oliva from comment #147)
> in the pseudo code of phase 1, since we're talking about two separate insns,
> it would be desirable to make it explicit whether the Ms, Ns, and extc_id
> that appear there are taken from previ or from nexti.
> 
> I'm also missing code to handle previ.mode == v3.0B_then_16bit.  it's
> particularly important to specify how the conflict is resolved when nexti is
> a 10-bit insn: does the return-to-16-bit direction prevail over the M bit in
> nexti, or vice-versa, or is that conflict to be rejected?

In the example decoder code I wrote and in the compressed encoding demo:
Standard-for-1-then-Compressed Mode has priority over the swap to compressed bit. I think that option is better for consistency (since standard-then-compressed can be thought of as an override) though best is probably trapping on mismatch.

Comment 153 Luke Kenneth Casson Leighton 2020-12-01 00:30:06 GMT

(In reply to Luke Kenneth Casson Leighton from comment #151)

> on that alone, "allow it" wins because even an illegal exception is extra
> gates.
> 
> but... is that good? i don't know.  thoughts?

wait.... this is what jacob was talking about, if you have a branch to the 10bit insn it should be "ok".  but, you might have an insn before it that is 16bit.

is that right jacob?

sequence:

    16bit sets next as: v3.0b-then-16
loop:
    EXT001=Compressed i.e. 10bit
    16bit whatevver

here there is no problem to jump to "loop" point, or to pass *through* loop from the 16bit opcode *before* it.

my feeling is that this should be perfectly legal i.e. ignore the fact that the v3.0B instruction is a 10bit, just go with it.

Comment 154 Jacob Lifshay 2020-12-01 00:31:00 GMT

(In reply to Luke Kenneth Casson Leighton from comment #151)
> i honestly have no idea which is best, except at the level of being not in
> favour of complicating the Phase 1 FSM.
> 
> on that alone, "allow it" wins because even an illegal exception is extra
> gates.

phase 1 wouldn't require any extra gates in any case since it only cares about mode and length, not illegal/legal.

the illegal operation is decoded by the full decoder which is given the mode and the instruction bits, it can detect the instruction bits being 16-bit with next mode standard and the mode being standard-then-compressed. I think that would take 2-3 gates total.

Comment 155 Luke Kenneth Casson Leighton 2020-12-01 00:32:55 GMT

(In reply to Jacob Lifshay from comment #152)

> In the example decoder code I wrote and in the compressed encoding demo:
> Standard-for-1-then-Compressed Mode has priority over the swap to compressed
> bit. I think that option is better for consistency (since
> standard-then-compressed can be thought of as an override) though best is
> probably trapping on mismatch.

bit of crossover here with comment #153, can you review the reasoning and see if there are other (better) scenarios / instruction sequences?

Comment 156 Luke Kenneth Casson Leighton 2020-12-01 00:34:29 GMT

(In reply to Jacob Lifshay from comment #154)

> phase 1 wouldn't require any extra gates in any case since it only cares
> about mode and length, not illegal/legal.

oh duh of course.  well spotted, that keeps P1 simpler.

Comment 157 Luke Kenneth Casson Leighton 2020-12-01 00:37:49 GMT

(In reply to Alexandre Oliva from comment #149)
> aah, much better now.  I see M and N are no longer defined differently in
> 4.4,  thanks!  alas, the pseudocode still has them reversed.

arse.  sorted.

(In reply to Alexandre Oliva from comment #150)
> > not sure what you mean, can you illustrate?
> 
> I just meant that it would be desirable to assign them the same meanings in
> both subsections, as you just did (except for the pseudocode)

ahh ok.  what the heck i am doing working in computing when i keep getting things inverted, i honestly don't know :)

Comment 158 Jacob Lifshay 2020-12-01 00:40:03 GMT

(In reply to Luke Kenneth Casson Leighton from comment #153)
> my feeling is that this should be perfectly legal i.e. ignore the fact that
> the v3.0B instruction is a 10bit, just go with it.

yeah, that should be legal.

I think the issue was detecting the case:
1. 16-bit compressed next=standard-then-compressed
2. 10-bit next=standard
3. what mode are we now?

I'm proposing the mode at 3 should either be compressed or 2 should trap (trapping preferred).

The following case should be unquestionably legal:
1. 16-bit compressed next=standard-then-compressed
loop:
2. 10-bit next=compressed
3. 16-bit compressed next=...

Comment 159 Alexandre Oliva 2020-12-01 00:41:46 GMT

2.1.4 states branches always branch to a 4-byte aligned address

I thought it was also the case that branch targets also had to be in uncompressed mode, but that unconditional branches have M and even N suggest otherwise:

| N | offs2   | | 000.LK | offs!=0        | M | b, bl

as for conditional branches, do M and N apply whether or not the branch is taken?

| N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

what about when they're not there?  M=N=1 suggest immediate mode, that's always followed by another uncompressed insn:

| 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl

but the LK variants would suggest function calls, targeting entry points in uncompressed mode.  should they imply a switch to uncompressed mode at least when the branch is taken?

Comment 160 Alexandre Oliva 2020-12-01 00:45:37 GMT

comment 159, on branch insns, was unrelated to the ongoing discussion, FWIW

having the return-to-compressed override is bad IMHO

if one bit must prevail, it had better be one that gets the next insn decoded the same whether you got to the 10-bit in by falling-through or jumping to the 10-bit insn

this speaks against trapping, too, unless you want to trap because it's an unintended fallthrough.  yuck

Comment 161 Jacob Lifshay 2020-12-01 00:56:34 GMT

(In reply to Alexandre Oliva from comment #159)
> 2.1.4 states branches always branch to a 4-byte aligned address
> 
> I thought it was also the case that branch targets also had to be in
> uncompressed mode,

yes

> but that unconditional branches have M and even N suggest
> otherwise:
> 
> | N | offs2   | | 000.LK | offs!=0        | M | b, bl

I'd guess not much thought was given to that and N and M are there just for consistency, simplifying the decoder.

> as for conditional branches, do M and N apply whether or not the branch is
> taken?
> 
> | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

I'd say they apply only to untaken branches.

> what about when they're not there?  M=N=1 suggest immediate mode, that's
> always followed by another uncompressed insn:
> 
> | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
> 
> but the LK variants would suggest function calls, targeting entry points in
> uncompressed mode.  should they imply a switch to uncompressed mode at least
> when the branch is taken?

I would expect switching to uncompressed mode due to the branch. The return instruction would also switch to uncompressed mode.

Comment 162 Jacob Lifshay 2020-12-01 00:58:59 GMT

(In reply to Alexandre Oliva from comment #160)
> comment 159, on branch insns, was unrelated to the ongoing discussion, FWIW
> 
> having the return-to-compressed override is bad IMHO
> 
> if one bit must prevail, it had better be one that gets the next insn
> decoded the same whether you got to the 10-bit in by falling-through or
> jumping to the 10-bit insn
> 
> this speaks against trapping, too, unless you want to trap because it's an
> unintended fallthrough.  yuck

It would trap due to an illegal combination, running the illegal instruction handler.

Comment 163 Jacob Lifshay 2020-12-01 01:16:09 GMT

(In reply to Jacob Lifshay from comment #162)
> (In reply to Alexandre Oliva from comment #160)
> > this speaks against trapping, too, unless you want to trap because it's an
> > unintended fallthrough.  yuck
> 
> It would trap due to an illegal combination, running the illegal instruction
> handler.

(In reply to Jacob Lifshay from comment #158)
> I think the issue was detecting the case:
> 1. 16-bit compressed next=standard-then-compressed
> 2. 10-bit next=standard
> 3. what mode are we now?
> 
> I'm proposing the mode at 3 should either be compressed or 2 should trap
> (trapping preferred).

Note that trapping would occur at instruction 2, where the illegal combination was detected. it does not occur at 3, which would just be confusing.

An alternate plan is to just trap on any 10-bit instruction when the current mode is standard-then-compressed (at location 2), since the previous instruction (at location 1) could have instead switched to just standard mode instead of standard-then-compressed mode.

Comment 164 Alexandre Oliva 2020-12-01 01:33:46 GMT

it occurred to me that the reversal of M and N might have contributed to our miscommunication in my question about N without M.

now here's another bit that may be contributing to our miscommunication, and that it would be good to clear up, if nothing else because it might be hiding a major problem in the encoding design

I hear that the mode and length decoder, namely phase 1, only have to look at bits M and N of compressed insns to determine length of current insn and mode of the next, and that's what's pseudocoded.

me, I see no way to avoid also looking at Cmaj.m bits to tell whether M and N have the usual meaning, or whether we're looking at a '16b sub' insn, where the bit that usually holds N serves a different purpose.

I build my case on the following opcodes:

| 0111 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
| 1000 | BA2 | | 001.1 | 0 BA | BB  | M | crand

if the decoder naively looks at the bits usually assigned to N and M, namely 0 and 15, I think it could make decisions that AFAICT would be incorrect for subsequent insns

my assumption is that, for these insns depicted under 2.1.9, N is absent, so only M should take effect, as in the 10-bit encoding.  IMHO it would be surprising, bordering nonsensical, for part of the subopcode to force some of the cr ops (like crand, whose "N" is 1) to choose between only be followed by 32-bit then 16-bit or remain in 16-bit mode, while others (like crnand, whose "N" is 0) have to choose between permanent 32- or 16-bit.

now, consider a 16-bit crand insn is to be followed by several uncompressed insns, so it has M=0

the phase 1 decoder, however, looks at M and N bits and sees NM = 0b10, meaning 32-bit-then-16-bit

so the next insn will be decoded as an uncompressed 32-bit one, just as expected

but after it, we'll return to compressed mode, which is (to me) not expected

thus I claim the phase 1 decoder has to look at Cmaj.m, in addition to M and N, to tell that the bit that usually means N doesn't really mean that for that opcode.

is that really not so, i.e., are cr opcodes really to have their bit 0 interpreted as N even though it's also part of the subopcode?

Comment 165 Alexandre Oliva 2020-12-01 01:46:29 GMT

> could have instead switched to just standard mode instead of standard-then-compressed mode.

unless it's a crand and I'm wrong in comment 164 ;-)

Comment 166 Alexandre Oliva 2020-12-01 01:55:09 GMT

> what the heck i am doing working in computing when i keep getting things inverted

computing would be quite limited without inverters ;-)

Comment 167 Luke Kenneth Casson Leighton 2020-12-01 01:55:55 GMT

(In reply to Alexandre Oliva from comment #164)
l
> me, I see no way to avoid also looking at Cmaj.m bits to tell whether M and
> N have the usual meaning, or whether we're looking at a '16b sub' insn,
> where the bit that usually holds N serves a different purpose.
> 
> I build my case on the following opcodes:
> 
> | 0111 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
> | 1000 | BA2 | | 001.1 | 0 BA | BB  | M | crand

these are marked as 16bit only.  i.e. it is not possible to reach these without first going through one 10bit insn and into 16bit mode.


 
> now, consider a 16-bit crand insn is to be followed by several uncompressed
> insns, so it has M=0

... here yes you are right, we need to nake a choice, should the return be to v3.0B or to v3.0B-then-back-to-16



> is that really not so, i.e., are cr opcodes really to have their bit 0
> interpreted as N even though it's also part of the subopcode?

no not at all, that's why they don't have "N" in them.

i was looking for a way to jam cr ops into Compressed because they will be used for common predication operations in SV.

we do not have enough info to say if the mode should be implicitly 3.0b or 3.0bthen16 when M=0

i will make a TODO note.

Comment 168 Alexandre Oliva 2020-12-01 02:06:10 GMT

> no not at all, that's why they don't have "N" in them.

do you realize this implies phase 1 decoder will have to look at all 4 bits of Cmaj.m to tell whether or not the encoding has an N?

and, if there are to be 16-bit insns that have M but not N, we need to explicitly specify that M by itself takes the same meaning it does in 10-bit insns, instead of leaving that implicit.

Comment 169 Luke Kenneth Casson Leighton 2020-12-01 02:22:22 GMT

(In reply to Luke Kenneth Casson Leighton from comment #167)

> we do not have enough info to say if the mode should be implicitly 3.0b or
> 3.0bthen16 when M=0
> 
> i will make a TODO note.

nnope.  couldn't stand it.

instead, i cut the brownfield encoding space in half, by shifting the bits 0-4 over by one, making room for N.

this makes Cmaj.m=0b001.1 uniform and consistent when it comes to observing M and N, at the expense of halving the unallocated space in this weird sub-sub-space.

it's a reasonable tradeoff.

the downside is that crops can only target CR0-3 now.

Comment 170 Luke Kenneth Casson Leighton 2020-12-01 02:24:52 GMT

(In reply to Alexandre Oliva from comment #168)
> > no not at all, that's why they don't have "N" in them.
> 
> do you realize this implies phase 1 decoder will have to look at all 4 bits
> of Cmaj.m to tell whether or not the encoding has an N?

this is probably why i just reorg'd it, crossover, comment #169.  can you check again?

end of day here, leave it with you.

Comment 171 Alexandre Oliva 2020-12-01 03:02:58 GMT

cool, now the only weirdness left WRT M and N is in 2.1.3.  did you just miss those, or are there other reasons why they can/should be like that?

have a good one!

Comment 172 Luke Kenneth Casson Leighton 2020-12-02 13:33:32 GMT

(In reply to Alexandre Oliva from comment #171)
> cool, now the only weirdness left WRT M and N is in 2.1.3.  did you just
> miss those, or are there other reasons why they can/should be like that?

i missed reviewing them. did a minor reorg.

1) illegal, attn and nop are special cases that, in all cases, the Phase 1 can ignore.

2) one nop encoding had been missed: N=1,M=1 which is effectively named "nop.immediate" and that has been added now

3) at Phase 2 a decision can be made to re-analyse M and N and if M=0,N=0 throw an illegal instruction trap, otherwise for all other N,M it's a "nop".

4) there were some spare brownfield encodings which are 16bit mode only, enough for two instructions that takes a nonzero register.  i decided to make these mtcr and mfcr (moved into System section) and freed up some brownfield space formerly taken up by those.

Comment 173 Alexandre Oliva 2020-12-02 22:20:12 GMT

in comment 159, I asked what mode branch targets had to be in

we have to keep in mind that bl will eventually make the next insn a branch target for blr, so the insn after bl also has to satisfy the properties of branch targets.  therefore branch-and-link doesn't need for M or N ever: the (statically) next insn must be in uncompressed mode.

we also have to look into the implications of what the linker may have to change, both in the b-and-l and in the next insn, to see whether our bl insn can be successfully modified by the linker, if it needs to do that

Comment 174 Alexandre Oliva 2020-12-02 22:28:02 GMT

I'm also getting very concerned about how to preserve the switch-back-to-compressed-mode state across interrupts.  I can't work out how to do it.

if there's a register that holds this bit, it would have to be saved across the interrupt, but by the time we're running the interrupt handler code that would preserve it, it would already be overwritten

whereas if it's architecturally saved in another register by the architecture's start-interrupt behavior, then that save register might be overwritten by another interrupt before the first handler gets a chance to save it

lots of assumptions here, that might not hold, but we have to have that worked out and documented IMHO

Comment 175 Luke Kenneth Casson Leighton 2020-12-02 23:02:31 GMT

(In reply to Alexandre Oliva from comment #174)
> I'm also getting very concerned about how to preserve the
> switch-back-to-compressed-mode state across interrupts.  I can't work out
> how to do it.

it's ok it goes into MSR which is already saved by interrupts (SRR1) so no big deal.

yes it must all be documented.

Comment 176 Luke Kenneth Casson Leighton 2022-06-25 10:05:06 BST

we did a lot of work on this one, ultimately the decision was not
to proceed with it (right now) however it was still a lot of work
that would form the basis of a future Compressed Standard.

Comment 177 Jacob Lifshay 2022-07-05 02:02:37 BST

Marking this resolved, since all the work which we are going to do in the scope of NLNet.2019.10.046.Standards is completed.