Bug 532 - discuss iterative approach for statistical analysis of effectiveness of Compressed PowerISA encoding
Summary: discuss iterative approach for statistical analysis of effectiveness of Compr...
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification (show other bugs)
Version: unspecified
Hardware: Other Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 238
  Show dependency treegraph
 
Reported: 2020-11-20 13:49 GMT by Luke Kenneth Casson Leighton
Modified: 2020-11-29 13:05 GMT (History)
3 users (show)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
here's the hack that adds -M histogram to ppc objdump -d (4.39 KB, patch)
2020-11-21 15:28 GMT, Alexandre Oliva
Details | Diff
python program for insn histogram with per-operand counts (1.87 KB, text/plain)
2020-11-23 05:41 GMT, Alexandre Oliva
Details
python script skeleton to estimate compression for a given insn selection (5.67 KB, text/x-python)
2020-11-28 23:27 GMT, Alexandre Oliva
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-11-20 13:49:41 GMT
we need a way to get rapid feedback on the effectiveness of the Compressed ISA encodings, so that they can be designed properly.

the simplest approach: short bash scripts that analyse objdump disassembly, followed by counting instructions.

however this starts to get complicated when the limited range of immediates kicks in, and it gets even more complicated when the register allocation could begin to best be redone, i.e. the existence of Compressed begins to affect which instruction patterns *should* have been used.

what tools are best for the job here?
Comment 1 Luke Kenneth Casson Leighton 2020-11-21 14:45:57 GMT
from aoliva

Here's a hack that adds 'histogram' as a ppc disassembler option.  It
makes separate buckets according to binutils' own notion of ppc opcodes.
I'm not sure that's the most useful way to go about it, but it ought to
do as a first cut.

I suppose counting significant bits of immediate and offset operands
could be useful, but I haven't implemented that yet.

I imagine that it might make sense to count branches in a single bucket.

Any other thoughts on what else would be useful to have?


$ ./objdump -M histogram -d ppc-gcc | sed '1,/^Opcode histogram/d' | sort -n | tail
    3386 cmpdi (2c200000)
    4306 nop (60000000)
    4404 beq (41820000)
    5219 b (48000000)
    5715 bl (48000001)
    8957 li (38000000)
    9205 addi (38000000)
   10167 std (f8000000)
   10718 mr (7c000378)
   18828 ld (e8000000)
Comment 2 Luke Kenneth Casson Leighton 2020-11-21 14:54:21 GMT
(In reply to Luke Kenneth Casson Leighton from comment #1)
> from aoliva
> 
> Here's a hack that adds 'histogram' as a ppc disassembler option. 

do add the attachment here when you have a moment

> It
> makes separate buckets according to binutils' own notion of ppc opcodes.

there is a case for both the pseudocode histogram (beq) and for the non-aliased version (bc)

one will tell us the number of occurrences of the different CR usage (BO and BI fields), the other tells us in general how many branch-conditionals there are

> I'm not sure that's the most useful way to go about it, but it ought to
> do as a first cut.
> 
> I suppose counting significant bits of immediate and offset operands
> could be useful, but I haven't implemented that yet.
> 
> I imagine that it might make sense to count branches in a single bucket.
> 
> Any other thoughts on what else would be useful to have?

register number usage.  how many of each register is used.

it may help also to break that down by opcode.
Comment 3 Alexandre Oliva 2020-11-21 15:28:02 GMT
Created attachment 120 [details]
here's the hack that adds -M histogram to ppc objdump -d
Comment 4 Alexandre Oliva 2020-11-21 15:34:30 GMT
I didn't think breaking things down by registers was of much relevance, since register assignments are almost accidental.  That a certain register ends up being used doesn't mean an allocation to fit a smaller register file wouldn't fit.  So I don't see what useful information would follow from that break-down.  If you see something I don't, please educate me.
Comment 5 Alexandre Oliva 2020-11-21 15:41:26 GMT
-M histogram,raw will get buckets broken down by raw opcode rather than aliases. does this get you what you want WRT bc vs beq?
Comment 6 Luke Kenneth Casson Leighton 2020-11-21 16:07:49 GMT
(In reply to Alexandre Oliva from comment #5)
> -M histogram,raw will get buckets broken down by raw opcode rather than
> aliases. does this get you what you want WRT bc vs beq?

ahh that's it.  star.

(In reply to Alexandre Oliva from comment #4)

> wouldn't fit.  So I don't see what useful information would follow from that
> break-down.  If you see something I don't, please educate me.

parent bug or #213.  can't find it straight away.  ppc64 ABI uses specific regs for certain purposes.  targetting those makes the process of identifying the 32 bit instructions to replace with direct 16 bit equivalents that much easier.

stack pointer is an obvious candidate.

the idea here is not to go "hey let's make this so fantastically idealistically complex that we force ourselves to spend 5 manyears on compiler development".

we have not got time to mess about like that.

therefore "what regs are most commonly used" is a good first effort such that a straight 1-pass non-state-based replacement algorithm can be dropped in with very little effort and give us a huge drop in code size.

later on heavy optimisation can take place by other people.
Comment 7 Luke Kenneth Casson Leighton 2020-11-21 18:33:23 GMT
so, logical deduction time.  question: how do we reasonably demonstrate an approximate compression ratio given that:

a) the C instructions are a common subset
b) the accessible registers likewise.

answer: without near-total-rewrite of gcc llvm binutils, some simple one-line no-state pattern-matching will suffice:

  foreach insn(objdump):
    if not recognised opcode
       continue
    if not regs recognised by replacement
       continue

this is made slightly more complex by it needing to recognise function name boundaries (to preserve glibc6 ABI semantics), as well as recognise the entry and exit from 10bit, 16bit and v3.0B 32bit modes.
Comment 8 Luke Kenneth Casson Leighton 2020-11-21 19:49:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #7)

> answer: without near-total-rewrite of gcc llvm binutils, some simple
> one-line no-state pattern-matching will suffice:
> 
>   foreach insn(objdump):
>     if not recognised opcode
>        continue
>     if not regs recognised by replacement
>        continue

next question: what's the most likely rapid way to do this assessment.

* bash script involving sed regexs that can only be understood by someone who is hyperintelligent and non-dyslexic?

   (answer is self-evident from question)

* c program or actual modification of binutils?

   answer: it's possibly a little more extensive than the histogram patch however  might be surprisingly easy

* 50 line non-OO throwaway python script that does line-by-line pattern recognition?

  answer: high probability of being least amount of work for highest clarity.
Comment 9 Alexandre Oliva 2020-11-21 21:12:18 GMT
FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order:

Special-purpose registers on ppc are:

r0: constant zero/throw-away
r1: stack pointer
r2: thread-local storage pointer in 32-bit mode
r2: non-minimal TOC register
r10: EH return stack adjust register
r11: static chain pointer
r13: thread-local storage pointer in 64-bit mode
r30: minimal-TOC/-fPIC/-fpic base register
r31: frame pointer
lr: return address register

the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is:

   We allocate in the following order:
	fp0		(not saved or used for anything)
	fp13 - fp2	(not saved; incoming fp arg registers)
	fp1		(not saved; return value)
	fp31 - fp14	(saved; order given to save least number)
	cr7, cr5	(not saved or special)
	cr6		(not saved, but used for vector operations)
	cr1		(not saved, but used for FP operations)
	cr0		(not saved, but used for arithmetic operations)
	cr4, cr3, cr2	(saved)
	r9		(not saved; best for TImode)
	r10, r8-r4	(not saved; highest first for less conflict with params)
	r3		(not saved; return value register)
	r11		(not saved; later alloc to help shrink-wrap)
	r0		(not saved; cannot be base reg)
	r31 - r13	(saved; order given to save least number)
	r12		(not saved; if used for DImode or DFmode would use r13)
	ctr		(not saved; when we have the choice ctr is better)
	lr		(saved)
	r1, r2, ap, ca	(fixed)
	v0 - v1		(not saved or used for anything)
	v13 - v3	(not saved; incoming vector arg registers)
	v2		(not saved; incoming vector arg reg; return value)
	v19 - v14	(not saved or used for anything)
	v31 - v20	(saved; order given to save least number)
	vrsave, vscr	(fixed)
	sfp		(fixed)
Comment 10 Alexandre Oliva 2020-11-21 21:53:17 GMT
modifying binutils so as to count opcodes that we'd like to compress could be done in basically two ways:

- add to include/opcode/ppc.h:
#define PPC_OPCODE_COMPRESS 0x800000000000ull

- in opcodes/ppc-opc.c, mark the opcodes under powerpc_opcodes that are candidate for compression with '| PPC_OPCODE_COMPRESS' in FLAGS (the fourth column)

- in opcodes/ppc-dis.c, modify increment_opcode_counter to check for this flag, apply any other constraints on the operands, and count the instruction as compressed if it fits

This would require a quick recompilation for every experiment.

A more dynamic alternative would add a disassembler option or environment variable or whatever, handled in powerpc_init_dialect, enabling a file name to be specified that contains opcodes to regard as compressed (subject to the same constraints)

I'm not sure the latter effort is worth the additional convenience (if any) it would provide.  WDYT?

As for the extra constraints for insns to fit, that's where I haven't quite grasped what's proposed yet to be able to generalize the transformations and limits from 32-bit to 16- (and/or 10-?) bit opcodes, if there is a general rule at all.  I'm still trying to get acquainted with the preexisting opcode formats of PowerISA (do we have a PDF for 3.1? the official site insists that I bring down my Javascript firewalls against web blobs to download it, and I'd rather not) to be better able to make sense of https://libre-soc.org/openpower/sv/16_bit_compressed/ (that I see has changed since I first went through it; moving target much? :-)
Comment 11 Luke Kenneth Casson Leighton 2020-11-21 22:19:25 GMT
(In reply to Alexandre Oliva from comment #9)
> FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed
> registers (assigned to special purposes) and register allocation order:
> 
> Special-purpose registers on ppc are:
> 
> r0: constant zero/throw-away
> r1: stack pointer
> r2: thread-local storage pointer in 32-bit mode
> r2: non-minimal TOC register
> r10: EH return stack adjust register
> r11: static chain pointer
> r13: thread-local storage pointer in 64-bit mode
> r30: minimal-TOC/-fPIC/-fpic base register
> r31: frame pointer

ah! that's what i thought was SP. excellent.  so we need to find out what the range is on the immediate, for this one.  it's used a *lot*, same as SP.


> lr: return address register
> 
> the register allocation order in GCC (i.e., it takes the earliest available
> register that fits the constraints) is:
> 
>    We allocate in the following order:
> 	fp0		(not saved or used for anything)
> 	fp13 - fp2	(not saved; incoming fp arg registers)
> 	fp1		(not saved; return value)
> 	fp31 - fp14	(saved; order given to save least number)

these aren't contiguous, makes deciding the priority for C allocation something of a pain.  hence the reason for seeing actual usage.


> 	r9		(not saved; best for TImode)
> 	r10, r8-r4	(not saved; highest first for less conflict with params)

at least these should be mappable

> 	r3		(not saved; return value register)

and this
Comment 12 Luke Kenneth Casson Leighton 2020-11-21 22:41:38 GMT
(In reply to Alexandre Oliva from comment #10)
> modifying binutils so as to count opcodes that we'd like to compress could
> be done in basically two ways:
> 
> - add to include/opcode/ppc.h:
> #define PPC_OPCODE_COMPRESS 0x800000000000ull
> 
> - in opcodes/ppc-opc.c, mark the opcodes under powerpc_opcodes that are
> candidate for compression with '| PPC_OPCODE_COMPRESS' in FLAGS (the fourth
> column)
> 
> - in opcodes/ppc-dis.c, modify increment_opcode_counter to check for this
> flag, apply any other constraints on the operands, and count the instruction
> as compressed if it fits
> 
> This would require a quick recompilation for every experiment.

and that means if trying to compare those side-by-side it's very challenging.  whereas a bunch of throwaway 50 line external python scripts is trivial to rerun and compare.

> I'm not sure the latter effort is worth the additional convenience (if any)
> it would provide.  WDYT?

50 line python scripts reading stdin piped from objdump raw seems a lot easier.
 
> As for the extra constraints for insns to fit, that's where I haven't quite
> grasped what's proposed yet to be able to generalize the transformations and
> limits from 32-bit to 16- (and/or 10-?) bit opcodes, if there is a general
> rule at all.

OpenPOWER was never designed for dropping 16 bit on top of it.

RISCV RVC was ground-up designed for mixed 16/32.

we need to make some compromises: the 10 bit format makes best use of the remaining bits whilst transitioning to "full" 16 bit mode.

the absolute simplest general rule is literally that from https://bugs.libre-soc.org/show_bug.cgi?id=532#c7

at this stage we really do not want to get into the knock-on effects of register allocation, effect on branches, and so on.  anything which, by the very *existence* of Compressed, would result in the compiler emitting radically altered assember, is *not* in scope at this stage: thst is too early.

however i would like to find out:

* word-aligned SP/FP immediates is enough
  (for 32 bit LD/ST)

  this confirms whether shifting the
  immediate by 2 bits is ok

* how many functions have 3, 4, 5 regs as
  parameters

* what sort of range of immediates are
  there on cmpwi and cmpdi

  are they mostly low (small) or are they
  large.  are they mostly against zero

* which CRs are mostly used by branches



>  I'm still trying to get acquainted with the preexisting opcode
> formats of PowerISA (do we have a PDF for 3.1?

ah we are implementing v3.0B.  search "raptorcs wiki v3.0b OpenPOWER spec" should come up top hit, no JS.


> and I'd rather not) to be better able to make sense of
> https://libre-soc.org/openpower/sv/16_bit_compressed/ (that I see has
> changed since I first went through it; moving target much? :-)

:) i use it to capture ongoing discussions.  if i don't they are lost pretty much immediately.


to get a rapid overview of OPF v3.0B without needing to go through a 1300 page PDF try these:

https://libre-soc.org/openpower/isatables/
https://libre-soc.org/openpower/isa/

bear in mind FP is not added yet so not included in those tables.
Comment 13 Jacob Lifshay 2020-11-22 01:37:58 GMT
(In reply to Alexandre Oliva from comment #10)
> I'm still trying to get acquainted with the preexisting opcode
> formats of PowerISA (do we have a PDF for 3.1? the official site insists
> that I bring down my Javascript firewalls against web blobs to download it,
> and I'd rather not)

https://wiki.raptorcs.com/w/images/f/f5/PowerISA_public.v3.1.pdf

That should be a direct download link.
Comment 14 Alexandre Oliva 2020-11-23 05:41:48 GMT
Created attachment 121 [details]
python program for insn histogram with per-operand counts

Here's a python script that counts occurrences of raw opcodes (getting the same totals as with the objdump -M histogram patch) and, within them, counts occurrences of each operand (registers, conditions, and bit-lengths of immediates, offsets, and branch ranges).
Comment 15 Luke Kenneth Casson Leighton 2020-11-24 01:05:39 GMT
(In reply to Alexandre Oliva from comment #14)
> Created attachment 121 [details]
> python program for insn histogram with per-operand counts
> 
> Here's a python script that counts occurrences of raw opcodes (getting the
> same totals as with the objdump -M histogram patch) and, within them, counts
> occurrences of each operand (registers, conditions, and bit-lengths of
> immediates, offsets, and branch ranges).

ooo i looove it.  i've added it to the ikiwiki git repo
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/insn-histogram.py;hb=HEAD

i also did some augmentations to get global subtotals for int regs and
Condition Registers.  the numbers are quite fascinating.
Comment 16 Alexandre Oliva 2020-11-28 23:27:37 GMT
Created attachment 122 [details]
python script skeleton to estimate compression for a given insn selection

This script is a skeleton to experiment with the selection of insns for compressed encoding, so far only supporting the 8-bit nop switching, just because I thought it was easier to model.  

Next, I'll improve/rewrite mode switching and selection insns to use 10-bit maybe-switch insns, and 16-bit insns that switch out of compressed mode for 1 or n insns, or remain in compressed mode with or without an extra 16-bit immediate.  Wish me luck ;-)
Comment 17 Alexandre Oliva 2020-11-28 23:41:36 GMT
I had not noticed before writing the above, but the 16-bit immediate compressed encoding is very very similar, in function and possibilities, to the extend-next-insn mode I'd suggested before.

I'm not sure I see why one is too CISC-ish while the other isn't (one vs 5/6 bits to detect is probably a good hint ;-) but it seems to me it could very well be used to significantly extend the range of registers accessible in compressed mode, whether or not we go for 2-reg encoding in 16-bit insns, which is just what I'd hoped to accomplish with an extend-next opcode.

This bit can even be used to make room for some extended opcodes if there are any compressed primary opcodes that have no use for additional operands, e.g. nop or unary operations, and even to reserve room for future extensions.
Comment 18 Luke Kenneth Casson Leighton 2020-11-28 23:45:26 GMT
(In reply to Alexandre Oliva from comment #16)
> Created attachment 122 [details]
> python script skeleton to estimate compression for a given insn selection
> 
> This script is a skeleton to experiment with the selection of insns for
> compressed encoding, so far only supporting the 8-bit nop switching, just
> because I thought it was easier to model.  

ah very cool.  and that's very interesting: only a 6% compression ratio.
with 35,000 transition-nops, if those were not there, that would be...
err... *thinks*... 3 bits thrown out on the transition in, 8 on the
transition out, that's the C.10bit which... guestimate you can get
away with reducing by 2 bytes... if you take out the "transition" bytes
it's 87% so a further 7% reduction.

there must be more going on here.

> Next, I'll improve/rewrite mode switching and selection insns to use 10-bit
> maybe-switch insns, and 16-bit insns that switch out of compressed mode for
> 1 or n insns, or remain in compressed mode with or without an extra 16-bit
> immediate.  Wish me luck ;-)

:)

can you do me a favour and cut/paste the program rather than "complicate"
it by putting in switches / logic for a completely different scheme?

it would be preferable that each script is entirely separate, minimal,
contextually-focussed on one job and one job only, and consequently
much shorter and clearer.

btw i've simply added the script to the ikiwiki
https://libre-soc.org/recentchanges/
Comment 19 Luke Kenneth Casson Leighton 2020-11-28 23:48:55 GMT
(In reply to Alexandre Oliva from comment #17)
> I had not noticed before writing the above, but the 16-bit immediate
> compressed encoding is very very similar, in function and possibilities, to
> the extend-next-insn mode I'd suggested before.

similar but see the comments and detailed explanation provided at the
Compressed bugreport (discussion to take place there, this is the
statistical analysis bugreport)

https://bugs.libre-soc.org/show_bug.cgi?id=238#c84

> I'm not sure I see why one is too CISC-ish while the other isn't (one vs 5/6
> bits to detect is probably a good hint ;-) but it seems to me it could very
> well be used to significantly extend the range of registers accessible in
> compressed mode, whether or not we go for 2-reg encoding in 16-bit insns,
> which is just what I'd hoped to accomplish with an extend-next opcode.

discussion of Compressed encodings under the Compressed bugreport, please,
this one is for statistical analysis methods.  once we've got some tools
and techniques i'll close it whilst the Compressed one will remain open.
Comment 20 Alexandre Oliva 2020-11-28 23:55:24 GMT
the compression ratio result in the skeleton is meaningless, I just added some random insns to the dictionary that encodes the compression decisions.  The idea is for it to be filled in with actual insn encoding data, which we don't have at all for attempt 2.

At ease, I won't mix up the different scripts with the different encoding logic.  They will share some structure, but it's going to be significantly different scripts, more so because I'm going to codify the proposed attempt 1 encodings in it.


Yeah, sorry, I messed up in the bug report number.  I still haven't quite got into my mind the notion that there are separate ones for these related but very different purposes ;-)
Comment 21 Luke Kenneth Casson Leighton 2020-11-29 00:04:08 GMT
(In reply to Alexandre Oliva from comment #20)
> the compression ratio result in the skeleton is meaningless, I just added
> some random insns to the dictionary that encodes the compression decisions. 
> The idea is for it to be filled in with actual insn encoding data, which we
> don't have at all for attempt 2.

ohh ok.

> At ease, I won't mix up the different scripts with the different encoding
> logic.  They will share some structure, but it's going to be significantly
> different scripts, more so because I'm going to codify the proposed attempt
> 1 encodings in it.

(*sotto voice* although it would be longer could you possibly avoid all but
  the most basic of regexs, i just... i can't read them!  or, i can, but it's
  5-10 minutes of looking at them)

btw watch out for "nop" which is "ori r0, r0, 0" of which there are 32,000
in the disasm of bash, and i believe they're data not instructions.


> Yeah, sorry, I messed up in the bug report number.  I still haven't quite
> got into my mind the notion that there are separate ones for these related
> but very different purposes ;-)

we learned over time that the discussions get really long, significantly
slowing down bugzilla and the VM running it.
Comment 22 Luke Kenneth Casson Leighton 2020-11-29 13:05:25 GMT
jacob wrote:

Jacob Lifshay
	
3:38 AM (9 hours ago)
	
to bugzilla-daemon
On Sat, Nov 28, 2020, 16:04 <bugzilla-daemon@libre-soc.org> wrote:

    https://bugs.libre-soc.org/show_bug.cgi?id=532

    --- Comment #21 from Luke Kenneth Casson Leighton <lkcl@lkcl.net> ---
    btw watch out for "nop" which is "ori r0, r0, 0" of which there are 32,000
    in the disasm of bash, and i believe they're data not instructions.


They're actually space for the dynamic linker to insert instructions to change the TOC register iirc. They're required by the ppc64le ABI spec.