we need a way to get rapid feedback on the effectiveness of the Compressed ISA encodings, so that they can be designed properly. the simplest approach: short bash scripts that analyse objdump disassembly, followed by counting instructions. however this starts to get complicated when the limited range of immediates kicks in, and it gets even more complicated when the register allocation could begin to best be redone, i.e. the existence of Compressed begins to affect which instruction patterns *should* have been used. what tools are best for the job here?
from aoliva Here's a hack that adds 'histogram' as a ppc disassembler option. It makes separate buckets according to binutils' own notion of ppc opcodes. I'm not sure that's the most useful way to go about it, but it ought to do as a first cut. I suppose counting significant bits of immediate and offset operands could be useful, but I haven't implemented that yet. I imagine that it might make sense to count branches in a single bucket. Any other thoughts on what else would be useful to have? $ ./objdump -M histogram -d ppc-gcc | sed '1,/^Opcode histogram/d' | sort -n | tail 3386 cmpdi (2c200000) 4306 nop (60000000) 4404 beq (41820000) 5219 b (48000000) 5715 bl (48000001) 8957 li (38000000) 9205 addi (38000000) 10167 std (f8000000) 10718 mr (7c000378) 18828 ld (e8000000)
(In reply to Luke Kenneth Casson Leighton from comment #1) > from aoliva > > Here's a hack that adds 'histogram' as a ppc disassembler option. do add the attachment here when you have a moment > It > makes separate buckets according to binutils' own notion of ppc opcodes. there is a case for both the pseudocode histogram (beq) and for the non-aliased version (bc) one will tell us the number of occurrences of the different CR usage (BO and BI fields), the other tells us in general how many branch-conditionals there are > I'm not sure that's the most useful way to go about it, but it ought to > do as a first cut. > > I suppose counting significant bits of immediate and offset operands > could be useful, but I haven't implemented that yet. > > I imagine that it might make sense to count branches in a single bucket. > > Any other thoughts on what else would be useful to have? register number usage. how many of each register is used. it may help also to break that down by opcode.
Created attachment 120 [details] here's the hack that adds -M histogram to ppc objdump -d
I didn't think breaking things down by registers was of much relevance, since register assignments are almost accidental. That a certain register ends up being used doesn't mean an allocation to fit a smaller register file wouldn't fit. So I don't see what useful information would follow from that break-down. If you see something I don't, please educate me.
-M histogram,raw will get buckets broken down by raw opcode rather than aliases. does this get you what you want WRT bc vs beq?
(In reply to Alexandre Oliva from comment #5) > -M histogram,raw will get buckets broken down by raw opcode rather than > aliases. does this get you what you want WRT bc vs beq? ahh that's it. star. (In reply to Alexandre Oliva from comment #4) > wouldn't fit. So I don't see what useful information would follow from that > break-down. If you see something I don't, please educate me. parent bug or #213. can't find it straight away. ppc64 ABI uses specific regs for certain purposes. targetting those makes the process of identifying the 32 bit instructions to replace with direct 16 bit equivalents that much easier. stack pointer is an obvious candidate. the idea here is not to go "hey let's make this so fantastically idealistically complex that we force ourselves to spend 5 manyears on compiler development". we have not got time to mess about like that. therefore "what regs are most commonly used" is a good first effort such that a straight 1-pass non-state-based replacement algorithm can be dropped in with very little effort and give us a huge drop in code size. later on heavy optimisation can take place by other people.
so, logical deduction time. question: how do we reasonably demonstrate an approximate compression ratio given that: a) the C instructions are a common subset b) the accessible registers likewise. answer: without near-total-rewrite of gcc llvm binutils, some simple one-line no-state pattern-matching will suffice: foreach insn(objdump): if not recognised opcode continue if not regs recognised by replacement continue this is made slightly more complex by it needing to recognise function name boundaries (to preserve glibc6 ABI semantics), as well as recognise the entry and exit from 10bit, 16bit and v3.0B 32bit modes.
(In reply to Luke Kenneth Casson Leighton from comment #7) > answer: without near-total-rewrite of gcc llvm binutils, some simple > one-line no-state pattern-matching will suffice: > > foreach insn(objdump): > if not recognised opcode > continue > if not regs recognised by replacement > continue next question: what's the most likely rapid way to do this assessment. * bash script involving sed regexs that can only be understood by someone who is hyperintelligent and non-dyslexic? (answer is self-evident from question) * c program or actual modification of binutils? answer: it's possibly a little more extensive than the histogram patch however might be surprisingly easy * 50 line non-OO throwaway python script that does line-by-line pattern recognition? answer: high probability of being least amount of work for highest clarity.
FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order: Special-purpose registers on ppc are: r0: constant zero/throw-away r1: stack pointer r2: thread-local storage pointer in 32-bit mode r2: non-minimal TOC register r10: EH return stack adjust register r11: static chain pointer r13: thread-local storage pointer in 64-bit mode r30: minimal-TOC/-fPIC/-fpic base register r31: frame pointer lr: return address register the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is: We allocate in the following order: fp0 (not saved or used for anything) fp13 - fp2 (not saved; incoming fp arg registers) fp1 (not saved; return value) fp31 - fp14 (saved; order given to save least number) cr7, cr5 (not saved or special) cr6 (not saved, but used for vector operations) cr1 (not saved, but used for FP operations) cr0 (not saved, but used for arithmetic operations) cr4, cr3, cr2 (saved) r9 (not saved; best for TImode) r10, r8-r4 (not saved; highest first for less conflict with params) r3 (not saved; return value register) r11 (not saved; later alloc to help shrink-wrap) r0 (not saved; cannot be base reg) r31 - r13 (saved; order given to save least number) r12 (not saved; if used for DImode or DFmode would use r13) ctr (not saved; when we have the choice ctr is better) lr (saved) r1, r2, ap, ca (fixed) v0 - v1 (not saved or used for anything) v13 - v3 (not saved; incoming vector arg registers) v2 (not saved; incoming vector arg reg; return value) v19 - v14 (not saved or used for anything) v31 - v20 (saved; order given to save least number) vrsave, vscr (fixed) sfp (fixed)
modifying binutils so as to count opcodes that we'd like to compress could be done in basically two ways: - add to include/opcode/ppc.h: #define PPC_OPCODE_COMPRESS 0x800000000000ull - in opcodes/ppc-opc.c, mark the opcodes under powerpc_opcodes that are candidate for compression with '| PPC_OPCODE_COMPRESS' in FLAGS (the fourth column) - in opcodes/ppc-dis.c, modify increment_opcode_counter to check for this flag, apply any other constraints on the operands, and count the instruction as compressed if it fits This would require a quick recompilation for every experiment. A more dynamic alternative would add a disassembler option or environment variable or whatever, handled in powerpc_init_dialect, enabling a file name to be specified that contains opcodes to regard as compressed (subject to the same constraints) I'm not sure the latter effort is worth the additional convenience (if any) it would provide. WDYT? As for the extra constraints for insns to fit, that's where I haven't quite grasped what's proposed yet to be able to generalize the transformations and limits from 32-bit to 16- (and/or 10-?) bit opcodes, if there is a general rule at all. I'm still trying to get acquainted with the preexisting opcode formats of PowerISA (do we have a PDF for 3.1? the official site insists that I bring down my Javascript firewalls against web blobs to download it, and I'd rather not) to be better able to make sense of https://libre-soc.org/openpower/sv/16_bit_compressed/ (that I see has changed since I first went through it; moving target much? :-)
(In reply to Alexandre Oliva from comment #9) > FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed > registers (assigned to special purposes) and register allocation order: > > Special-purpose registers on ppc are: > > r0: constant zero/throw-away > r1: stack pointer > r2: thread-local storage pointer in 32-bit mode > r2: non-minimal TOC register > r10: EH return stack adjust register > r11: static chain pointer > r13: thread-local storage pointer in 64-bit mode > r30: minimal-TOC/-fPIC/-fpic base register > r31: frame pointer ah! that's what i thought was SP. excellent. so we need to find out what the range is on the immediate, for this one. it's used a *lot*, same as SP. > lr: return address register > > the register allocation order in GCC (i.e., it takes the earliest available > register that fits the constraints) is: > > We allocate in the following order: > fp0 (not saved or used for anything) > fp13 - fp2 (not saved; incoming fp arg registers) > fp1 (not saved; return value) > fp31 - fp14 (saved; order given to save least number) these aren't contiguous, makes deciding the priority for C allocation something of a pain. hence the reason for seeing actual usage. > r9 (not saved; best for TImode) > r10, r8-r4 (not saved; highest first for less conflict with params) at least these should be mappable > r3 (not saved; return value register) and this
(In reply to Alexandre Oliva from comment #10) > modifying binutils so as to count opcodes that we'd like to compress could > be done in basically two ways: > > - add to include/opcode/ppc.h: > #define PPC_OPCODE_COMPRESS 0x800000000000ull > > - in opcodes/ppc-opc.c, mark the opcodes under powerpc_opcodes that are > candidate for compression with '| PPC_OPCODE_COMPRESS' in FLAGS (the fourth > column) > > - in opcodes/ppc-dis.c, modify increment_opcode_counter to check for this > flag, apply any other constraints on the operands, and count the instruction > as compressed if it fits > > This would require a quick recompilation for every experiment. and that means if trying to compare those side-by-side it's very challenging. whereas a bunch of throwaway 50 line external python scripts is trivial to rerun and compare. > I'm not sure the latter effort is worth the additional convenience (if any) > it would provide. WDYT? 50 line python scripts reading stdin piped from objdump raw seems a lot easier. > As for the extra constraints for insns to fit, that's where I haven't quite > grasped what's proposed yet to be able to generalize the transformations and > limits from 32-bit to 16- (and/or 10-?) bit opcodes, if there is a general > rule at all. OpenPOWER was never designed for dropping 16 bit on top of it. RISCV RVC was ground-up designed for mixed 16/32. we need to make some compromises: the 10 bit format makes best use of the remaining bits whilst transitioning to "full" 16 bit mode. the absolute simplest general rule is literally that from https://bugs.libre-soc.org/show_bug.cgi?id=532#c7 at this stage we really do not want to get into the knock-on effects of register allocation, effect on branches, and so on. anything which, by the very *existence* of Compressed, would result in the compiler emitting radically altered assember, is *not* in scope at this stage: thst is too early. however i would like to find out: * word-aligned SP/FP immediates is enough (for 32 bit LD/ST) this confirms whether shifting the immediate by 2 bits is ok * how many functions have 3, 4, 5 regs as parameters * what sort of range of immediates are there on cmpwi and cmpdi are they mostly low (small) or are they large. are they mostly against zero * which CRs are mostly used by branches > I'm still trying to get acquainted with the preexisting opcode > formats of PowerISA (do we have a PDF for 3.1? ah we are implementing v3.0B. search "raptorcs wiki v3.0b OpenPOWER spec" should come up top hit, no JS. > and I'd rather not) to be better able to make sense of > https://libre-soc.org/openpower/sv/16_bit_compressed/ (that I see has > changed since I first went through it; moving target much? :-) :) i use it to capture ongoing discussions. if i don't they are lost pretty much immediately. to get a rapid overview of OPF v3.0B without needing to go through a 1300 page PDF try these: https://libre-soc.org/openpower/isatables/ https://libre-soc.org/openpower/isa/ bear in mind FP is not added yet so not included in those tables.
(In reply to Alexandre Oliva from comment #10) > I'm still trying to get acquainted with the preexisting opcode > formats of PowerISA (do we have a PDF for 3.1? the official site insists > that I bring down my Javascript firewalls against web blobs to download it, > and I'd rather not) https://wiki.raptorcs.com/w/images/f/f5/PowerISA_public.v3.1.pdf That should be a direct download link.
Created attachment 121 [details] python program for insn histogram with per-operand counts Here's a python script that counts occurrences of raw opcodes (getting the same totals as with the objdump -M histogram patch) and, within them, counts occurrences of each operand (registers, conditions, and bit-lengths of immediates, offsets, and branch ranges).
(In reply to Alexandre Oliva from comment #14) > Created attachment 121 [details] > python program for insn histogram with per-operand counts > > Here's a python script that counts occurrences of raw opcodes (getting the > same totals as with the objdump -M histogram patch) and, within them, counts > occurrences of each operand (registers, conditions, and bit-lengths of > immediates, offsets, and branch ranges). ooo i looove it. i've added it to the ikiwiki git repo https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/insn-histogram.py;hb=HEAD i also did some augmentations to get global subtotals for int regs and Condition Registers. the numbers are quite fascinating.
Created attachment 122 [details] python script skeleton to estimate compression for a given insn selection This script is a skeleton to experiment with the selection of insns for compressed encoding, so far only supporting the 8-bit nop switching, just because I thought it was easier to model. Next, I'll improve/rewrite mode switching and selection insns to use 10-bit maybe-switch insns, and 16-bit insns that switch out of compressed mode for 1 or n insns, or remain in compressed mode with or without an extra 16-bit immediate. Wish me luck ;-)
I had not noticed before writing the above, but the 16-bit immediate compressed encoding is very very similar, in function and possibilities, to the extend-next-insn mode I'd suggested before. I'm not sure I see why one is too CISC-ish while the other isn't (one vs 5/6 bits to detect is probably a good hint ;-) but it seems to me it could very well be used to significantly extend the range of registers accessible in compressed mode, whether or not we go for 2-reg encoding in 16-bit insns, which is just what I'd hoped to accomplish with an extend-next opcode. This bit can even be used to make room for some extended opcodes if there are any compressed primary opcodes that have no use for additional operands, e.g. nop or unary operations, and even to reserve room for future extensions.
(In reply to Alexandre Oliva from comment #16) > Created attachment 122 [details] > python script skeleton to estimate compression for a given insn selection > > This script is a skeleton to experiment with the selection of insns for > compressed encoding, so far only supporting the 8-bit nop switching, just > because I thought it was easier to model. ah very cool. and that's very interesting: only a 6% compression ratio. with 35,000 transition-nops, if those were not there, that would be... err... *thinks*... 3 bits thrown out on the transition in, 8 on the transition out, that's the C.10bit which... guestimate you can get away with reducing by 2 bytes... if you take out the "transition" bytes it's 87% so a further 7% reduction. there must be more going on here. > Next, I'll improve/rewrite mode switching and selection insns to use 10-bit > maybe-switch insns, and 16-bit insns that switch out of compressed mode for > 1 or n insns, or remain in compressed mode with or without an extra 16-bit > immediate. Wish me luck ;-) :) can you do me a favour and cut/paste the program rather than "complicate" it by putting in switches / logic for a completely different scheme? it would be preferable that each script is entirely separate, minimal, contextually-focussed on one job and one job only, and consequently much shorter and clearer. btw i've simply added the script to the ikiwiki https://libre-soc.org/recentchanges/
(In reply to Alexandre Oliva from comment #17) > I had not noticed before writing the above, but the 16-bit immediate > compressed encoding is very very similar, in function and possibilities, to > the extend-next-insn mode I'd suggested before. similar but see the comments and detailed explanation provided at the Compressed bugreport (discussion to take place there, this is the statistical analysis bugreport) https://bugs.libre-soc.org/show_bug.cgi?id=238#c84 > I'm not sure I see why one is too CISC-ish while the other isn't (one vs 5/6 > bits to detect is probably a good hint ;-) but it seems to me it could very > well be used to significantly extend the range of registers accessible in > compressed mode, whether or not we go for 2-reg encoding in 16-bit insns, > which is just what I'd hoped to accomplish with an extend-next opcode. discussion of Compressed encodings under the Compressed bugreport, please, this one is for statistical analysis methods. once we've got some tools and techniques i'll close it whilst the Compressed one will remain open.
the compression ratio result in the skeleton is meaningless, I just added some random insns to the dictionary that encodes the compression decisions. The idea is for it to be filled in with actual insn encoding data, which we don't have at all for attempt 2. At ease, I won't mix up the different scripts with the different encoding logic. They will share some structure, but it's going to be significantly different scripts, more so because I'm going to codify the proposed attempt 1 encodings in it. Yeah, sorry, I messed up in the bug report number. I still haven't quite got into my mind the notion that there are separate ones for these related but very different purposes ;-)
(In reply to Alexandre Oliva from comment #20) > the compression ratio result in the skeleton is meaningless, I just added > some random insns to the dictionary that encodes the compression decisions. > The idea is for it to be filled in with actual insn encoding data, which we > don't have at all for attempt 2. ohh ok. > At ease, I won't mix up the different scripts with the different encoding > logic. They will share some structure, but it's going to be significantly > different scripts, more so because I'm going to codify the proposed attempt > 1 encodings in it. (*sotto voice* although it would be longer could you possibly avoid all but the most basic of regexs, i just... i can't read them! or, i can, but it's 5-10 minutes of looking at them) btw watch out for "nop" which is "ori r0, r0, 0" of which there are 32,000 in the disasm of bash, and i believe they're data not instructions. > Yeah, sorry, I messed up in the bug report number. I still haven't quite > got into my mind the notion that there are separate ones for these related > but very different purposes ;-) we learned over time that the discussions get really long, significantly slowing down bugzilla and the VM running it.
jacob wrote: Jacob Lifshay 3:38 AM (9 hours ago) to bugzilla-daemon On Sat, Nov 28, 2020, 16:04 <bugzilla-daemon@libre-soc.org> wrote: https://bugs.libre-soc.org/show_bug.cgi?id=532 --- Comment #21 from Luke Kenneth Casson Leighton <lkcl@lkcl.net> --- btw watch out for "nop" which is "ori r0, r0, 0" of which there are 32,000 in the disasm of bash, and i believe they're data not instructions. They're actually space for the dynamic linker to insert instructions to change the TOC register iirc. They're required by the ppc64le ABI spec.