original discussion at:
https://groups.google.com/d/msg/comp.arch/DeGUdZhh1Jc/LzRc61JSAQAJ

i raised an alternative idea last year: a Data Pointer. similar to a
Program Counter and/or Stack Pointer, the Data Pointer is incremented
implicitly by each operation requiring an immediate, and, in extreme
versions, it may be incremented *bit-wise* rather than on byte, word or
dword boundaries. opcodes requiring immediates then specify the number
of *bits* they wish to use, and the Data Pointer automatically
increments such that the next instruction loads from the subsequent
memory address, not the same one.

as an off-the-wall concept, i never thought through how you would
actually *set* the Data Pointer in the first place, nor how loops would
be dealt with. L1-caching of the constants (reading them and including
them in the I-stream) is not a big deal; however, separating them from
actual instructions does become a bit of a bind. perhaps the D-cache
could be used (treat them as data). loops are the biggest bugbear.

however an idea just occurred to me. with only a few bits being needed
to specify how *much* of the memory pointed to by the Data Pointer is
to be used as an immediate, the remaining (spare) bits can be taken as
an offset by which the Data Pointer is to be "wound back" (or forward).
perhaps even some mode-bits could be used to indicate, "New Data
Pointer Starts Here and there is a block of immediates of size NN
following this instruction":

    addi R0, R1, {dpencoding}

    dpencoding:
    11 10 9 8      | 7 6 5      | 4 3 2 1 0
    immblocklen x4 | immbytelen | immjump x4

something like that.

* immblocklen x 4 bytes (i.e. x 32 bits) is the space reserved
  following this instruction, to which the Data Pointer is to be
  immediately set (zero indicates "no change"). thus the fetch engine
  *knows* that it must add 4 x immblocklen bytes to the PC
* immbytelen - the length (in bytes? maybe bits?) of the immediate:
  0b000=8, 0b001=16 etc. after the immediate is read (immediately)
  from the DataPointer address, the DataPointer must be incremented by
  this amount, but only if immjump == 0
* immjump - a signed number of words by which the (rounded) DataPointer
  must "jump" (forward or backwards) *INSTEAD* of incrementing by
  immbytelen.

that should probably do the trick for a first iteration. analysis:

* compared to standard immediate-loading: for small immediates it's
  not as nice, because you have to use the 12-bit immediate space as a
  pointer. simple solution to that: only use this for long immediates,
  or use 1 extra bit to indicate "11-bit immediate or DataPointer"
  encoding
* with no "construction" of long immediates using 32-bit instructions,
  it's definitely much more compact.

variants include that immjump is *always* set, and is always relative
to DataPointer. this would actually be better, because if we assume
that when the DP is set (immblocklen != 0) the instruction wants the
*first* item @DP, we no longer need both immblocklen and immjump:

    7    | 6 5 4 3 2       | 1 0
    mode | DPlen_or_DPoffs | immlen

* mode == 0: DPlen x 4 bytes of "offsets" follow this instruction.
  - set DP = PC + 4
  - assume DPoffs = 0
  - immed = @[DP], of length immlen (bytes? words?)
  - add DPlen x 4 to PC
* mode == 1: DPoffs is in bits 6:2.
  - PC increments as normal (+4)
  - DP does *NOT CHANGE*
  - immed = @[DP + DPoffs*4], of length immlen (bytes? words?)
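to make the second (mode-based) encoding concrete, here is a minimal
python sketch of the decode, under stated assumptions: the
7 | 6 5 4 3 2 | 1 0 layout above, and immlen interpreted as a
power-of-two byte count (left as an open question above). function and
field names are illustrative only, not a definitive implementation.

    # illustrative sketch only: decodes the 8-bit mode-based dpencoding
    # described above.  field widths follow the assumed layout
    # 7 | 6 5 4 3 2 | 1 0 (mode | DPlen_or_DPoffs | immlen).

    def decode_dpencoding(enc, pc, dp):
        """return (new_pc, new_dp, imm_addr, imm_len_bytes) for one use."""
        mode    = (enc >> 7) & 0b1
        dpfield = (enc >> 2) & 0b11111   # DPlen (mode 0) or DPoffs (mode 1)
        immlen  = enc & 0b11             # assumed: 0=1 byte, 1=2, 2=4, 3=8
        imm_len_bytes = 1 << immlen

        if mode == 0:
            # a block of DPlen x 4 bytes of immediates follows this
            # instruction: point DP at it and skip the PC past it.
            new_dp   = pc + 4
            imm_addr = new_dp            # first item @DP
            new_pc   = pc + 4 + dpfield * 4
        else:
            # DP unchanged; immediate is at DP plus a word offset.
            new_dp   = dp
            imm_addr = dp + dpfield * 4
            new_pc   = pc + 4
        return new_pc, new_dp, imm_addr, imm_len_bytes

    # example: mode=0, DPlen=2 (8 bytes of constants follow), immlen=2
    print(decode_dpencoding(0b0_00010_10, pc=0x1000, dp=0))
    # -> (4108, 4100, 4100, 4): PC skips the 8-byte block, DP = 0x1004

note that in mode 0 the immediates block sits inline in the I-stream,
which is what lets the decoder compute the PC skip without extra state.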
this gives relatively few bits, and it would not be unreasonable to
snarf a few bits out of a "normal" 12-bit immediate range:

    12 11 10 9 8 | 7 6 5 4 3 2 1 0
    1  1  1  1 1 | { dpencoding }
    N  N  N  N N   N N N N N N N N   <- "normal" immediate

the only complexity would be compiler-side creation of tables that
initialise correctly and in reasonable proximity. also you'd either
lose one main register (one as normal SP, one now as DP), or it would
be required to add a special register (CSR), plus context-saving to
ensure it's not corrupted during interrupts.

l.

assume the 2nd scheme in the latter part of what i wrote
(absolute-addressed DataPointer rather than the incremental version):

    # some prior code which happened already to set DP,
    # or we just have to have a dummy instruction.
    0000 0000   mvi r0, DPlen=2   ; r0 discarded in this example
    0000 0004   LongConst1        ; could be 64 bit or 128 or 256 bit
    0000 0008   LongConst2        ; could be 64 bit or 128 or 256 bit
    0000 000c   bne x, 0, sety2
    0000 0010   mvi y, @DP#0      ; DP was set to 0000 0004 earlier
    0000 0014   jmp end
    0000 0018   mvi y, @DP#2      ; sety2
    0000 001c   ....              ; end

the alternative, to save "wasting" an instruction (if there really are
no other prior immediates), is to actually set y unconditionally, just
to get the DataPointer set up, then *conditionally* overwrite it.
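as a sanity-check of the escape encoding, here's a small python sketch,
assuming the 13-bit field above with the all-ones 5-bit prefix as the
escape marker (so the immediates whose top 5 bits are all ones are
precisely the range being "snarfed"). names are illustrative only.

    # illustrative sketch: distinguishing the dpencoding escape from a
    # normal immediate, given the assumed 13-bit field above.

    ESCAPE_PREFIX = 0b11111

    def classify_imm_field(field13):
        """return ('dp', dpencoding) or ('imm', sign-extended value)."""
        if (field13 >> 8) == ESCAPE_PREFIX:
            return ('dp', field13 & 0xff)    # low 8 bits: dpencoding
        # otherwise a normal immediate; sign-extend from 13 bits
        imm = field13 - (1 << 13) if field13 & (1 << 12) else field13
        return ('imm', imm)

    print(classify_imm_field(0b11111_00010100))  # -> ('dp', 20)
    print(classify_imm_field(0x1003))            # -> ('imm', -4093)

the cost of the escape is visible here: any "normal" immediate whose
top 5 bits happen to be all ones can no longer be encoded directly.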
A banked data scheme, like on the 65816, would be a slightly simpler version. Instead of incrementing a pointer, the data bank register contains the high bits, and the instruction contains the low bits. Such a setup also lets you access larger-than-11-bit immediates, and is basically MIPS's small data section made customizable. It would still be relocatable, just in 11-bit alignments. The linker would adjust the set-data-bank-register instructions on program load, for position-independent code.
(In reply to cand from comment #1)

> A banked data scheme, like on the 65816, would be a slightly simpler
> version. Instead of incrementing a pointer, the data bank register
> contains the high bits, and the instruction contains the low bits.

that's the 2nd version i came up with.

> Such a setup also lets you access larger-than-11-bit immediates,

however you still need bits (or an instruction) specifying the
immediate width, somehow. which is why i borrowed a couple of bits
from the immediate to do that.

> and is basically MIPS's small data section made customizable.

interesting. worth investigating.

> It would still be relocatable, just in 11-bit alignments. The linker
> would adjust the set-data-bank-register instructions on program
> load, for position-independent code.

nice. my feeling is we should take this seriously, because
large-immediate data loads are such a massive part of programs yet
they are highly space-inefficient.
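a quick sketch of the banked idea in python, purely illustrative: the
11-bit low-field width is taken from the comments above, everything
else (names, the bank-register width) is assumption.

    # illustrative sketch of the banked-data idea: the data bank
    # register (DBR) supplies the high address bits, the instruction's
    # 11-bit field supplies the low bits.

    LOW_BITS = 11

    def banked_address(dbr, imm11):
        """form a full data address from bank register + 11-bit field."""
        assert 0 <= imm11 < (1 << LOW_BITS)
        return (dbr << LOW_BITS) | imm11

    # relocation: the loader only rewrites the value placed in the DBR;
    # the 11-bit offsets inside the instructions stay as they are.
    print(hex(banked_address(dbr=0x12345, imm11=0x7ff)))  # -> 0x91a2fff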
Yes, the number of instructions POWER needs to load a 64-bit immediate is not very efficient.
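For reference, the canonical PowerPC sequence for an arbitrary 64-bit
constant is five instructions (lis / ori / rldicr / oris / ori): 20
bytes of code to materialise 8 bytes of data. A sketch that computes
the four 16-bit chunks and prints that sequence (register name and
formatting are illustrative; a real assembler additionally treats the
lis operand as signed):

    # computes the four 16-bit chunks of a 64-bit constant and prints
    # the usual 5-instruction PowerPC load sequence.

    def ppc_load_imm64(reg, value):
        value &= (1 << 64) - 1
        hh = (value >> 48) & 0xffff   # bits 63..48
        hl = (value >> 32) & 0xffff   # bits 47..32
        lh = (value >> 16) & 0xffff   # bits 31..16
        ll = value & 0xffff           # bits 15..0
        return [
            f"lis    {reg}, {hh:#x}",        # top 16 bits, shifted left 16
            f"ori    {reg}, {reg}, {hl:#x}",
            f"rldicr {reg}, {reg}, 32, 31",  # move upper 32 into place
            f"oris   {reg}, {reg}, {lh:#x}",
            f"ori    {reg}, {reg}, {ll:#x}",
        ]

    for insn in ppc_load_imm64("r3", 0x123456789abcdef0):
        print(insn)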
(In reply to cand from comment #3)

> Yes, the number of instructions POWER needs to load a 64-bit
> immediate is not very efficient.

mitch alsup mentioned that those comprise something like 6% of all
instructions (!). i'm blown away that this has been a missed
opportunity to significantly reduce code size and i-cache usage.
particularly as a reduction in i-cache size obeys a square-law
reduction in power consumption: therefore if we can get it down 5%
that's a 9.75% reduction in power (!)
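(quick arithmetic check of that claim, assuming power scales with the
square of the i-cache size:)

    # square-law assumption: power ~ size^2, so a 5% size reduction
    size_factor  = 1.0 - 0.05
    power_factor = size_factor ** 2                       # 0.9025
    print(f"{(1.0 - power_factor) * 100:.2f}% power reduction")  # 9.75%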
(In reply to Luke Kenneth Casson Leighton from comment #4)

> particularly as a reduction in i-cache size obeys a square-law
> reduction in power consumption.

I would expect the cache power usage to be linearly proportional to
its size in bits once it's big enough to split into many smaller sram
chunks, since each sram chunk would use about the same power and you'd
only need a linear number of them. I have not been able to find any
useful references for that though.

> therefore if we can get it down 5% that's a 9.75% reduction in power
> (!)

yes -- except that cache sizes are almost always a power of 2 (or,
rarely, 3 * 2^n), so the only effect is probably to reduce the access
rate.
(In reply to Jacob Lifshay from comment #5)

> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > particularly as a reduction in i-cache size obeys a square-law
> > reduction in power consumption.
>
> I would expect the cache power usage to be linearly proportional to
> its size in bits once it's big enough to split into many smaller
> sram chunks, since each sram chunk would use about the same power
> and you'd only need a linear number of them.

whilst the cache power usage itself is linear (that being one
dimension of the square effect), the increased size means that on
loops you cannot fit as much code of the average loop *into* that
cache, meaning that you end up with more L2 cache lookups. this i
believe is where the 2nd of the two dimensions of the square effect
comes in. it's quite... odd :)
(In reply to Luke Kenneth Casson Leighton from comment #6)

> (In reply to Jacob Lifshay from comment #5)
> > (In reply to Luke Kenneth Casson Leighton from comment #4)
> > > particularly as a reduction in i-cache size obeys a square-law
> > > reduction in power consumption.
> >
> > I would expect the cache power usage to be linearly proportional
> > to its size in bits once it's big enough to split into many
> > smaller sram chunks, since each sram chunk would use about the
> > same power and you'd only need a linear number of them.
>
> whilst the cache power usage itself is linear (that being one
> dimension of the square effect), the increased size means that on
> loops you cannot fit as much code of the average loop *into* that
> cache, meaning that you end up with more L2 cache lookups. this i
> believe is where the 2nd of the two dimensions of the square effect
> comes in.

I think you might be incorrect: increasing the size of the L1 cache
should increase the hit rate, which decreases the rate of accessing L2
and memory. Having a bigger cache allows fitting more code into the
cache, reducing the miss rate. So, increasing the cache size would
have a slightly less than linear increase in power consumption, due to
decreasing L2 accesses.
https://reverseengineering.stackexchange.com/questions/21944/powerpc-toc-and-sda

DataPointer turns out to be an extension of the concept of PowerPC's
"TOC" register (r2).