Bug 286 - DataPointer concept: long-immediate references
Summary: DataPointer concept: long-immediate references
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-04-14 18:02 BST by Luke Kenneth Casson Leighton
Modified: 2020-11-22 04:07 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLNet.2019.10.Formal
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-04-14 18:02:03 BST
original discussion at:
https://groups.google.com/d/msg/comp.arch/DeGUdZhh1Jc/LzRc61JSAQAJ


i raised an alternative idea last year: a Data Pointer.  similar to a
Program Counter and/or Stack Pointer, the Data Pointer is incremented
implicitly by each operation requiring an immediate, and, in extreme
versions, that may be incremented *bit-wise* rather than on byte, word
or dword boundaries.

opcodes requiring immediates then specify the number of *bits* they wish to use. the Data Pointer automatically increments such that the next instruction
loads from the subsequent memory address, not the same one.

as an off-the-wall concept, i never thought through how you would actually *set* the Data Pointer in the first place, nor how loops would be dealt with.

L1-cacheing of the constants (and reading them and including them in the
I-stream) is not a big deal.  however separating them from actual instructions
does actually become a bit of a bind.  perhaps the D-cache could be used
(treat them as data).

loops are the biggest bugbear.  however an idea just occurred to me.
with only a few bits being needed to specify how *much* of the memory pointed
to by the Data Pointer is to be used as an immediate, the remaining (spare)
bits can be taken as an offset by which the Data Pointer is to be "wound back"
(or forward).

perhaps even some mode-bits could be used to indicate, "New Data Pointer Starts
Here and there is a block of immediates of size NN following this instruction":

addi R0, R1, {dpencoding}

dpencoding:

11 10 9 8 7    | 7 6        | 5 4 3 2 1 0  |
immblocklen x4 | immbytelen | immjump x4   |

something like that.

* immblocklen x 32 is the space reserved following this instruction,
  to which the Data Pointer is to be immediately set (zero indicates
  "no change").  thus the PC *knows* that it must add 4ximmblocklen
  bytes to the PC
* immbytelen - the length (in bytes? maybe bits?) of the immediate,
  0b000=8, 0b001=16 etc. etc.  after the immediate is read (immediately)
  from the DataPointer address, the DataPointer must be incremented
  by this amount, only if immjump == 0
* immjump - a signed number of words that (rounded) the DataPointer
  must "jump" by (forward or backwards) *INSTEAD* of incrementing
  by immbytelen.

that should probably do the trick for a first iteration.

analysis:

* compared to standard immediate-loading: for small immediates it's not
  as nice because you have to use the 12-bit immediate space as a pointer
  therefore, simple solution to that: only use this for long immediates,
  or, use 1 extra bit to indicate "11-bit immediate or DataPointer"
  encoding

* with no "construction" of long immediates using 32-bit stuff, it's
  definitely much more compact.


variants include that immjump is *always* set, and is always relative
to DataPointer.  this would actually be better because if we assume
that when the DP is set (immblocklen != 0) that instruction wants
the *first* item @DP, we no longer need both immblocklen and immjump:

7    | 6 5 4 3 2       | 1 0    |
mode | DPlen_or_DPoffs | immlen |

* mode == 0: DPlen x 4 bytes of "offsets" follow this instruction.
             - set DP = PC + 4
             - assume DPoffs = 0.
             - immed = @[DP] of length immlen (bytes? words?
             - add DPlen x 4 to PC

* mode == 1: DPoffs is in bytes 6:2.
             - PC increments as normal (+4)
             - DP does *NOT CHANGE*
             - immed = @[DP + DPoffs*4] of length immlen (bytes? words?)

this gives relatively few bits, and it would not be unreasonable to snarf
a few bits out of a "normal" 12-bit immediate range:

12 11 10 9 8 |   7 6 5 4 3 2 1 0
1  1   1 1 1 | { dpencoding }
N  N   N N N     N N N N N N N N - "normal" immediate


the only complexity would be compiler-side creation of tables that initialise
correctly and in reasonable proximity.

also you'd either lose one main register (one as normal SP, one now as DP)
or it would be required to add a special register (CSR), plus context-saving
to ensure it's not corrupted during interrupts.

l.




assume the 2nd scheme in the latter part of what i wrote (absolute-addressed
DataPointer rather than the incremental-version):

# some prior code which happened already to set DP
# or we just have to have a dummy instruction.
0000 0000    mvi  r0,  DPlen=2   ; r0 discarded in this example
0000 0004    LongConst1          ; could be 64 bit or 128 or 256 bit
0000 0008    LongConst2          ; could be 64 bit or 128 or 256 bit
0000 000c    bne   x,  0, sety2
0000 0010    mvi   y, @DP#0   ; DP was set to 0000 0004 earlier
0000 0014    jmp   end
0000 0018    mvi   y, @DP#2   ; sety2
0000 001c    ....             ; end

the alternative, to save "wasting" an instruction (if there really
are no other prior immediates) is to actually set y unconditionally,
just to get the DataPointer set up then *conditionally* overwrite it.
Comment 1 cand 2020-04-15 07:41:16 BST
A banked data scheme, like on the 65816, would be a slightly simpler version. Instead of incrementing a pointer, the data bank register contains the high bits, and the instruction contains the low bits. Such a setup also lets you access larger-than-11-bit immediates, and is basically MIPS's small data section made customizable.

It would still be relocatable, just in 11 bit alignments. The linker would adjust the set-data-bank-register instructions on program load, for position-independent code.
Comment 2 Luke Kenneth Casson Leighton 2020-04-15 10:10:58 BST
(In reply to cand from comment #1)
> A banked data scheme, like on the 65816, would be a slightly simpler
> version. Instead of incrementing a pointer, the data bank register contains
> the high bits, and the instruction contains the low bits.

that's the 2nd version i came up with

> Such a setup also
> lets you access larger-than-11-bit immediates,

however you still need bits (or an instruction) specifying the immediate width, somehow.

which is why i borrowed a couple of bits from the immediate to do that.

> and is basically MIPS's small
> data section made customizable.

interesting. worth investigating.

> It would still be relocatable, just in 11 bit alignments. The linker would
> adjust the set-data-bank-register instructions on program load, for
> position-independent code.

nice.

my feeling is we should take thus seriously because large immediate data load is such a massive part of programs yet they are highly space inefficient.
Comment 3 cand 2020-04-15 10:23:47 BST
Yes, the amount of instructions POWER needs to load a 64-bit immediate is not very efficient.
Comment 4 Luke Kenneth Casson Leighton 2020-04-15 11:00:52 BST
(In reply to cand from comment #3)
> Yes, the amount of instructions POWER needs to load a 64-bit immediate is
> not very efficient.

mitch alsup mentioned that those comprise something like 6% of all instructions
(!).  i'm blown away that this has been a missed opportunity to significantly
reduce code size and i-cache usage.  particularly as a reduction in i-cache size obeys a square law reduction in power consumption.  therefore if we can get it down 5% that's a 10.25% reduction in power (!)
Comment 5 Jacob Lifshay 2020-04-15 19:44:15 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
>   particularly as a reduction in i-cache
> size obeys a square law reduction in power consumption.

I would expect the cache power usage to be linearly proportional to it's size in bits once it's big enough to split into many smaller sram chunks, since each sram chunk would use about the same power and you'd only need a linear number of them. I have not been able to find any useful references for that though.

>  therefore if we can
> get it down 5% that's a 10.25% reduction in power (!)

yes -- except that cache sizes are almost always a power of 2 in size (or rarely 3 * 2^n), so the only effect is probably to reduce the access rate.
Comment 6 Luke Kenneth Casson Leighton 2020-04-15 20:22:38 BST
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> >   particularly as a reduction in i-cache
> > size obeys a square law reduction in power consumption.
> 
> I would expect the cache power usage to be linearly proportional to it's
> size in bits once it's big enough to split into many smaller sram chunks,
> since each sram chunk would use about the same power and you'd only need a
> linear number of them.

whilst the cache power usage itself is linear (that being one dimension
of the square effect), the increased size means that on loops you cannot
fit as much code of the average loop *into* that cache, meaning that you
end up with more L2 cache lookups.  this i believe is where the 2nd of
the two dimensions of the square effect comes in.

it's quite... odd :)
Comment 7 Jacob Lifshay 2020-04-15 20:40:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > (In reply to Luke Kenneth Casson Leighton from comment #4)
> > >   particularly as a reduction in i-cache
> > > size obeys a square law reduction in power consumption.
> > 
> > I would expect the cache power usage to be linearly proportional to it's
> > size in bits once it's big enough to split into many smaller sram chunks,
> > since each sram chunk would use about the same power and you'd only need a
> > linear number of them.
> 
> whilst the cache power usage itself is linear (that being one dimension
> of the square effect), the increased size means that on loops you cannot
> fit as much code of the average loop *into* that cache, meaning that you
> end up with more L2 cache lookups.  this i believe is where the 2nd of
> the two dimensions of the square effect comes in.

I think you might be incorrect: increasing the size of the L1 cache should increase hit rate which decreases the rate of accessing L2 and memory. Having a bigger cache allows fitting more code into the cache, reducing the miss rate. So, increasing the cache size would have a slightly less than linear increase in power consumption due to decreasing L2 accesses.
Comment 8 Luke Kenneth Casson Leighton 2020-11-22 04:07:45 GMT
https://reverseengineering.stackexchange.com/questions/21944/powerpc-toc-and-sda

DataPointer turns out to be an extension of the concept "TOC" register (r2)