Bug 529 - scheme for supporting 16/48-bit instructions on PowerPC LE with full backward compatibility
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: All All
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2020-11-11 02:46 GMT by Jacob Lifshay
Modified: 2020-11-23 13:46 GMT (History)

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
Description Jacob Lifshay 2020-11-11 02:46:50 GMT
Wasn't able to find the previous bug for this, so opening a new one.

http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001173.html

https://groups.google.com/g/comp.arch/c/o_TbFGhgokA/m/1eucHw4oAwAJ
Comment 1 Jacob Lifshay 2020-11-11 02:56:57 GMT
Basic idea: get around the PowerPC opcode being in the 3rd byte by conceptually swapping the order in which the 16-bit halves of each 32-bit word are read from instruction memory. Read bytes in the zero-based order:
2,3, 0,1,  6,7, 4,5,  10,11, 8,9,  ...

Diagram:
BE version:
bytes 0,1: 16-bit instruction -> executed 1st
bytes 2,3: msb 2-bytes of 32-bit instruction -+-> executed 2nd
bytes 4,5: lsb 2-bytes of 32-bit instruction -+
bytes 6,7: 16-bit instruction -> executed 3rd
bytes 8-11: 32-bit instruction -> executed 4th

LE version:
bytes 0,1: msb 2-bytes of 32-bit instruction --+
bytes 2,3: 16-bit instruction -> executed 1st  +-> executed 2nd
bytes 4,5: 16-bit instruction -> executed 3rd  |
bytes 6,7: lsb 2-bytes of 32-bit instruction --+
bytes 8-11: 32-bit instruction -> executed 4th
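
The read order above can be sketched in a few lines of Python. This is only an illustration of the conceptual halfword swap, not an implementation from the spec; the function names `le_fetch_order` and `le_to_conceptual` are made up for this example.

```python
def le_fetch_order(num_bytes):
    """Return the byte addresses in the order they are conceptually read.

    Within every aligned 32-bit word, the upper halfword (bytes 2,3)
    is read before the lower halfword (bytes 0,1).
    """
    if num_bytes % 4:
        raise ValueError("num_bytes must be a multiple of 4")
    order = []
    for word in range(0, num_bytes, 4):
        order += [word + 2, word + 3, word + 0, word + 1]
    return order


def le_to_conceptual(mem):
    """Reorder raw instruction memory into the conceptual 16-bit chunk stream."""
    return bytes(mem[i] for i in le_fetch_order(len(mem)))


print(le_fetch_order(12))
```

Applying the same swap twice gives back the original bytes, so the same helper also maps the conceptual stream back to the memory layout.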
Comment 2 Luke Kenneth Casson Leighton 2020-11-11 21:48:58 GMT
(In reply to Jacob Lifshay from comment #0)
> Wasn't able to find the previous bug for this, so opening a new one.

thanks jacob.  i vaguely recall the discussion, i can't remember where.  on seeing the explanation clearly laid out on comp.arch i think it's a really good solution because it doesn't interfere or get confused with OpenPOWER LE/BE mode.

i mean, it *is* LE/BE swapping, but at the 2x16-bit granularity level where LE/BE is normally byte-level swapping.

i will add a stub wiki page for this, linked off of openpower/sv
Comment 3 Luke Kenneth Casson Leighton 2020-11-11 22:27:15 GMT
https://libre-soc.org/openpower/sv/major_opcode_allocation/

it's complicated but quite elegant, i like it.  i am trying to think through how mixed 16/32/48 would actually work.  it would be necessary i think to pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of the queue.
Comment 4 Luke Kenneth Casson Leighton 2020-11-11 22:34:19 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)
> https://libre-soc.org/openpower/sv/major_opcode_allocation/
> 
> it's complicated but quite elegant, i like it.  i am trying to think through
> how mixed 16/32/48 would actually work.  it would be necessary i think to
> pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of
> the queue.

does this work? are there any cases where conceptually swapping bytes 0+1 with 2+3 at the memory level would not work? mixing 16/32/48 together?
Comment 5 Jacob Lifshay 2020-11-12 09:11:04 GMT
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > https://libre-soc.org/openpower/sv/major_opcode_allocation/

The explanation on the wiki page seems quite a bit less general (limiting alignment) than what I was envisioning:

I was thinking that, conceptually, the instruction stream would just be a stream of aligned 16-bit chunks which are decoded into *totally-unaligned* 16/32/48/64-bit instructions by combining 1/2/3/4 chunks in the conceptual sequence. All different instruction sizes can be arbitrarily interleaved.

The only twists are:
- that the 16-bit chunks are laid out oddly in LE mode for backward compatibility.
- that jumps/branches/returns/calls can only branch to 32-bit aligned addresses, so the branch targets need to be aligned by either using a larger equivalent instruction (preferred) or inserting NOPs. interrupt/exception returns *can* branch to 16-bit aligned addresses, however, since that's needed for preemptive context switching.
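
A minimal sketch of that chunk-combining step, assuming the chunks have already been put into conceptual order. How the length is derived from the first chunk is not decided in this thread, so `toy_length` below is a purely hypothetical stand-in (top two bits of the first chunk), and `split_instructions` is an illustrative name.

```python
def split_instructions(chunks, length_in_chunks):
    """Group a conceptual stream of 16-bit chunks into instructions.

    length_in_chunks(first_chunk) must return 1, 2, 3 or 4, i.e. a
    16/32/48/64-bit instruction.  No alignment is required.
    """
    insns = []
    i = 0
    while i < len(chunks):
        n = length_in_chunks(chunks[i])
        if n not in (1, 2, 3, 4):
            raise ValueError("bad instruction length")
        if i + n > len(chunks):
            raise ValueError("truncated instruction at end of stream")
        insns.append(tuple(chunks[i:i + n]))
        i += n
    return insns


def toy_length(first_chunk):
    # HYPOTHETICAL encoding, for illustration only:
    # top 2 bits of the first chunk hold (length in chunks - 1).
    return (first_chunk >> 14) + 1


stream = [0x0001, 0x4002, 0x0003, 0x8004, 0x0005, 0x0006, 0x0007]
print(split_instructions(stream, toy_length))
```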

> > it's complicated but quite elegant, i like it.  i am trying to think through
> > how mixed 16/32/48 would actually work.  it would be necessary i think to
> > pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of
> > the queue.
> 
> does this work? are there any cases where conceptually swapping bytes 0+1
> with 2+3 at the memory level would not work? mixing 16/32/48 together?

it would require some additional thought, but I think it probably would.

We would need to decide what to do for PC-relative instructions: do we include the second-from-LSB bit in the visible PC or not?
Comment 6 Luke Kenneth Casson Leighton 2020-11-12 13:55:09 GMT
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > (In reply to Luke Kenneth Casson Leighton from comment #3)
> > > https://libre-soc.org/openpower/sv/major_opcode_allocation/
> 
> The explanation on the wiki page seems quite a bit less general (limiting
> alignment) than what I was envisioning:

yes i misunderstood (but accidentally came up with an alternative, which
is slightly more complex i.e. involves a queue and needs to be able to
"take" from the 1st 2 entries rather than always take from the front)

> I was thinking of conceptually the instruction stream would just be a stream
> of aligned 16-bit chunks which are decoded into *totally-unaligned*
> 16/32/48/64-bit instructions by combining 1/2/3/4 chunks in the conceptual
> sequence. 

as 2/1/4/3/6/5 order (in 16-bit chunks).

> All different instruction sizes can be arbitrarily interleaved.
> 
> The only twists are:
> - that the 16-bit chunks are laid out oddly in LE mode for backward
> compatibility.
> - that jumps/branches/returns/calls can only branch to 32-bit aligned
> addresses, so the branch targets need to be aligned by either using a larger
> equivalent instruction (preferred) or inserting NOPs.

yeah we're not going to modify PowerISA to add the extra bit to target
jumps at the 16-bit level.  bit of a pain.

> interrupt/exception
> returns *can* branch to 16-bit aligned addresses, however, since that's
> needed for preemptive context switching.

as long as the full CIA/NIA is stored (and restored), yes.

> it would require some additional thought, but I think it probably would.

ultimately though it comes down to which takes more gates.  implicit
hword-swapping (hidden from the actual instruction decoder) seems a lot
simpler.
 
> We would need to decide what to do for PC-relative instructions, do we
> include the second-from-lsb in the visible PC or not?

urr that's a wrinkle.  ok p37 v3.0B "branch" pseudocode:

    if AA then NIA  <-iea EXTS(LI || 0b00)
    else       NIA  <-iea CIA + EXTS(LI || 0b00)
    if LK then LR <-iea  CIA + 4

the assumption is always that the CIA is word-aligned.  that means that
any computations, if they start from a non-word-aligned point, will stay
at a non-word-aligned point.

nuts.

ok so here's two options:

* all 32-bit branches (and SV-P48/64 ones) start at word-aligned boundaries
  OR
  that they're *assumed* to start at the word-aligned boundary

and then:

* that we design some 16-bit instructions which can be hword-aligned

however, the calculation of LR is definitely no longer "CIA+4", it's going
to be "CIA+len(instruction)" which is variable.

so for example, b/ba/bl/bla would become:

    if AA then NIA  <-iea EXTS(LI || 0b00)
    else       NIA  <-iea CIA + EXTS(LI || 0b00)
    NIA[0:2] = 0b00 # set 2 LSBs to zero
    if LK then LR <-iea  CIA + len(current_instruction)

this shouuuld be ok... and 16-bit branch instructions, although there
will be far less space for an immediate, would be hword-aligned, taking
care of being able to jump at 16-bit granularity.
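
As a rough check of the modified b/ba/bl/bla pseudocode, here is a small Python model. It is a sketch under assumptions: a 64-bit effective address, a 24-bit LI field, and the 2 LSBs of NIA forced to zero as in the comment above. The helper names `exts` and `branch` are invented for this example.

```python
MASK64 = (1 << 64) - 1


def exts(value, bits):
    """Sign-extend a 'bits'-wide value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch(cia, li, aa, lk, insn_len):
    """Model of the proposed b/ba/bl/bla behaviour.

    cia      -- current instruction address (may be only hword-aligned)
    li       -- 24-bit LI field
    insn_len -- length in bytes of the branch instruction itself
    Returns (nia, lr); lr is None when LK is not set.
    """
    offset = exts((li & 0xFFFFFF) << 2, 26)   # EXTS(LI || 0b00)
    nia = offset if aa else cia + offset
    nia &= MASK64 & ~0b11                     # set 2 LSBs to zero
    lr = (cia + insn_len) & MASK64 if lk else None
    return nia, lr


# a 32-bit 'bl' sitting at a hword-aligned (not word-aligned) address
nia, lr = branch(0x1002, 4, aa=False, lk=True, insn_len=4)
print(hex(nia), hex(lr))
```

Note how the target is pulled back to a word boundary while LR keeps the hword-aligned return address, which is why the full CIA/NIA has to be stored and restored.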

TAR (ignoring the 2 LSBs) also needs to be evaluated (ignore only 1 LSB?)
however we'd need to find out if the 2 LSBs are actually used by any
compilers (p32 2.3.4)
Comment 7 Luke Kenneth Casson Leighton 2020-11-13 00:20:22 GMT
i updated the wiki page.

what happens with these 3 instructions?

* 1 16 bit
* 2 32 bit
* 3 32 bit

byte 0    1    2    3
     2 HI32    1 16
     3 HI32    2 LO32
     ...       3 LO32

this would be reordered:

1 16 | 2 HI32 LO32 | 3 HI32 LO32

and then...  because the 1st one is 16bit it's processed and then you look at HI32 and find that from the Major Opcode it's 32 bit

same for SV-P48

i think this will work just fine.
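
The three-instruction example can be reproduced going the other way, from program order to memory layout. This is only a sketch: the labels are plain strings, "..." stands for whatever follows, and `layout_le` is a name made up here.

```python
def layout_le(hwords, pad="..."):
    """Place a program-order list of 16-bit chunks into LE memory rows.

    Each row is one 32-bit word as (bytes 0-1, bytes 2-3).  The chunk
    that executes first within a word goes into bytes 2-3.
    """
    padded = list(hwords)
    if len(padded) % 2:
        padded.append(pad)          # fill the last half-used word
    rows = []
    for i in range(0, len(padded), 2):
        first, second = padded[i], padded[i + 1]
        rows.append((second, first))
    return rows


program = ["1 16", "2 HI32", "2 LO32", "3 HI32", "3 LO32"]
for low_bytes, high_bytes in layout_le(program):
    print(f"{low_bytes:<9} {high_bytes}")
```

Swapping each row back gives "1 16 | 2 HI32 LO32 | 3 HI32 LO32", the reordered sequence shown above.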
Comment 8 Luke Kenneth Casson Leighton 2020-11-14 23:10:13 GMT
that's interesting.  VLE was specified as only possible in BE instruction
encoding mode.  http://application-notes.digchip.com/314/314-68105.pdf

(note: "instruction BE encoding mode" is not the same as "BE data encoding mode")
Comment 9 Alexandre Oliva 2020-11-22 19:14:11 GMT

    
Comment 10 Alexandre Oliva 2020-11-22 20:15:10 GMT

    
Comment 11 Luke Kenneth Casson Leighton 2020-11-23 13:14:02 GMT
in this comment
https://bugs.libre-soc.org/show_bug.cgi?id=238#c70

we learn that the intention is for 32 bit instructions to look like LE even when embedded in 48/64 SV Prefixes.

however "looking like ppc64le ABI" is not the same as "can be executed and is fully conformant with the ppc64le ABI".

can we therefore go through it, with some bare minimum worked examples that show, step by step, the transition to calling and returning from standard ppc64le ABI functions?

i have a sneaking suspicion that we may need an instruction that enables/disables SV on entry and exit from functions, which is going to be a costly overhead.

secondly: are we overthinking this? the primary motivation is to accelerate video and 3D fragments.  a full (scalar, standard v3.0 compiled) OS is far down the list.
Comment 12 Luke Kenneth Casson Leighton 2020-11-23 13:46:12 GMT
rrright.  nooow the VLE 64k pages being marked as such makes sense.

the alternative encoding is an implicit marker that completely separates standard v3.0B ppc64le ABI code from VLE/Compressed/SVPrefix code, leaving the opportunity to have data entirely LE.

calls from one to the other are not a problem and do not need a mode-switching instruction because the *page bit* is that marker.
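
A toy model of the page-bit idea, to show why no mode-switching instruction is needed: the decoder choice depends only on which page the fetch address falls in. Everything here is hypothetical (the 64k page size is taken from the VLE remark above, and `alt_pages`, `encoding_for` and the mode strings are invented for illustration).

```python
PAGE_SIZE = 64 * 1024   # assumption: 64k pages, as in the VLE remark


def encoding_for(addr, alt_pages):
    """Pick the instruction encoding from the page the address is in.

    alt_pages is a set of page numbers whose (hypothetical) page bit
    marks them as VLE/Compressed/SVPrefix code.
    """
    if addr // PAGE_SIZE in alt_pages:
        return "alternative"
    return "v3.0B"


alt_pages = {2}                  # page 2 holds alternative-encoding code
caller = 0x0001_0040             # page 1: standard ppc64le code
callee = 0x0002_0000             # page 2: alternative encoding
print(encoding_for(caller, alt_pages), "->", encoding_for(callee, alt_pages))
```

A call from `caller` to `callee` changes the decoding simply because the target address is in a marked page; returning reverses it the same way.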