Wasn't able to find the previous bug for this, so opening a new one. http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001173.html https://groups.google.com/g/comp.arch/c/o_TbFGhgokA/m/1eucHw4oAwAJ
Basic idea: get around powerpc opcode being in 3rd byte by conceptually swapping the order in which the 16-bit halves of each 32-bit word are read from instruction memory.

Read bytes in the zero-based order:
2,3, 0,1, 6,7, 4,5, 10,11, 8,9, ...

Diagram:

BE version:
bytes 0,1: 16-bit instruction -> executed 1st
bytes 2,3: msb 2-bytes of 32-bit instruction -+-> executed 2nd
bytes 4,5: lsb 2-bytes of 32-bit instruction -+
bytes 6,7: 16-bit instruction -> executed 3rd
bytes 8-11: 32-bit instruction -> executed 4th

LE version:
bytes 0,1: msb 2-bytes of 32-bit instruction --+
bytes 2,3: 16-bit instruction -> executed 1st  +-> executed 2nd
bytes 4,5: 16-bit instruction -> executed 3rd  |
bytes 6,7: lsb 2-bytes of 32-bit instruction --+
bytes 8-11: 32-bit instruction -> executed 4th
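
The read order above can be sketched in a few lines of Python. This is only an illustration of the halfword swap, not anything from the wiki page or the simulator: the function names `hword_swapped_order` and `swap_hwords` are made up here, and the sketch assumes the buffer length is a multiple of 4 bytes.

```python
def hword_swapped_order(nbytes):
    """Return byte offsets in the order they are conceptually read.

    Within every aligned 32-bit word the two 16-bit halves are swapped,
    giving 2,3, 0,1, 6,7, 4,5, ... (hypothetical helper, for illustration).
    """
    if nbytes % 4:
        raise ValueError("nbytes must be a multiple of 4")
    order = []
    for word in range(0, nbytes, 4):
        # second halfword of the word first, then the first halfword
        order += [word + 2, word + 3, word + 0, word + 1]
    return order


def swap_hwords(mem):
    """Apply the halfword swap to a bytes object (illustrative only)."""
    return bytes(mem[i] for i in hword_swapped_order(len(mem)))


if __name__ == "__main__":
    print(hword_swapped_order(12))
```

Applying `swap_hwords` twice gives back the original bytes, which is one reason the swap can be hidden in the fetch path without the decoder needing to know about it.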
(In reply to Jacob Lifshay from comment #0)
> Wasn't able to find the previous bug for this, so opening a new one.

thanks jacob. i vaguely recall the discussion, i can't remember where. on seeing the explanation clearly laid out on comp.arch i think it's a really good solution because it doesn't interfere or get confused with OpenPOWER LE/BE mode. i mean, it *is* LE/BE swapping, but at the 2x16-bit granularity level where LE/BE is normally byte-level swapping.

i will add a stub wiki page for this, linked off of openpower/sv
https://libre-soc.org/openpower/sv/major_opcode_allocation/

it's complicated but quite elegant, i like it. i am trying to think through how mixed 16/32/48 would actually work. it would be necessary i think to pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of the queue.
(In reply to Luke Kenneth Casson Leighton from comment #3)
> https://libre-soc.org/openpower/sv/major_opcode_allocation/
>
> it's complicated but quite elegant, i like it. i am trying to think through
> how mixed 16/32/48 would actually work. it would be necessary i think to
> pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of
> the queue.

does this work? are there any cases where conceptually swapping bytes 0+1 with 2+3 at the memory level would not work? mixing 16/32/48 together?
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > https://libre-soc.org/openpower/sv/major_opcode_allocation/

The explanation on the wiki page seems quite a bit less general (limiting alignment) than what I was envisioning:

I was thinking of conceptually the instruction stream would just be a stream of aligned 16-bit chunks which are decoded into *totally-unaligned* 16/32/48/64-bit instructions by combining 1/2/3/4 chunks in the conceptual sequence.

All different instruction sizes can be arbitrarily interleaved.

The only twists are:
- that the 16-bit chunks are laid out oddly in LE mode for backward compatibility.
- that jumps/branches/returns/calls can only branch to 32-bit aligned addresses, so the branch targets need to be aligned by either using a larger equivalent instruction (preferred) or inserting NOPs. interrupt/exception returns *can* branch to 16-bit aligned addresses, however, since that's needed for preemptive context switching.

> > it's complicated but quite elegant, i like it. i am trying to think through
> > how mixed 16/32/48 would actually work. it would be necessary i think to
> > pop 16 bit instructions out of bytes 2+3 and leave 0+1 still at the front of
> > the queue.
>
> does this work? are there any cases where conceptually swapping bytes 0+1
> with 2+3 at the memory level would not work? mixing 16/32/48 together?

it would require some additional thought, but I think it probably would.

We would need to decide what to do for PC-relative instructions, do we include the second-from-lsb in the visible PC or not?
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > (In reply to Luke Kenneth Casson Leighton from comment #3)
> > > https://libre-soc.org/openpower/sv/major_opcode_allocation/
>
> The explanation on the wiki page seems quite a bit less general (limiting
> alignment) than what I was envisioning:

yes i misunderstood (but accidentally came up with an alternative, which is slightly more complex i.e. involves a queue and needs to be able to "take" from the 1st 2 entries rather than always take from the front)

> I was thinking of conceptually the instruction stream would just be a stream
> of aligned 16-bit chunks which are decoded into *totally-unaligned*
> 16/32/48/64-bit instructions by combining 1/2/3/4 chunks in the conceptual
> sequence.

as 2/1/4/3/6/5 order (in 16-bit chunks).

> All different instruction sizes can be arbitrarily interleaved.
>
> The only twists are:
> - that the 16-bit chunks are laid out oddly in LE mode for backward
> compatibility.
> - that jumps/branches/returns/calls can only branch to 32-bit aligned
> addresses, so the branch targets need to be aligned by either using a larger
> equivalent instruction (preferred) or inserting NOPs.

yeah we're not going to modify PowerISA to add the extra bit to target jumps at the 16-bit level. bit of a pain.

> interrupt/exception
> returns *can* branch to 16-bit aligned addresses, however, since that's
> needed for preemptive context switching.

as long as the full CIA/NIA is stored (and restored), yes.

> it would require some additional thought, but I think it probably would.

ultimately though it comes down to which takes more gates. implicit hword-swapping (hidden from the actual instruction decoder) seems a lot simpler.

> We would need to decide what to do for PC-relative instructions, do we
> include the second-from-lsb in the visible PC or not?

urr that's a wrinkle.
ok p37 v3.0B "branch" pseudocode:

if AA then NIA <-iea EXTS(LI || 0b00)
else NIA <-iea CIA + EXTS(LI || 0b00)
if LK then LR <-iea CIA + 4

the assumption is always that the CIA is word-aligned. that means that any computations, if they start from a non-word-aligned point, will stay at a non-word-aligned point. nuts.

ok so here's two options:

* all 32-bit branches (and SV-P48/64 ones) start at word-aligned boundaries OR that they're *assumed* to start at the word-aligned boundary and then:
* that we design some 16-bit instructions which can be hword-aligned

however, the calculation of LR is definitely no longer "CIA+4", it's going to be "CIA+len(instruction)" which is variable. so for example, b/ba/bl/bla would become:

if AA then NIA <-iea EXTS(LI || 0b00)
else NIA <-iea CIA + EXTS(LI || 0b00)
NIA[0:2] = 0b00 # set 2 LSBs to zero
if LK then LR <-iea CIA + len(current_instruction)

this shouuuld be ok... and 16-bit branch instructions, although there will be far less space for an immediate, would be hword-aligned, taking care of being able to jump at 16-bit granularity.

TAR (ignoring the 2 LSBs) also needs to be evaluated (ignore only 1 LSB?) however we'd need to find out if the 2 LSBs are actually used by any compilers (p32 2.3.4)
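
to make the modified b/ba/bl/bla pseudocode concrete, here is a rough Python model of it. this is a sketch under assumptions, not the simulator's code: the names `exts` and `branch` are invented for this example, the effective-address truncation implied by "<-iea" is modelled as plain 64-bit masking, and `insn_len` (bytes occupied by the current instruction, e.g. 4 for plain 32-bit or 6 for SV-P48) has to be supplied by the caller.

```python
MASK64 = (1 << 64) - 1


def exts(value, bits):
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch(cia, li, aa, lk, insn_len):
    """Model of the proposed b/ba/bl/bla behaviour (illustrative only).

    cia      -- current instruction address, may be only hword-aligned
    li       -- 24-bit LI field of the branch
    insn_len -- length in bytes of the current instruction (assumption:
                provided by the decoder)
    Returns (NIA, LR) where LR is None if LK is not set.
    """
    offset = exts(li << 2, 26)          # EXTS(LI || 0b00)
    nia = offset if aa else cia + offset
    nia &= MASK64 & ~0b11               # set 2 LSBs to zero
    lr = (cia + insn_len) & MASK64 if lk else None
    return nia, lr


if __name__ == "__main__":
    # bl from an hword-aligned 16-bit-prefixed position: the target is
    # forced to a word boundary, LR is CIA + len(instruction)
    print([hex(v) for v in branch(0x1002, 4, False, True, 2)])
```

the point of the sketch is only that the target ends up word-aligned whatever the alignment of CIA, while LR follows the real instruction length.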
i updated the wiki page. what happens with these 3 instructions?

* 1 16 bit
* 2 32 bit
* 3 32 bit

byte 0 1      2 3
     2 HI32   1 16
     3 HI32   2 LO32
     ...      3 LO32

this would be reordered:

1 16 | 2 HI32 LO32 | 3 HI32 LO32

and then... because the 1st one is 16bit it's processed and then you look at HI32 and find that from the Major Opcode it's 32 bit

same for SV-P48

i think this will work just fine.
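
the same walk-through as a small Python sketch, using labelled 16-bit chunks instead of real encodings. everything here is illustrative: `swap_chunks`, `chunks_needed` and `decode` are names made up for this example, and `chunks_needed` simply looks at the label where real hardware would look at the Major Opcode.

```python
def swap_chunks(chunks):
    """Swap the two 16-bit chunks inside every 32-bit word."""
    if len(chunks) % 2:
        raise ValueError("need a whole number of 32-bit words")
    out = []
    for i in range(0, len(chunks), 2):
        out += [chunks[i + 1], chunks[i]]
    return out


def chunks_needed(first_chunk):
    """Hypothetical stand-in for inspecting the Major Opcode.

    A chunk labelled HI32 starts a 32-bit instruction (2 chunks),
    anything else is treated as a 16-bit instruction (1 chunk).
    """
    return 2 if first_chunk.endswith("HI32") else 1


def decode(chunks):
    """Group a reordered chunk stream into instructions."""
    insns = []
    pos = 0
    while pos < len(chunks):
        n = chunks_needed(chunks[pos])
        insns.append(tuple(chunks[pos:pos + n]))
        pos += n
    return insns


# chunks in memory order: bytes 0-1, 2-3, 4-5, 6-7, 8-9, 10-11
memory = ["2 HI32", "1 16", "3 HI32", "2 LO32", "...", "3 LO32"]
stream = swap_chunks(memory)
decoded = decode(stream)

if __name__ == "__main__":
    print(stream)
    print(decoded)
```

the "..." chunk stands for whatever follows instruction 3 and is deliberately left as a placeholder.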
that's interesting. VLE was specified as only possible in BE instruction encoding mode.

http://application-notes.digchip.com/314/314-68105.pdf

(note: "instruction BE encoding mode" is not the same as "BE data encoding mode")
in this comment https://bugs.libre-soc.org/show_bug.cgi?id=238#c70 we learn that the intention is for 32 bit instructions to look like LE even when embedded in 48/64 SV Prefixes.

however "looking like ppc64le ABI" is not the same as "can be executed and is fully conformant with the ppc64le ABI".

can we therefore go through it, with some bare minimum worked examples that show, step by step, the transition to calling and returning from standard ppc64le ABI functions?

i have a sneaking suspicion that we may need an instruction that enables/disables SV on entry and exit from functions, which is going to be a costly overhead.

secondly: are we overthinking this? the primary motivation is to accelerate video and 3D fragments. a full (scalar, standard v3.0 compiled) OS is far down the list.
rrright. nooow the VLE 64k pages being marked as such makes sense. the alternative encoding is an implicit marker that completely separates standard v3.0B ppc64le ABI code from VLE/Compressed/SVPrefix code, leaving the opportunity to have data entirely LE. calls from one to the other are not a problem and do not need a mode-switching instruction because the *page bit* is that marker.