we've had a request from the OpenPOWER foundation that ISA switching be a privileged operation. this is to prevent end-users from mixing and matching arbitrary assembler from multiple compilers and then expecting "vanilla" systems to support the resultant excruciatingly nauseous binaries.

a privileged operation will allow, say, a thread to be established by the kernel with the same memory space as the userspace application but with a different executable (the 3D shader compiled application).

in particular we have a bit of a problem when it comes to inlining POWER assembler with SVPrefix: POWER opcodes can be either LE or BE, and unfortunately it is the TOP 6 bits which specify the opcode.

if they are LE, it is not ok, exactly, to read 32 bits and have the decoder then work out if the opcode is actually a 16 bit Compressed instruction. that just doesn't work. it's "ok" for 48 and 64 bit SVPrefix (just a bit weird), however for C it most definitely is not.

however for BE it just so happens that those top 6 bits of a 32 bit opcode will be byte reversed (relative to LE) and so will be in the very first byte of an instruction stream.

this kind of mess is also why it should be a privileged op to do the ISA switch, really. needs thinking through, properly.
What could work is to have new ELF flags to indicate which ABI variants a particular executable, object (.o file), or dynamic library requires. The dynamic linker and standard linker would make sure that the files are compatible. This is kinda similar to what RISC-V is doing: see the e_flags field in
https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md

Defining which ISAs can link together and whether inlining can occur should be the responsibility of the toolchain and dynamic linker, rather than relying on ISA switching being a privileged operation.

What I think should happen is that the ABI requires the ISA to be set to the proper value when calling a function, and if it isn't set properly, then that's a bug in the calling function. Likewise for returning. This is similar to how the thread pointer or stack pointer must be set properly before calling a function. If it isn't set properly, then it's undefined behavior and the code is allowed to do anything (though it will probably crash).

I had been planning on having the actual RISC-V/Power ISA switching operation happen as part of switching between user and kernel modes, where all the RISC-V state is accessible in either standard Power registers or SFRs.

For switching between standard Power and Power+SVPrefix, I was hoping that all of SVPrefix would just fit in unallocated portions of the Power ISA.

For handling 16-bit instructions, would it work to instead define a pair of 16-bit instructions as a 32-bit instruction with the standard opcode field in the correct position for decoding to allow mixing with arbitrary >= 32-bit Power ISA operations?
For example, if there were the following two 16-bit instructions:

add 3, 3, 4
addi 3, 3, 12

They would both be packed into a single 32-bit instruction (denoted using braces):

{
    add 3, 3, 4
    addi 3, 3, 12
}

We wouldn't need to support all instructions, just the most common ones, and it's perfectly fine to not support particular pairs (such as if the second one can trap and the first one can't just be run again -- allowing us to not need to worry about what happens if we trap in the middle of a pair).
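A minimal sketch of the packing idea in Python. The field layout is an assumption made purely for illustration: a 6-bit major opcode in the most significant bits followed by two 13-bit payloads. `C_PAIR_OPCODE` is a placeholder value and the payloads are arbitrary numbers, not real compressed encodings.

```python
MAJOR_OPCODE_BITS = 6
PAYLOAD_BITS = 13          # (32 - 6) / 2, assuming a single reserved major opcode
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1
C_PAIR_OPCODE = 0b000101   # placeholder chosen for illustration only


def pack_pair(first: int, second: int) -> int:
    """Pack two compressed payloads into one 32-bit word.

    The major opcode stays in the top 6 bits, where a standard
    32-bit decoder already looks for it.
    """
    for payload in (first, second):
        if not 0 <= payload <= PAYLOAD_MASK:
            raise ValueError("payload does not fit in 13 bits")
    return (C_PAIR_OPCODE << 26) | (first << PAYLOAD_BITS) | second


def unpack_pair(word: int) -> tuple[int, int, int]:
    """Split a 32-bit word into (major opcode, first payload, second payload)."""
    major = (word >> 26) & 0x3F
    first = (word >> PAYLOAD_BITS) & PAYLOAD_MASK
    second = word & PAYLOAD_MASK
    return major, first, second


word = pack_pair(0x0123, 0x1ABC)
print(hex(word), unpack_pair(word))
```

Because the pair is an ordinary 32-bit word as far as fetch and alignment are concerned, it can sit between regular 32-bit instructions without any mode switch.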
(In reply to Jacob Lifshay from comment #1)

i'll mention the elf flags idea. it may help at some point if you could explain it to Tim of RaptorCS.

> For switching between standard Power and Power+SVPrefix, I was hoping that
> all of SVPrefix would just fit in unallocated portions of the Power ISA.

just. barely, yes. by using up 2 major opcodes per prefix. 2 for 48, 2 for 64 bit.

> For handling 16-bit instructions, would it work to instead define a pair of
> 16-bit instructions as a 32-bit instruction with the standard opcode field
> in the correct position for decoding to allow mixing with arbitrary >=
> 32-bit Power ISA operations?

only on BE systems because only then does the decoder see the major opcode in the 1st byte.

ok ok if we wanted to make ourselves feel like jamming our fingers down our throat and hurling, we could go:

"ok mr 32 bit LE system, we have reserved 2 major opcodes for C, and in byte 3 we have a C instruction. however there are 2 bytes left, 0 and 1, which is now the SECOND C instruction, to be executed AFTER the one in bytes 2 and 3".

if those 2 bytes are all zero that is a C NOP.

so this would mean running instructions in a weird 2 steps forward, 1 step back fashion, but would at least be LE compatible.

blech.

SVPrefix32 (C with a SVP added) is also doable btw. that just takes up 32 bits, straight.

> For example, if there were the following two 16-bit instructions:
>
> add 3, 3, 4
> addi 3, 3, 12
>
> They would both be packed into
> a single 32-bit instruction (denoted using braces):
>
> {
>     add 3, 3, 4
>     addi 3, 3, 12
> }
>
> We wouldn't need to support all instructions, just the most common ones,

yes agreed. i'd like to ask rogier bruisse to help design it, he is extremely good.

> and
> it's perfectly fine to not support particular pairs (such as if the second
> one can trap and the first one can't just be run again -- allowing us to not
> need to worry about what happens if we trap in the middle of a pair).

hmm... we might have to.
or, another way (indicating the extent of the awfulness), a bit in a SPR which is context-switched tells us which of the 2 C instructions is being executed.

or... *shudder* we have the PC advance first by 4 bytes, then jump BACK 2 bytes, then jump forward to... no, my brain just melted thinking about that one.
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #1)
> > it's perfectly fine to not support particular pairs (such as if the second
> > one can trap and the first one can't just be run again -- allowing us to not
> > need to worry about what happens if we trap in the middle of a pair).
>
> hmm... we might have to. or, another way (indicating the extent of the
> awfulness), a bit in a SPR which is contextswitched tells us which of the 2
> C instructions are being executed.

That could work, but it will require additional modification to Linux as well as userspace for things like changing the on-stack structures for asynchronous signals.

> or... *shudder* we have the PC advance first by 4 bytes, then jump BACK 2
> bytes, then jump forward to...

That would be very messy and won't work if any instructions don't start on a 4-byte boundary (can happen with 48-bit instructions), since that would be ambiguous between a misaligned 4-byte instruction and the second half of a compressed instruction pair.

> ... no, my brain just melted thinking about that
> one.

Let's all join the melted brain club!!! :P
so, um... it's just easier to use BE instruction format and "cheat" a little, because the MSBs of the 32-bit instruction, containing the major opcode, end up in the 1st byte:

byte 0:
31 30 29 28 27 26 25 24
| major opcode    | rest-of-32-bit-instruction...

byte 1:
23 22 21 20 19 18 17 16
more of 32-bit-instruction

see https://libre-riscv.org/openpower/:

if we take over 2 opcodes for each of C, SVP P48, SVP P64 and VBLOCK then we have a workaround. SVP P32 can potentially be "paged".

it's a bit of a pain that there are so few bits, due to the way that POWER was never designed for this type of thing originally.

the way VLE works is, you actually have an entire new memory page which is allocated a "format". that means that mixing 16-bit VLE and 32-bit VLE is basically impossible, and you have to jump between two completely different memory pages repeatedly just to get access to regular instructions which are not available in VLE 16-bit.
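To make the byte-order point concrete, here is a small Python sketch. It encodes `addi 3, 3, 12` as a standard 32-bit Power instruction (major opcode 14 in the top 6 bits) and looks at which byte a decoder would see first in memory. The helper name `opcode_from_first_byte` is made up for this illustration.

```python
# addi 3, 3, 12: major opcode 14, RT=3, RA=3, SI=12
ADDI_WORD = (14 << 26) | (3 << 21) | (3 << 16) | 12


def opcode_from_first_byte(raw: bytes) -> int:
    """Return the top 6 bits of the first byte in memory order.

    This is what a decoder sees if it only looks at the first byte
    fetched, before it knows the instruction length.
    """
    return raw[0] >> 2


be_bytes = ADDI_WORD.to_bytes(4, "big")
le_bytes = ADDI_WORD.to_bytes(4, "little")

print(be_bytes.hex(), opcode_from_first_byte(be_bytes))  # major opcode is visible
print(le_bytes.hex(), opcode_from_first_byte(le_bytes))  # low bits of the immediate instead
print(le_bytes[3] >> 2)                                  # in LE the opcode sits in byte 3
```

In the BE layout the first byte is enough to tell a 16-bit instruction from a 32-bit one. In the LE layout the decoder would have to fetch all 4 bytes before it could know whether it should only have fetched 2.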
(In reply to Luke Kenneth Casson Leighton from comment #4)
> so, um... it's just easier to use BE instruction format and "cheat" a
> little, because the MSBs of the 32-bit instruction, containing the major
> opcode,
> end up in the 1st byte:
>
> byte 0:
> 31 30 29 28 27 26 25 24
> | major opcode    | rest-of-32-bit-instruction...
>
> byte 1:
> 23 22 21 20 19 18 17 16
> more of 32-bit-instruction
>
> see https://libre-riscv.org/openpower/:
>
> if we take over 2 opcodes for each of C, SVP P48, SVP P64 and VBLOCK
> then we have a workaround.

If we use the same 2 C opcodes for compressed pairs, then we can end up with 13.5 bits available per instruction, rather than just 11, which allows us to define about 5.656 (!) times as many compressed instructions.

13.5 = (32 - 6 + 1) / 2
11 = 16 - 6 + 1
5.656... = 2 ^ (13.5 - 11)

> SVP P32 can potentially be "paged".
>
> it's a bit of a pain that there's so few bits, due to the way that POWER
> was never designed for this type of thing originally.
>
> the way VLE works is, you actually have an entire new memory page which
> is allocated a "format". that means that mixing 16-bit VLE and 32-bit
> VLE is basically impossible, and you have to jump between two completely
> different memory pages repeatedly just to get access to regular
> instructions which are not available in VLE 16-bit.

VLE requiring separate pages seems like a total mess that we should not emulate.
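The arithmetic above can be checked with a few lines of Python. The only assumption is the reading of the "+ 1": reserving 2 major opcodes gives back log2(2) = 1 bit of encoding space. The variable names are illustrative.

```python
import math

WORD_BITS = 32          # a compressed pair occupies one 32-bit word
HALF_BITS = 16          # a standalone compressed instruction
MAJOR_OPCODE_BITS = 6   # Power major opcode field
RESERVED_OPCODES = 2    # major opcodes set aside for C

# Choosing between 2 reserved opcodes carries 1 bit of information.
opcode_choice_bits = math.log2(RESERVED_OPCODES)

bits_per_paired = (WORD_BITS - MAJOR_OPCODE_BITS + opcode_choice_bits) / 2
bits_per_single = HALF_BITS - MAJOR_OPCODE_BITS + opcode_choice_bits
encoding_ratio = 2 ** (bits_per_paired - bits_per_single)

print(bits_per_paired, bits_per_single, round(encoding_ratio, 3))
```

The ratio is 2^2.5, i.e. 4 times the square root of 2, which is where the 5.656 figure comes from.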
(In reply to Jacob Lifshay from comment #5)
> If we use the same 2 C opcodes for compressed pairs, then we can end up with
> 13.5 bits available per instruction, rather than just 11, which allows us to
> define about 5.656 (!) times as many compressed instructions.
>
> 13.5 = (32 - 6 + 1) / 2
> 11 = 16 - 6 + 1
> 5.656... = 2 ^ (13.5 - 11)

another idea along this theme is to use that initial 32-bit space to specify how many of the next instructions are to be encoded as C.

i'd recommend reserving the option to do this, as it involves storing state (the countdown timer), whereas just having 2 C opcodes as pairs is dead simple, as long as you accept that the PC encodes the state information: if the PC is on a 2-byte boundary you're in the middle of the pair.

this allows the PC to store the state if there is a trap.

> VLE requiring separate pages seems like a total mess that we should not
> emulate.

i only looked at it briefly so we can't write it off entirely, however... yeah :)
Storing state sounds dangerous, consider someone jumping into the middle of compressed instrs.
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
>
> > If we use the same 2 C opcodes for compressed pairs, then we can end up with
> > 13.5 bits available per instruction, rather than just 11, which allows us to
> > define about 5.656 (!) times as many compressed instructions.
> >
> > 13.5 = (32 - 6 + 1) / 2
> > 11 = 16 - 6 + 1
> > 5.656... = 2 ^ (13.5 - 11)
>
> other ideas along this theme include saying, in that initial 32-bit space,
> is to specify how many of the next instructions are to be encoded as C.
>
> i'd recommend to reserve the option to do this, as it involves storing state
> (the countdown timer) whereas just having 2 C opcodes as pairs is dead
> simple, as long as you accept that the PC encodes the state information
> about the fact that if the PC is on a 2-byte boundary you're in the middle
> of the pair.
>
> this allows the PC to store the state if there is a trap.

Except, like I mentioned before, using the PC to store that info makes it ambiguous between executing the second instruction in a pair and executing a non-paired instruction that happens to start at a 2-byte boundary because there was a preceding 48-bit instruction. So, we'll either have to not have any 48-bit instructions, not trap in the middle of a pair, or not have paired instructions.

Switching instruction memory to always be in BE mode is not a good solution since that is not compatible with standard Power code that doesn't know about the endian switch.
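The ambiguity can be shown with a toy instruction stream in Python. The stream contents and the kind labels (`c_first`, `c_second`, `p48`, `op32`) are invented for illustration. The point is only that two different kinds of instruction can both start at an address with `pc % 4 == 2`.

```python
def layout(stream, base=0):
    """Return (address, kind) for each instruction in the stream.

    `stream` is a list of (kind, length_in_bytes) tuples. A compressed
    pair is listed as its two 16-bit halves.
    """
    placed = []
    pc = base
    for kind, length in stream:
        placed.append((pc, kind))
        pc += length
    return placed


def kinds_on_half_word(stream):
    """Kinds of instruction that start on a 2-byte but not 4-byte boundary."""
    return [kind for pc, kind in layout(stream) if pc % 4 == 2]


stream = [
    ("c_first", 2),   # first half of a compressed pair, at address 0
    ("c_second", 2),  # second half of the pair, at address 2
    ("p48", 6),       # 48-bit prefixed instruction, at address 4
    ("op32", 4),      # ordinary 32-bit instruction, at address 10
]

print(layout(stream))
print(kinds_on_half_word(stream))
```

Both `c_second` (address 2) and `op32` (address 10) show up, so a trap handler that only has the saved PC cannot tell which decoding to resume with.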
(In reply to cand from comment #7)
> Storing state sounds dangerous, consider someone jumping into the middle of
> compressed instrs.

Just as dangerous as someone jumping into the middle of an x86 instruction. It's not at all a problem for compilers, since they can arrange for all branch targets to not be in the middle of compressed instructions.
(In reply to cand from comment #7)
> Storing state sounds dangerous, consider someone jumping into the middle of
> compressed instrs.

this is precisely what a context-switch has to be able to do, because a trap may occur in the middle of a pair of C instrs.

i very carefully designed VBLOCK around precisely this principle, involving state information that is required to be context-switched.

and yes, if an end-user writes code that jumps into the middle of a block without respecting the specification, they're on their own.
(In reply to Jacob Lifshay from comment #8)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > (In reply to Jacob Lifshay from comment #5)
> >
> > > If we use the same 2 C opcodes for compressed pairs, then we can end up with
> > > 13.5 bits available per instruction, rather than just 11, which allows us to
> > > define about 5.656 (!) times as many compressed instructions.
> > >
> > > 13.5 = (32 - 6 + 1) / 2
> > > 11 = 16 - 6 + 1
> > > 5.656... = 2 ^ (13.5 - 11)
> >
> > other ideas along this theme include saying, in that initial 32-bit space,
> > is to specify how many of the next instructions are to be encoded as C.
> >
> > i'd recommend to reserve the option to do this, as it involves storing state
> > (the countdown timer) whereas just having 2 C opcodes as pairs is dead
> > simple, as long as you accept that the PC encodes the state information
> > about the fact that if the PC is on a 2-byte boundary you're in the middle
> > of the pair.
> >
> > this allows the PC to store the state if there is a trap.
>
> Except, like I mentioned before, using the PC to store that info makes it
> ambiguous between executing the second instruction in a pair and executing a
> non-paired instruction that happens to start at a 2-byte boundary because
> there was a preceding 48-bit instruction.

*click* yes got it now. ok so a bit (a "length of number of C instructions to be decoded" where that length happens to be 1) is a solution there.

> So, we'll either have to not have
> any 48-bit instructions, not trap in the middle of a pair, or not have
> paired instructions.

storing an extra bit (or bits) in a SPR - particularly one that already has to be saved - isn't so onerous.

> Switching instruction memory to always be in BE mode is not a good solution
> since that is not compatible with standard Power code that doesn't know
> about the endian switch.
well, luckily, if we accept the C-pair idea has to have at least a length of "1 bit" and that is stored in a context-switched SPR, it's ok. later the length can be extended to say 3 or 4 bits. or, heck, we might as well just allow 3-4 bits anyway.
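A rough Python sketch of how a context-switched count could disambiguate the saved PC. Everything here is hypothetical: the `SavedState` fields, the name `c_remaining`, and the convention that it counts the 16-bit instructions still to be decoded as C starting at `pc`. No actual SPR layout is implied.

```python
from dataclasses import dataclass

C_COUNT_BITS = 1  # 1 bit covers a pair; 3 or 4 bits would allow longer runs


@dataclass(frozen=True)
class SavedState:
    """State saved on a trap: the PC plus a hypothetical SPR field."""
    pc: int
    c_remaining: int = 0  # 16-bit C instructions still to decode at pc


def decode_kind(state: SavedState) -> str:
    """Decide how to decode at state.pc when resuming after a trap."""
    return "compressed" if state.c_remaining > 0 else "normal"


def retire_first_of_pair(state: SavedState) -> SavedState:
    """After the first half of a pair: one C instruction is left at pc + 2."""
    return SavedState(state.pc + 2, 1)


def retire_compressed(state: SavedState) -> SavedState:
    """After a pending C instruction: advance and count down."""
    return SavedState(state.pc + 2, state.c_remaining - 1)


mid_pair = retire_first_of_pair(SavedState(pc=0))
after_48bit = SavedState(pc=10)  # 32-bit instruction following a 48-bit one

print(mid_pair, decode_kind(mid_pair))
print(after_48bit, decode_kind(after_48bit))
print(retire_compressed(mid_pair))
```

Both saved PCs sit on a 2-byte boundary, but the count tells the two cases apart. Widening `C_COUNT_BITS` later only changes the maximum value the field can hold, not the resume logic.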