RFC for crternlogi, ternlogi, binlut and crbinlut https://libre-soc.org/openpower/sv/bitmanip/#index4h1
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Tue Mar 7 14:15:35 2023 +0000 add ls007 stub RFC bug https://bugs.libre-soc.org/show_bug.cgi?id=1017 https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=c3bbcb1791e16b223a9ce9d2551cee6af2dfbdbf
Author: Jacob Lifshay <programmerjake@gmail.com> Date: Thu Mar 9 17:29:13 2023 -0800 WIP adding ternlogi to ls007 https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=abe8994372114774e5b315bbc4ff0cd1c0d36345 added ternlogi, still need to add crternlogi and [cr]binlog
> TODO: should we use imm7 instead of imm8 for `ternlogi`? This wouldn't make
> the decoder significantly more complex, since the immediate doesn't affect
> whether or not the instruction is defined.
> <https://libre-soc.org/openpower/sv/bitmanip/ternlogi_simplification_experiment/>

i already said no and i already explained why not.
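For readers unfamiliar with ternary-logic instructions, here is a minimal Python sketch of why the immediate is 8 bits wide: a 3-input boolean function has 2^3 = 8 truth-table entries, one per immediate bit. The function name `ternlog_sketch`, the operand order, and the bit numbering are assumptions made for illustration only; the RFC text is the authority on the real encoding.

```python
def ternlog_sketch(a: int, b: int, c: int, imm8: int, width: int = 64) -> int:
    """Bitwise 3-input lookup: every result bit is picked out of imm8."""
    result = 0
    for i in range(width):
        # Assumed selector order: a is the high index bit, c the low one.
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result


# Feeding the three "selector" patterns in makes the result equal the LUT,
# which is a handy way to derive the immediate for any boolean expression.
A, B, C = 0xF0, 0xCC, 0xAA
XOR3_LUT = A ^ B ^ C          # truth table of a ^ b ^ c
OR_AND_LUT = A | (B & C)      # truth table of a | (b & c)
```

Under these assumptions, `ternlog_sketch(x, y, z, XOR3_LUT)` computes `x ^ y ^ z` in one operation, and any of the 256 possible 3-input functions can be selected the same way.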
added todo for changing crbinlog lut argument to gpr: https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219
in the RFC is not an appropriate place to put these statements: +* TODO: imho taking a LUT from a CR isn't useful, therefore `bincrlut` should + be dropped and `crbinlog` should be changed to take its LUT argument from a GPR. + <https://libre-soc.org/irclog/%23libre-soc.2023-03-10.log.html#t2023-03-10T21:10:56> the reason for using the CR is because it is 4 bits. reading a GPR requires 64-bit port reads of which 60 bits are discarded. with the crweird ops, CR field operations become much more powerful. https://libre-soc.org/openpower/sv/cr_int_predication/
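The "it is 4 bits" point can be illustrated with a small Python sketch: a 2-input boolean function has 2^2 = 4 truth-table entries, which is exactly the size of one CR field. The name `binlut_sketch`, the operand order, and the bit numbering are assumptions for illustration and are not taken from the RFC.

```python
def binlut_sketch(a: int, b: int, lut4: int, width: int = 64) -> int:
    """Bitwise 2-input lookup: every result bit is picked out of a 4-bit LUT."""
    result = 0
    for i in range(width):
        # Assumed selector order: a is the high index bit, b the low one.
        idx = (((a >> i) & 1) << 1) | ((b >> i) & 1)
        result |= ((lut4 >> idx) & 1) << i
    return result


# 4-bit truth tables for some familiar operations (under the assumed ordering).
AND_LUT = 0b1000
OR_LUT = 0b1110
XOR_LUT = 0b0110
```

Because the table is a run-time value rather than an immediate, the same instruction can implement any of the 16 two-input functions, which is where the question of which register file holds the table comes from.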
(In reply to Jacob Lifshay from comment #4) > added todo for changing crbinlog lut argument to gpr: > https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff; > h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219 reverted as it is not appropriate to use the actual RFC itself for the purposes of discussion. a discussion page, the bugtracker, or the IRC channel is the appropriate place to use for discussion.
(In reply to Luke Kenneth Casson Leighton from comment #6) > (In reply to Jacob Lifshay from comment #4) > > added todo for changing crbinlog lut argument to gpr: > > https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff; > > h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219 > > reverted as it is not appropriate to use the actual RFC itself > for the purposes of discussion. a discussion page, the bugtracker, > or the IRC channel is the appropriate place to use for discussion. that wasn't intended to have the discussion on the RFC, but instead to serve as a marker to not submit the RFC until the discussion concluded and to apply the results of the discussion to the RFC.
(In reply to Luke Kenneth Casson Leighton from comment #5) > in the RFC is not an appropriate place to put these statements: > > +* TODO: imho taking a LUT from a CR isn't useful, therefore `bincrlut` > should > + be dropped and `crbinlog` should be changed to take its LUT argument from > a GPR. > + > <https://libre-soc.org/irclog/%23libre-soc.2023-03-10.log.html#t2023-03- > 10T21:10:56> > > the reason for using the CR is because it is 4 bits. > reading a GPR requires 64-bit port reads of which > 60 bits are discarded. > > with the crweird ops, CR field operations become > much more powerful. those are true, however that doesn't change that those operations really should be using GPRs for the lookup table: GPRs are *still* much more powerful than CRs even with crweird ops, CRs are intended for boolean values hence why they don't support much other than boolean operations. GPRs are much more general, they also support shifts, grevs, rotates, loads/stores, etc. all of which are operations you'd generally want to run on the look-up-table, CRs support none of those, so by having no crbinlog instruction that takes the look-up-table from GPRs you've effectively forced the programmer to almost always need a separate GPR to CR move operation for every crbinlog, making it about twice as expensive as necessary. this imho is kinda like adding an instruction to load pi or tau as a fp constant, except deciding that the destination register must only be the CTR register even though almost all uses of pi or tau require the value in a FPR...can you see my point? imho it's fine to have an instruction that takes the look-up-table from a CR (since that's occasionally what you want) but *only* as long as there's always a corresponding instruction that takes the look-up-table from a GPR. if there isn't, I expect the ISA WG to complain and possibly reject crbinlog and friends completely.
Author: Jacob Lifshay <programmerjake@gmail.com> Date: Mon Mar 13 16:40:35 2023 -0700 add binlog to ls007 https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=45bf4930aec86ac11b2773421a315ec3e560f68e
Author: Jacob Lifshay <programmerjake@gmail.com> Date: Mon Mar 13 17:01:28 2023 -0700 add crternlogi to ls007 https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=cfe4cc313cb9d5693af96a6b775cb1abb932c945
(In reply to Jacob Lifshay from comment #8)
> those are true, however that doesn't change that those operations really
> should be using GPRs for the lookup table:

no. see comment #5.

> GPRs are *still* much more powerful than CRs

and more power-hungry.

> all of which are operations you'd
> generally want to run on the look-up-table,

then use binlut followed by a single crweird transfer.
it just occurred to me that all the binlog ops don't have many use cases (the only one I could think of is simulating an fpga, which is a very niche use case), therefore they should not be submitted in the same RFC as the ternlogi ops, to avoid ternlogi getting rejected.
(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Jacob Lifshay from comment #8)
>
> > those are true, however that doesn't change that those operations really
> > should be using GPRs for the lookup table:
>
> no. see comment #5.

My response already took comment #5 into account.

> > GPRs are *still* much more powerful than CRs
>
> and more power-hungry.

doing gpr -> cr move and then crbinlut almost certainly uses *waay more power* than just doing crbinlut with the lut in a GPR: the additional power needed by fetch/decode/dispatch/operand-tracking for a whole separate gpr->cr move op almost certainly dwarfs the power needed to send 64-bits instead of 4.

> > all of which are operations you'd
> > generally want to run on the look-up-table,
>
> then use binlut followed by a single crweird transfer.

that doesn't work well, i'm assuming all of the non-lut sources/dests are CRs, hence why we're trying to use crbinlog. you'd need 4 additional instructions.
drat, drat. crternlogi and crbinlut are 4-in 1-out on CR Fields. assuming 32-bit width does not help for SVP64. https://bugs.libre-soc.org/show_bug.cgi?id=1023#c1
(In reply to Jacob Lifshay from comment #12) > it just occurred to me that all the binlog ops don't have many uses cases > (only one I could think of is simulating a fpga which is a very niche use > case), therefore should not be submitted in the same RFC as the ternlogi ops > to avoid ternlogi getting rejected. between the lack of decent use cases and the gpr vs. cr lut and the unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops from this rfc since they need more design work. imho [cr]ternlogi should stay in after fixing crternlogi to be 3-in 1-out.
(In reply to Jacob Lifshay from comment #15) > between the lack of decent use cases and the gpr vs. cr lut and the > unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops > from this rfc since they need more design work. to all future readers of this comment above: jacob has Asperger's and also fundamentally lacks sufficient technical knowledge of Micro-Architectures to make such judgement calls. jacob please refrain from making such judgements in future without full input. we have been over this before many times: you have next-to-zero knowledge and experience of Hardware Design and consequently make completely mis-informed assessments.
(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Jacob Lifshay from comment #15)
>
> > between the lack of decent use cases and the gpr vs. cr lut and the
> > unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops
> > from this rfc since they need more design work.
>
> to all future readers of this comment above: jacob has Asperger's

note that I don't actually, I did get tested.

> and also
> fundamentally lacks sufficient technical knowledge of Micro-Architectures
> to make such judgement calls.

I disagree.
(In reply to Jacob Lifshay from comment #13)
> > then use binlut followed by a single crweird transfer.
>
> that doesn't work well, i'm assuming all of the non-lut sources/dests are
> CRs, hence why we're trying to use crbinlog. you'd need 4 additional
> instructions.

have a look at Tom Forsyth's video on Larrabee as to why that may be desirable. [crweirds are almost certainly going to need to be micro-coded btw]

Tom explains that the team found a perfect pre-existing suite of instructions in the original Pentium III core, for use as predicate masks, but the distance from the gates inside that core over to the AVX512 units was so great that it would have required slowing down the entire ASIC by an order of magnitude in order to allow speed of light propagation of signals to cross the chip.

they therefore were forced into the situation of adding an entire new suite of instructions, duplicating a perfectly good set, adding literally an entire new regfile that could be placed closer to the AVX512 pipelines, just to get themselves some Predicate Mask operations.

now with crbinlut being near-identical (except 4 bits at a time) and crternlogi likewise to the CRops suite, proposing *CR* based advanced variants of that suite is likely to be well-received by the IBM Hardware Architect team.

a GPR-based version, especially when it wastes 60 bits out of 64, and especially as it will cause new datapaths to be created between the CR pipelines and the GPR Regfile, whose distance may be extremely long in IBM's layout, will go down badly.

we are in other words taking a huge risk by loading CR Fields as Predicate Masks, but at the same time they are perfect for the job. it is a balance that requires some care and some modelling and guesswork of how IBM's extremely large IC might be designed, and to take that *existing* design into consideration.

those "extra" instructions provide a clean RISC-paradigm firebreak between pipeline units whose distance may be simply too far apart. forcing the CR and GPR regfile and pipelines to be close to each other may not go down well.

[i mentioned that crweirds may need to be microcoded because they are an exception to the SVP64 rules: 64 results from 64 CRweird element operations can go into *one single* Scalar GPR.]
(In reply to Luke Kenneth Casson Leighton from comment #18)
> a GPR-based version, especially when it wastes 60 bits out of 64,
> and especially as it will cause new datapaths to be created between
> the CR pipelines and the GPR Regfile, whose distance may be extremely
> long in IBM's layout, will go down badly.

Well, you're missing some critical differences between SVP64 and AVX512:

CRs already have to be close to GPRs because CRs are used basically everywhere in instructions that use both GPRs and CRs:

* isel -- x86's equivalents are very common.
* cmp -- extremely common
* setbc -- will be used basically everywhere (x86's equivalents are, compilers have not caught up yet and still generate a 7-8 insn sequence of shifts, adde, and rotates for bool = i64 < i64)
* every Rc=1 insn -- basically everywhere.

predicate registers are not close to GPRs in AVX512 because they're close *to the simd registers*, in SVP64, those *are GPRs*. so, adding SVP64 makes having CRs close to GPRs probably even more important than ever.

> we are in other words taking a huge risk by loading CR Fields as
> Predicate Masks, but at the same time they are perfect for the
> job. it is a balance that requires some care and some modelling and
> guesswork of how IBM's extremely large IC might be designed, and
> to take that *existing* design into consideration.
>
> those "extra" instructions provide a clean RISC-paradigm firebreak
> between pipeline units whose distance may be simply too far apart.
> forcing the CR and GPR regfile and pipelines to be close to each other
> may not go down well.

the way I see it, if they decide to substantially separate CR and GPR reg files, that won't go down well, because of how often conditional branches are used that source their data from cmp/Rc=1 ops:

according to https://oscarlab.github.io/papers/instrpop-systor19.pdf on x86-64:

* test/cmp is 7.3% of all instructions
* je/jne is 7.5% of all instructions

so i likewise expect about 7% of *all* ppc64le instructions to be compares or Rc=1 from GPRs into CRs so they can be used by a conditional branch.
notes on crbinlut/crternlogi

one of the unit tests is to be Warshall Transitive Closure, using Matrix REMAP. Matrix may however need Inner-Outer Product Mode before that can happen. see bug #1027 (TODO edit crossref to actual subtask)

also useful for JIT runtime, as well as FPGA-style emulation. these are incredibly powerful instructions whose uses are going to take time to emerge.
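For context on the proposed unit test, here is a minimal Python sketch of Warshall's transitive closure on a boolean adjacency matrix. It is plain scalar code written for illustration: the function name `transitive_closure` is hypothetical, and nothing here shows how the real test would use Matrix REMAP or the CR-field instructions. The point of interest is the inner update, `r | (a & b)`, which is a single 3-input boolean function and so the kind of operation a ternary-logic instruction can express in one step.

```python
def transitive_closure(adj: list[list[int]]) -> list[list[int]]:
    """Warshall's algorithm: reach[i][j] is 1 if j is reachable from i."""
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is left untouched
    for k in range(n):               # allow paths that pass through node k
        for i in range(n):
            for j in range(n):
                # 3-input boolean update: keep an existing path, or join
                # a path i -> k with a path k -> j.
                reach[i][j] = reach[i][j] | (reach[i][k] & reach[k][j])
    return reach


# A three-node chain 0 -> 1 -> 2; the closure adds the edge 0 -> 2.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
closure = transitive_closure(chain)
```

The triple loop is the part that a matrix-style index schedule would be expected to generate, leaving one boolean update per element.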
(In reply to Luke Kenneth Casson Leighton from comment #20) > also useful for JIT runtime if you meant [cr]binlog, please explain, it is non-obvious to me. remember JIT is basically the same as any other compiler, it just runs at a different time, so if binlog is useful for a JIT compiler, it'll be equally useful for ahead-of-time compilers.
commit b486fb63e198031e89a6e6e674e82fde15385fd6 (HEAD -> master) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Mon Mar 20 13:19:14 2023 +0000 add wording section to crternlogi ls007 https://bugs.libre-soc.org/show_bug.cgi?id=1017 commit 11c7c7fff60096431d5b60ab928d2b7ea9878527 Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Mon Mar 20 12:45:48 2023 +0000 update ternlogi to be identical to xxeval (ls007), to make it clear that this is supposed to be exactly like xxeval. even refer to the same table https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=b486fb63e198031e89a6e6e674e82fde15385fd6
commit 2672abf1129c6470d519ac82c4e91a8552855690 (HEAD -> master, origin/master, origin/HEAD) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Mon Mar 20 13:43:30 2023 +0000 add wording for binlut crbinlut to ls007 https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=2672abf1129c6470d519ac82c4e91a8552855690
luke, the reason i dropped the argument for now over crbinlut was that i understood we were not submitting any binlut instructions until a later rfc. please remove them from ls007 otherwise I will be forced to argue since I still don't agree with only having a crbinlut instruction that uses a CR as the LUT.
(In reply to Jacob Lifshay from comment #24) > still don't agree with only having a crbinlut instruction that uses a CR as > the LUT. i have explained adequately that those reasons are invalid due to lack of knowledge on your part. the instructions remain in. if you do not understand then please ask questions to clarify - just not under this bugreport, it is dedicated to the writing of the RFC.
(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Jacob Lifshay from comment #24)
>
> > still don't agree with only having a crbinlut instruction that uses a CR as
> > the LUT.

> the instructions remain in.

in that case, please let me insert a note to the isa wg that we considered the alternative of also/instead having a crbinlog that uses a gpr for the lut but you overrode my objections, because otherwise i consider intentionally submitting something with known easily-fixable flaws from my perspective and intentionally hiding them to be unethical and I can not sign off on that. I am not trying to say you are unethical, because you honestly believe those to not be flaws.

> i have explained adequately that those reasons are invalid due to lack of
> knowledge on your part.

I understand your point, however I know that your justification of power usage is almost certainly incorrect, because nearly all usage of crbinlog requires a separate instruction to move the lut from a gpr immediately before crbinlog...executing a full separate instruction almost certainly requires substantially more power than having crbinlog read the lut from a GPR directly.

also, there are ease-of-use and code size considerations due to the extra gpr -> cr move usually being required.
(In reply to Jacob Lifshay from comment #26) > (In reply to Luke Kenneth Casson Leighton from comment #25) > > (In reply to Jacob Lifshay from comment #24) > > > > > still don't agree with only having a crbinlut instruction that uses a CR as > > > the LUT. > > > the instructions remain in. > > in that case, please let me insert a note to the isa wg that we considered > the alternative of also/instead having a crbinlog that uses a gpr for the > lut and it was rejected with 100% rational and reasonable logical technical arguments based on the experience of Tom Forsyth and an *extremely detailed* explanation based on a comprehensive reasonable intelligent analysis of the potential design of POWER9 and POWER10... ... and you completely goddamned ignored every single word i wrote. this repeated pattern from you of ignoring my technical explanations is getting absolutely intolerable, i'm not going to be able to put up with it much longer. > but you overrode my objections, because otherwise i consider > intentionally submitting something with known easily-fixable flaws from my > perspective and intentionally hiding them to be unethical and I can not sign > off on that. jacob: that is the worst possible thing that you could possibly do. please re-read the message that i sent to you privately last week, in full, and please do not discuss this again publicly. thank you for respecting what you are being requested to do. please cease and desist pursuing this any further, the matter is closed.
i changed my mind on crbinlog, luke's argument that it would make the dependency matrix too large for CR/GPR instructions because there are no other instructions that both read/write multiple CR fields and read a GPR makes sense to me -- sad from a software perspective, but, oh well. https://libre-soc.org/irclog/%23libre-soc.2023-03-21.log.html So, no note needed, we now both agree that crbinlog should take the lut argument from a CR.
submitted 2023mar22