Bug 1017 - ISA WG RFC ls007 for binary and ternary bitops
Summary: ISA WG RFC ls007 for binary and ternary bitops
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL: https://libre-soc.org/openpower/sv/rf...
Depends on: 1023
Blocks: 1069
Reported: 2023-03-07 14:12 GMT by Luke Kenneth Casson Leighton
Modified: 2024-01-21 23:18 GMT (History)

See Also:
NLnet milestone: NLnet.2022-08-051.OPF
total budget (EUR) for completion of task and all subtasks: 2000
budget (EUR) for this task, excluding subtasks' budget: 2000
parent task for budget allocation: 1009
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
[jacob]
amount = 1000
submitted = 2023-06-28
paid = 2023-07-12

[lkcl]
amount = 1000
submitted = 2023-06-22
paid = 2023-06-25


Description Luke Kenneth Casson Leighton 2023-03-07 14:12:34 GMT
RFC for crternlogi, ternlogi, binlut and crbinlut
https://libre-soc.org/openpower/sv/bitmanip/#index4h1
Comment 1 Luke Kenneth Casson Leighton 2023-03-07 14:16:28 GMT
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Tue Mar 7 14:15:35 2023 +0000

    add ls007 stub RFC
    bug https://bugs.libre-soc.org/show_bug.cgi?id=1017

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=c3bbcb1791e16b223a9ce9d2551cee6af2dfbdbf
Comment 2 Jacob Lifshay 2023-03-10 01:33:01 GMT
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu Mar 9 17:29:13 2023 -0800

    WIP adding ternlogi to ls007

https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=abe8994372114774e5b315bbc4ff0cd1c0d36345

added ternlogi, still need to add crternlogi and [cr]binlog
Comment 3 Luke Kenneth Casson Leighton 2023-03-10 05:39:19 GMT
> TODO: should we use imm7 instead of imm8 for `ternlogi`? This wouldn't make the decoder significantly more complex, since the immediate doesn't affect whether or not the instruction is defined. <https://libre-soc.org/openpower/sv/bitmanip/ternlogi_simplification_experiment/>

i already said no and i already explained why not.
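
For readers unfamiliar with the imm8 under discussion: a ternary bitwise LUT operation treats an 8-bit immediate as a truth table over three input bits. The following is a minimal sketch of that idea, not the RFC's pseudocode. The function name `ternlog`, the operand names, and the index ordering (a is index bit 2, b is bit 1, c is bit 0) are assumptions made for illustration; the RFC defines the real operand order.

```python
MASK64 = (1 << 64) - 1


def ternlog(a: int, b: int, c: int, imm8: int, width: int = 64) -> int:
    """Bitwise 3-input lookup: result bit i is imm8[idx], where idx is built
    from bit i of a, b and c.

    ASSUMPTION: a supplies index bit 2, b index bit 1, c index bit 0,
    and imm8 is indexed with bit 0 as the least significant bit.
    """
    result = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result


# Example truth tables under the assumed ordering.
LUT_AND3 = 0x80  # only index 0b111 gives 1
LUT_OR3 = 0xFE   # every index except 0b000 gives 1
LUT_XOR3 = 0x96  # indices with odd parity give 1
LUT_MUX = 0xCA   # a ? b : c
```

With all 8 bits available, every one of the 256 possible three-input boolean functions can be selected; the symmetric ones above (AND, OR, XOR) give the same immediate regardless of operand ordering.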
Comment 4 Jacob Lifshay 2023-03-10 22:18:01 GMT
added todo for changing crbinlog lut argument to gpr:
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219
Comment 5 Luke Kenneth Casson Leighton 2023-03-11 03:54:04 GMT
in the RFC is not an appropriate place to put these statements:

+* TODO: imho taking a LUT from a CR isn't useful, therefore `bincrlut` should
+  be dropped and `crbinlog` should be changed to take its LUT argument from a GPR.
+  <https://libre-soc.org/irclog/%23libre-soc.2023-03-10.log.html#t2023-03-10T21:10:56>

the reason for using the CR is because it is 4 bits.
reading a GPR requires 64-bit port reads of which
60 bits are discarded.

with the crweird ops, CR field operations become
much more powerful.

https://libre-soc.org/openpower/sv/cr_int_predication/
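
The "4 bits" point can be illustrated with a small sketch: a two-input lookup table has exactly 2^2 = 4 entries, which is the width of one CR field. The code below is an illustration under stated assumptions, not the RFC's definition. The names `binlut` and `lut_from_gpr` are hypothetical, the index ordering (a is index bit 1, b is index bit 0) is assumed, and taking the low 4 bits of a GPR is only one way a GPR-sourced variant could be defined.

```python
MASK64 = (1 << 64) - 1


def binlut(a: int, b: int, lut4: int, width: int = 64) -> int:
    """Bitwise 2-input lookup: result bit i is lut4[idx], where idx is built
    from bit i of a and b.

    ASSUMPTION: a supplies index bit 1, b index bit 0.
    """
    result = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 1) | ((b >> i) & 1)
        result |= ((lut4 >> idx) & 1) << i
    return result


def lut_from_gpr(gpr: int) -> int:
    """Hypothetical GPR-sourced LUT: only 4 of the 64 bits read are used.

    ASSUMPTION: the low 4 bits are the ones kept; the other 60 are discarded.
    """
    return gpr & 0xF


LUT_AND = 0x8   # 0b1000
LUT_OR = 0xE    # 0b1110
LUT_XOR = 0x6   # 0b0110
LUT_NAND = 0x7  # 0b0111
```

For example, `binlut(a, b, lut_from_gpr(0xDEADBEEF00000006))` behaves as XOR: everything above the low nibble of the 64-bit value has no effect on the result.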
Comment 6 Luke Kenneth Casson Leighton 2023-03-11 03:55:04 GMT
(In reply to Jacob Lifshay from comment #4)
> added todo for changing crbinlog lut argument to gpr:
> https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;
> h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219

reverted as it is not appropriate to use the actual RFC itself
for the purposes of discussion. a discussion page, the bugtracker,
or the IRC channel is the appropriate place to use for discussion.
Comment 7 Jacob Lifshay 2023-03-13 21:24:26 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #4)
> > added todo for changing crbinlog lut argument to gpr:
> > https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;
> > h=a987b9ff6f1e9a151d8af257ffdf71a11bbaa219
> 
> reverted as it is not appropriate to use the actual RFC itself
> for the purposes of discussion. a discussion page, the bugtracker,
> or the IRC channel is the appropriate place to use for discussion.

that wasn't intended to have the discussion on the RFC, but instead to serve as a marker to not submit the RFC until the discussion concluded and to apply the results of the discussion to the RFC.
Comment 8 Jacob Lifshay 2023-03-13 21:42:11 GMT
(In reply to Luke Kenneth Casson Leighton from comment #5)
> in the RFC is not an appropriate place to put these statements:
> 
> +* TODO: imho taking a LUT from a CR isn't useful, therefore `bincrlut`
> should
> +  be dropped and `crbinlog` should be changed to take its LUT argument from
> a GPR.
> + 
> <https://libre-soc.org/irclog/%23libre-soc.2023-03-10.log.html#t2023-03-
> 10T21:10:56>
> 
> the reason for using the CR is because it is 4 bits.
> reading a GPR requires 64-bit port reads of which
> 60 bits are discarded.
> 
> with the crweird ops, CR field operations become
> much more powerful.

those are true, however that doesn't change that those operations really should be using GPRs for the lookup table:
GPRs are *still* much more powerful than CRs even with crweird ops, CRs are intended for boolean values hence why they don't support much other than boolean operations. GPRs are much more general, they also support shifts, grevs, rotates, loads/stores, etc. all of which are operations you'd generally want to run on the look-up-table, CRs support none of those, so by having no crbinlog instruction that takes the look-up-table from GPRs you've effectively forced the programmer to almost always need a separate GPR to CR move operation for every crbinlog, making it about twice as expensive as necessary.

this imho is kinda like adding an instruction to load pi or tau as a fp constant, except deciding that the destination register must only be the CTR register even though almost all uses of pi or tau require the value in a FPR...can you see my point?

imho it's fine to have an instruction that takes the look-up-table from a CR (since that's occasionally what you want) but *only* as long as there's always a corresponding instruction that takes the look-up-table from a GPR. if there isn't, I expect the ISA WG to complain and possibly reject crbinlog and friends completely.
Comment 9 Jacob Lifshay 2023-03-13 23:43:04 GMT
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Mar 13 16:40:35 2023 -0700

    add binlog to ls007

https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=45bf4930aec86ac11b2773421a315ec3e560f68e
Comment 10 Jacob Lifshay 2023-03-14 00:02:59 GMT
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Mon Mar 13 17:01:28 2023 -0700

    add crternlogi to ls007

https://git.libre-soc.org/?p=libreriscv.git;a=commit;h=cfe4cc313cb9d5693af96a6b775cb1abb932c945
Comment 11 Luke Kenneth Casson Leighton 2023-03-14 09:37:37 GMT
(In reply to Jacob Lifshay from comment #8)

> those are true, however that doesn't change that those operations really
> should be using GPRs for the lookup table:

no.  see comment #5.

> GPRs are *still* much more powerful than CRs 

and more power-hungry.

> all of which are operations you'd
> generally want to run on the look-up-table,

then use binlut followed by a single crweird transfer.
Comment 12 Jacob Lifshay 2023-03-14 11:22:12 GMT
it just occurred to me that all the binlog ops don't have many use cases (only one I could think of is simulating a fpga which is a very niche use case), therefore should not be submitted in the same RFC as the ternlogi ops to avoid ternlogi getting rejected.
Comment 13 Jacob Lifshay 2023-03-14 11:31:41 GMT
(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Jacob Lifshay from comment #8)
> 
> > those are true, however that doesn't change that those operations really
> > should be using GPRs for the lookup table:
> 
> no.  see comment #5.

My response already took comment #5 into account.
> 
> > GPRs are *still* much more powerful than CRs 
> 
> and more power-hungry.

doing gpr -> cr move and then crbinlut almost certainly uses *waay more power* than just doing crbinlut with the lut in a GPR: the additional power needed by fetch/decode/dispatch/operand-tracking for a whole separate gpr->cr move op almost certainly dwarfs the power needed to send 64-bits instead of 4.
> 
> > all of which are operations you'd
> > generally want to run on the look-up-table,
> 
> then use binlut followed by a single crweird transfer.

that doesn't work well, i'm assuming all of the non-lut sources/dests are CRs, hence why we're trying to use crbinlog. you'd need 4 additional instructions.
Comment 14 Luke Kenneth Casson Leighton 2023-03-14 15:31:45 GMT
drat, drat. crternlogi and crbinlut are 4-in 1-out on CR Fields.
assuming 32-bit width does not help for SVP64.
https://bugs.libre-soc.org/show_bug.cgi?id=1023#c1
Comment 15 Jacob Lifshay 2023-03-14 22:20:35 GMT
(In reply to Jacob Lifshay from comment #12)
> it just occurred to me that all the binlog ops don't have many use cases
> (only one I could think of is simulating a fpga which is a very niche use
> case), therefore should not be submitted in the same RFC as the ternlogi ops
> to avoid ternlogi getting rejected.

between the lack of decent use cases and the gpr vs. cr lut and the unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops from this rfc since they need more design work.

imho [cr]ternlogi should stay in after fixing crternlogi to be 3-in 1-out.
Comment 16 Luke Kenneth Casson Leighton 2023-03-15 06:54:07 GMT
(In reply to Jacob Lifshay from comment #15)

> between the lack of decent use cases and the gpr vs. cr lut and the
> unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops
> from this rfc since they need more design work.

to all future readers of this comment above: jacob has Asperger's and also
fundamentally lacks sufficient technical knowledge of Micro-Architectures
to make such judgement calls.

jacob please refrain from making such judgements in future without full
input. we have been over this before many times: you have next-to-zero
knowledge and experience of Hardware Design and consequently make completely
mis-informed assessments.
Comment 17 Jacob Lifshay 2023-03-15 08:27:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Jacob Lifshay from comment #15)
> 
> > between the lack of decent use cases and the gpr vs. cr lut and the
> > unacceptable 4-in 1-out issues, imho we should remove all [cr]binlog ops
> > from this rfc since they need more design work.
> 
> to all future readers of this comment above: jacob has Asperger's

note that I don't actually, I did get tested.

> and also
> fundamentally lacks sufficient technical knowledge of Micro-Architectures
> to make such judgement calls.

I disagree.
Comment 18 Luke Kenneth Casson Leighton 2023-03-15 12:28:37 GMT
(In reply to Jacob Lifshay from comment #13)

> > then use binlut followed by a single crweird transfer.
> 
> that doesn't work well, i'm assuming all of the non-lut sources/dests are
> CRs, hence why we're trying to use crbinlog. you'd need 4 additional
> instructions.

have a look at Tom Forsyth's video on Larrabee as to why that may
be desirable.

[crweirds are almost certainly going to need to be micro-coded btw]

Tom explains that the team found a perfect pre-existing suite of
instructions in the original Pentium III core, for use as predicate
masks, but the distance from the gates inside that core over to the
AVX512 units was so great that it would have required slowing down the
entire ASIC by an order of magnitude in order to allow speed of light
propagation of signals to cross the chip.

they therefore were forced into the situation of adding an entire new
suite of instructions, duplicating a perfectly good set, adding
literally an entire new regfile that could be placed closer to the
AVX512 pipelines, just to get themselves some Predicate Mask operations.

now with crbinlut being near-identical (except 4 bits at a time)
and crternlogi likewise to the CRops suite, proposing *CR* based
advanced variants of that suite is likely to be well-received
by the IBM Hardware Architect team.

a GPR-based version, especially when it wastes 60 bits out of 64,
and especially as it will cause new datapaths to be created between
the CR pipelines and the GPR Regfile, whose distance may be extremely
long in IBM's layout, will go down badly.

we are in other words taking a huge risk by loading CR Fields as
Predicate Masks, but at the same time they are perfect for the
job. it is a balance that requires some care and some modelling and
guesswork of how IBM's extremely large IC might be designed, and
to take that *existing* design into consideration.

those "extra" instructions provide a clean RISC-paradigm firebreak
between pipeline units whose distance may be simply too far apart.
forcing the CR and GPR regfile and pipelines to be close to each other
may not go down well.

[i mentioned that crweirds may need to be micro-coded because they are
an exception to the SVP64 rules: 64 results from 64 CRweird element
operations can go into *one single* Scalar GPR.]
Comment 19 Jacob Lifshay 2023-03-15 21:41:23 GMT
(In reply to Luke Kenneth Casson Leighton from comment #18)
> a GPR-based version, especially when it wastes 60 bits out of 64,
> and especially as it will cause new datapaths to be created between
> the CR pipelines and the GPR Regfile, whose distance may be extremely
> long in IBM's layout, will go down badly.

Well, you're missing some critical differences between SVP64 and AVX512: CRs already have to be close to GPRs because CRs are used basically everywhere in instructions that use both GPRs and CRs:
isel -- x86's equivalents are very common.
cmp -- extremely common
setbc -- will be used basically everywhere (x86's equivalents are, compilers have not caught up yet and still generate a 7-8 insn sequence of shifts, adde, 
and rotates for bool = i64 < i64)
every Rc=1 insn -- basically everywhere.

predicate registers are not close to GPRs in AVX512 because they're close *to the simd registers*, in SVP64, those *are GPRs*. so, adding SVP64 makes having CRs close to GPRs probably even more important than ever.
> 
> we are in other words taking a huge risk by loading CR Fields as
> Predicate Masks, but at the same time they are perfect for the
> job. it is a balance that requires some care and some modelling and
> guesswork of how IBM's extremely large IC might be designed, and
> to take that *existing* design into consideration.
> 
> those "extra" instructions provide a clean RISC-paradigm firebreak
> between pipeline units whose distance may be simply too far apart.
> forcing the CR and GPR regfile and pipelines to be close to each other
> may not go down well.

the way I see it, if they decide to substantially separate CR and GPR reg files, that won't go down well, because of how often conditional branches are used that source their data from cmp/Rc=1 ops:

according to https://oscarlab.github.io/papers/instrpop-systor19.pdf
on x86-64:
* test/cmp is 7.3% of all instructions
* je/jne is 7.5% of all instructions

so i likewise expect about 7% of *all* ppc64le instructions to be compares or Rc=1 from GPRs into CRs so they can be used by a conditional branch.
Comment 20 Luke Kenneth Casson Leighton 2023-03-17 12:13:56 GMT
notes on crbinlut/crternlogi: one of the unit tests is to be Warshall
Transitive Closure, using Matrix REMAP. Matrix may however need
Inner-Outer Product Mode before that can happen. see bug #1027
(TODO edit crossref to actual subtask)

also useful for JIT runtime, as well as FPGA-style emulation.
these are incredibly powerful instructions whose uses are going to
take time to emerge.
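
As background on why Warshall Transitive Closure suits a ternary-logic test: its inner update, reach[i][j] = reach[i][j] OR (reach[i][k] AND reach[k][j]), is a single three-input boolean function, so it can be expressed as one 8-bit truth table. The sketch below is plain Python and does not model CR fields, SVP64 or Matrix REMAP. The helper names `lut3_for` and `warshall` and the index ordering (first argument is index bit 2) are assumptions for illustration only.

```python
from typing import Callable, List


def lut3_for(fn: Callable[[int, int, int], int]) -> int:
    """Build an 8-bit truth table for a 3-input boolean function.

    ASSUMPTION: the first argument supplies index bit 2, the second bit 1,
    the third bit 0.
    """
    imm = 0
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        imm |= (fn(a, b, c) & 1) << idx
    return imm


# Truth table for a | (b & c); evaluates to 0xF8 under the assumed ordering.
WARSHALL_LUT = lut3_for(lambda a, b, c: a | (b & c))


def warshall(adj: List[List[int]]) -> List[List[int]]:
    """Transitive closure of a 0/1 adjacency matrix via Warshall's algorithm,
    with the inner update performed as a 3-input table lookup."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                idx = (reach[i][j] << 2) | (reach[i][k] << 1) | reach[k][j]
                reach[i][j] = (WARSHALL_LUT >> idx) & 1
    return reach
```

For a graph with edges 0->1 and 1->2, `warshall` adds the implied edge 0->2 and leaves an isolated node untouched.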
Comment 21 Jacob Lifshay 2023-03-17 20:42:57 GMT
(In reply to Luke Kenneth Casson Leighton from comment #20)
> also useful for JIT runtime

if you meant [cr]binlog, please explain, it is non-obvious to me. remember JIT is basically the same as any other compiler, it just runs at a different time, so if binlog is useful for a JIT compiler, it'll be equally useful for ahead-of-time compilers.
Comment 22 Luke Kenneth Casson Leighton 2023-03-20 13:19:59 GMT
commit b486fb63e198031e89a6e6e674e82fde15385fd6 (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Mar 20 13:19:14 2023 +0000

    add wording section to crternlogi ls007
    https://bugs.libre-soc.org/show_bug.cgi?id=1017

commit 11c7c7fff60096431d5b60ab928d2b7ea9878527
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Mar 20 12:45:48 2023 +0000

    update ternlogi to be identical to xxeval (ls007), to make it clear
    that this is supposed to be exactly like xxeval.
    even refer to the same table

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=b486fb63e198031e89a6e6e674e82fde15385fd6
Comment 23 Luke Kenneth Casson Leighton 2023-03-20 13:43:58 GMT
commit 2672abf1129c6470d519ac82c4e91a8552855690 (HEAD -> master, origin/master, origin/HEAD)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Mar 20 13:43:30 2023 +0000

    add wording for binlut crbinlut to ls007

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=2672abf1129c6470d519ac82c4e91a8552855690
Comment 24 Jacob Lifshay 2023-03-20 18:59:36 GMT
luke, the reason i dropped the argument for now over crbinlut was that i understood we were not submitting any binlut instructions until a later rfc. please remove them from ls007 otherwise I will be forced to argue since I still don't agree with only having a crbinlut instruction that uses a CR as the LUT.
Comment 25 Luke Kenneth Casson Leighton 2023-03-20 20:22:17 GMT
(In reply to Jacob Lifshay from comment #24)

> still don't agree with only having a crbinlut instruction that uses a CR as
> the LUT.

i have explained adequately that those reasons are invalid due to lack of
knowledge on your part. the instructions remain in. if you do not understand then
please ask questions to clarify - just not under this bugreport, it is
dedicated to the writing of the RFC.
Comment 26 Jacob Lifshay 2023-03-20 20:47:43 GMT
(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Jacob Lifshay from comment #24)
> 
> > still don't agree with only having a crbinlut instruction that uses a CR as
> > the LUT.

> the instructions remain in.

in that case, please let me insert a note to the isa wg that we considered the alternative of also/instead having a crbinlog that uses a gpr for the lut but you overrode my objections, because otherwise i consider intentionally submitting something with known easily-fixable flaws from my perspective and intentionally hiding them to be unethical and I can not sign off on that. I am not trying to say you are unethical, because you honestly believe those to not be flaws.

> i have explained adequately that those reasons are invalid due to lack of
> knowledge on your part.

I understand your point, however I know that your justification of power usage is almost certainly incorrect because of nearly all usage of crbinlog requiring a separate instruction to move the lut from a gpr immediately before crbinlog...executing a full separate instruction almost certainly requires substantially more power than having crbinlog read the lut from a GPR directly.

also, there are ease-of-use and code size considerations due to the extra gpr -> cr move usually being required.
Comment 27 Luke Kenneth Casson Leighton 2023-03-21 00:18:24 GMT
(In reply to Jacob Lifshay from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #25)
> > (In reply to Jacob Lifshay from comment #24)
> > 
> > > still don't agree with only having a crbinlut instruction that uses a CR as
> > > the LUT.
> 
> > the instructions remain in.
> 
> in that case, please let me insert a note to the isa wg that we considered
> the alternative of also/instead having a crbinlog that uses a gpr for the
> lut 

and it was rejected with 100% rational and reasonable logical technical
arguments based on the experience of Tom Forsyth and an *extremely detailed*
explanation based on a comprehensive reasonable intelligent analysis of the
potential design of POWER9 and POWER10...

... and you completely goddamned ignored every single word i wrote.

this repeated pattern from you of ignoring my technical explanations
is getting absolutely intolerable, i'm not going to be able to put up with
it much longer.

> but you overrode my objections, because otherwise i consider
> intentionally submitting something with known easily-fixable flaws from my
> perspective and intentionally hiding them to be unethical and I can not sign
> off on that. 

jacob: that is the worst possible thing that you could possibly do.
please re-read the message that i sent to you privately last week,
in full, and please do not discuss this again publicly.

thank you for respecting what you are being requested to do.
please cease and desist pursuing this any further, the matter
is closed.
Comment 28 Jacob Lifshay 2023-03-21 01:22:17 GMT
i changed my mind on crbinlog, luke's argument that it would make the dependency matrix too large for CR/GPR instructions because there are no other instructions that both read/write multiple CR fields and read a GPR makes sense to me -- sad from a software perspective, but, oh well.

https://libre-soc.org/irclog/%23libre-soc.2023-03-21.log.html

So, no note needed, we now both agree that crbinlog should take the lut argument from a CR.
Comment 29 Luke Kenneth Casson Leighton 2023-03-22 12:14:14 GMT
submitted 2023mar22