1135 – add FPSCR and Rounding classes to ieee754fpu

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	ALU (including IEEE754 16/32/64-bit FPU)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:

Depends on:
Blocks:	1134 1136 1137

Reported:	2023-08-10 00:28 BST by Jacob Lifshay
Modified:	2024-01-08 01:04 GMT
CC List:	3 users

See Also:	1139 1247
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Description Jacob Lifshay 2023-08-10 00:28:50 BST

to allow FP ops to compute in parallel despite each fp op semantically reading the FPSCR output from the previous op, the FPSCR will be split into 3 parts (I picked names that aren't necessarily standard names):
* volatile part: written nearly every insn but is rarely read
    FR, FI, FPRF
* sticky part: generally doesn't change but is read and written by nearly all insns:
    all the sticky exception bits
* control part: generally doesn't change and is only read by nearly all insns:
    all the other bits

the above is all that will be implemented as part of this task, all of the below is just explaining why we should split FPSCR into parts and is explicitly *OUT-OF-SCOPE* for this task and all follow-on tasks for the near future:

the idea is that the cpu will have all three parts in separate registers and will speculatively execute fp insns with the current value of the sticky part register (not the one from the previous instruction, but the one from the register, avoiding needing a dependency chain), and then will cancel and retry all later insns if it turns out that the insn changed the sticky part (which is rare).

if desired the control part can be put in the same register and handled the same way as the sticky part, but this makes code that temporarily changes the rounding mode slower than necessary (common in x87 emulation and some math library functions).

Comment 1 Luke Kenneth Casson Leighton 2023-08-10 20:52:16 BST

(In reply to Jacob Lifshay from comment #0)
> to allow FP ops to compute in parallel despite each fp op semantically
> reading the FPSCR output from the previous op, the FPSCR will be split into
> 3 parts (I picked names that aren't necessarily standard names):
> * volatile part: written nearly every insn but is rarely read
>     FR, FI, FPRF
> * sticky part: generally doesn't change but is read and written by nearly
> all insns:
>     all the sticky exception bits
> * control part: generally doesn't change and is only read by nearly all
> insns:
>     all the other bits
> 
> the above is all that will be implemented as part of this task, all of the
> below is just explaining why we should split FPSCR into parts:
> 
> the idea is that the cpu will have all three parts in separate registers and
> will speculatively execute fp insns with the current value of the sticky
> part register (not the one from the previous instruction, but the one from
> the register, avoiding needing a dependency chain),

please do not do that.  if there is a dependency chain it is just tough luck.
the programmer is already warned in the spec "some things might be slower"
and surprise, that's what they get.

> and then will cancel and
> retry all later insns if it turns out that the insn changed the sticky part
> (which is rare).

no, you REALLY do not want to be doing that.

follow EXACTLY how XER works, please, starting with adding FPSCR as
"its own register file".

do NOT attempt repeat DO NOT attempt to add "speculation" of ANY KIND.

do NOT attempt repeat DO NOT attempt to make drastic modifications to
the existing design.

do NOT repeat DO NOT assume that "the first implementation has to be
fastest bestest most amazingest most brilliant most highest performance".
we need WORKING, first.

please follow this procedure:

* split the FPSCR-regfile into the four (or more) parts that you advocated
* pass in the parts of FPSCR that *might* be written to, as "read operands"
  (these will be written-out *if* needed)
* pass in an immediate operand (in the Record)
  "fp_overflow_just_like_xer_so_overflow"
  - this if clear is how you know that the copy of FPSCR will not be
    read, and consequently not be written to
  - however if set then you pass through the copy of the FPSCR bits
    right the way through all pipeline stages.
* and EXACTLY as is done with XER.SO when overflow is enabled,
  have the final stage of the pipeline set or clear the "data.ok"
  bit.

this "data.ok" bit will indicate to the register file, which will
have been waiting for that result, that "actually write is not needed".
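
A minimal sketch of the "data"/"ok" procedure listed above, in plain Python rather than nmigen. The names `Data`, `final_stage` and `write_back` are hypothetical and are not taken from soc.git; the `enabled` argument stands in for the immediate operand in the Record. It is only meant to show how a cleared "ok" bit lets the regfile skip the write:

```python
from dataclasses import dataclass


@dataclass
class Data:
    """Hypothetical stand-in for an output record: a value plus an 'ok' flag."""
    data: int = 0
    ok: bool = False


def final_stage(old_flags: int, new_flags: int, enabled: bool) -> Data:
    """Last pipeline stage: pass the flags through and set or clear 'ok'."""
    if not enabled:
        # the copy of the flags was not needed, so no write is requested
        return Data(data=old_flags, ok=False)
    # assumption for this sketch: when enabled, a write is always requested
    return Data(data=old_flags | new_flags, ok=True)


def write_back(regfile: dict, name: str, out: Data) -> bool:
    """Regfile side: perform the write only when the 'ok' flag is set."""
    if out.ok:
        regfile[name] = out.data
    return out.ok
```

With `enabled=False` the regfile entry is left untouched; with `enabled=True` the OR-ed value is written back.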

i repeat DO NOT deviate from the existing micro-architectural design
IN ANY WAY.

BEFORE BEGINNING please can you describe in your own words precisely and
exactly how XER.SO XER.CA/32 and XER.OV/32 work, and how they are part
of a special "regfile".

you will need to analyse the ALU CompUnits and pipelines, as well as the
reg data structures and observe how the Records have an "overflow" entry
that is passed right all the way down through the "stages", and how
each pipeline stage manually copies (and occasionally modifies) the
inputs thru to the outputs, *and* how SPECIAL ATTENTION is paid to
copying the "ok" bit from input thru output

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/common_output_stage.py;hb=HEAD

Comment 2 Luke Kenneth Casson Leighton 2023-08-10 20:59:06 BST

XER regfile:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;h=5ef301a8bf

it should be obvious that a corresponding FPSCR regfile is needed,
and that the bits in each "register" should be of equal length.
if there are extras then tough luck: see line 215:

    215     SO=0 # this is actually 2-bit but we ignore 1 bit of it

although subdividing (grouping) the reg-bits into chunks ("all the other bits")
sounds reasonable, to be honest if they are "fake" then don't bother, just
hard-code them with a Const in the mv routine: this is done with MSR as
well as XER so it's standard fare.

Comment 3 Luke Kenneth Casson Leighton 2023-08-10 21:02:33 BST

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/alu/pipe_data.py;hb=HEAD

regspec (in and out) contain the XER regfile xer_so and xer_ca "registers",
it should be obvious what to do: FPSCR regfile then FPSCR_whatever_was_defined
and the rest is a breeze.

Comment 4 Luke Kenneth Casson Leighton 2023-08-10 21:12:16 BST

log notes on how "speculative execution" is to work.

it is NOT repeat NOT to be implemented in the way that you describe.

it is to be implemented by performing a Shadow-Hold that does nothing
more complex than:

* hold off the write to the regfile until it is known
  - if the exception has occurred (in which case "CANCEL WRITE" is raised)
  - if there is NO CHANCE of the exception occurring
    (in which case "DROP SHADOW" is raised)

it's really very simple.
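
The Shadow-Hold behaviour described above can be modelled roughly as follows. This is a plain-Python sketch with a hypothetical `ShadowedWrite` class, not the Shadow Matrix hardware; it only illustrates that a held write is either discarded ("CANCEL WRITE") or released ("DROP SHADOW"):

```python
class ShadowedWrite:
    """Hypothetical model of one regfile write held under a shadow."""

    def __init__(self, regfile, name, value):
        self.regfile = regfile
        self.name = name
        self.value = value
        self.state = "held"  # the write is held until the shadow resolves

    def cancel_write(self):
        # the exception occurred: the held result is discarded
        if self.state == "held":
            self.state = "cancelled"

    def drop_shadow(self):
        # the exception can no longer occur: the write may proceed
        if self.state == "held":
            self.regfile[self.name] = self.value
            self.state = "written"
```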

but what you need to write right now is the infrastructure that *when*
we get to doing OoO it is easy to do so because everything is in place...

but we are NOT repeat NOT going to smash down the entire 4 years of planning
and design work to create a completely new set of ideas by implementing
something that is unauthorised and has not been discussed or reviewed...
and neither are we going to implement something that would take SIX MONTHS
to perform that review.

please stick to the original micro-architectural plan that was put in place
*well over three years ago*.

and *do not* deviate from what i have written here, nor attempt *in any way*
to implement ANY form of "speculative" pipelining, system, infrastructure,
or design.

the components that i designed, as-designed may be used in *any* type of
micro-architecture.

we happen to be starting with the simplest because it is most straightforward
and gets results the fastest and with the least complications.

further grants can cover *incremental* improvements.

Comment 5 Luke Kenneth Casson Leighton 2023-08-10 21:13:25 BST

https://libre-soc.org/irclog/%23libre-soc.2023-08-09.log.html#t2023-08-09T20:36:37

Comment 6 Jacob Lifshay 2023-08-10 21:37:43 BST

(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
> > the idea is that the cpu will have all three parts in separate registers and
> > will speculatively execute fp insns with the current value of the sticky
> > part register (not the one from the previous instruction, but the one from
> > the register, avoiding needing a dependency chain),
> 
> please do not do that.

I'm not saying we need that now, but later we will, because otherwise all fp arithmetic ops potentially both read and write the sticky bits (since they or-in their detected exceptions). so if a fp add takes 3 cycles, then:
VL = 64
sv.fadd *0, *0, *64

will take at least *192 cycles*, no matter *how wide the SIMD units are or how many there are* because the first element potentially writes the sticky bits which the second reads, so the second element has to wait for the first element, the second element potentially writes the sticky bits which the third element reads, so the third has to wait for the second, and so on.
some of those bits (inexact flag) can't be calculated till the very last pipeline stage since they're calculated as part of the final rounding.
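
The arithmetic behind the 192-cycle figure, as a small sketch. The function names are hypothetical, and the "pipelined" figure assumes an ideal single pipeline issuing one element per cycle with no sticky-bit dependency, which is an assumption for comparison only and not a measured number:

```python
def serialized_cycles(vl: int, latency: int) -> int:
    """Each element must wait for the previous element's sticky bits."""
    return vl * latency


def pipelined_cycles(vl: int, latency: int) -> int:
    """Assumed ideal: one element issued per cycle into a single pipeline."""
    return vl + latency - 1


print(serialized_cycles(64, 3))  # 192
print(pipelined_cycles(64, 3))   # 66
```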

>  if there is a dependency chain it is just tough luck.
> the programmer is already warned in the spec "some things might be slower"
> and surprise, that's what they get.

basically every other OoO cpu has the sticky bits handled specially for exactly the reason that I explained above. SO can be handled using dependency chains since writing to it is uncommon, the sticky bits can't because nearly every fp insn potentially writes to them. other cpus often have special hardware that accumulates the sticky bits from all instructions currently being run, so fp ops can run at full speed since the accumulation can be done for many insns per clock, the slow part is *reading* from that special accumulation hardware, because the cpu has to wait for *all* prior fp ops to complete first, often doing a full cpu flush.

> 
> > and then will cancel and
> > retry all later insns if it turns out that the insn changed the sticky part
> > (which is rare).
> 
> no, you REALLY do not want to be doing that.

that's exactly what we need much later, though for now we can just use dependency chains and just have really slow fp.

> 
> follow EXACTLY how XER works, please, starting with adding FPSCR as
> "its own register file".

none of the FPSCR registers are being added by this bug or any of the ieee754fpu work, that all happens in soc.git, later.

> 
> do NOT attempt repeat DO NOT attempt to add "speculation" of ANY KIND.

none of that is being added here, I'm just planning ahead for when we'll need it much later.

> please follow this procedure:
> 
> * split the FPSCR-regfile into the four (or more) parts that you advocated
> * pass in the parts of FPSCR that *might* be written to, as "read operands"
>   (these will be written-out *if* needed)
> * pass in an immediate operand (in the Record)
>   "fp_overflow_just_like_xer_so_overflow"
>   - this if clear is how you know that the copy of FPSCR will not be
>     read, and consequently not be written to

that doesn't really work because, unlike OE=1 which is usually switched off, *all* fp computation ops *always* generate sticky bits outputs, that need to be or-ed into FPSCR.

some of those sticky outputs are extremely commonly set (such as inexact, which is set whenever there is any rounding error whatsoever), the reason FPSCR doesn't commonly change is because those corresponding flags will usually have already been set in FPSCR, so or-ing in more 1s doesn't change the 1 that's already there.
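
A short sketch of that OR-ing behaviour. The bit positions here are purely illustrative and are not the real FPSCR layout; `accumulate` is a hypothetical helper:

```python
INEXACT = 0b01   # illustrative bit positions, not the real FPSCR layout
OVERFLOW = 0b10


def accumulate(sticky: int, raised: int):
    """OR newly raised flags into the sticky bits; report whether anything changed."""
    merged = sticky | raised
    return merged, merged != sticky


sticky = 0
sticky, changed_first = accumulate(sticky, INEXACT)   # first inexact result
sticky, changed_second = accumulate(sticky, INEXACT)  # a later inexact result
```

Only the first inexact result changes the stored value; the second OR leaves it as it was, which is why the sticky part rarely changes in practice.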

>   - however if set then you pass through the copy of the FPSCR bits
>     right the way through all pipeline stages.

I'm planning on just passing the FPSCR parts through the pipeline stages, modifying the parts as needed.

> * and EXACTLY as is done with XER.SO when overflow is enabled,
>   have the final stage of the pipeline set or clear the "data.ok"
>   bit.

i can do that, and just set the .ok bits when those FPSCR parts need to change. the volatile part will need to change nearly every time (but doesn't usually get read so is fine), the sticky part rarely (but insns can't easily tell until the last pipe stage), and the control part only for specific control insns.

> BEFORE BEGINNING please can you describe in your own words precisely and
> exactly how XER.SO XER.CA/32 and XER.OV/32 work, and how they are part
> of a special "regfile".

I'm simplifying slightly since I don't want to write 10 pages of text:
SO/CA[32]/OV[32] are passed as inputs from registers/dependency-tracking to all relevant ALUs, those ALUs check OE=1, which if set, then they or-in their overflow output and signal that SO/OV[32] need to be written. dependency tracking then checks if the output is set as written and if so delays until the output is computed, then writes that output to the registers and/or other insn inputs as necessary. if the output is not set as written (computable at decode time, but i think we delay for some insns), then the dependency tracking uses the old SO/OV[32]/CA[32] and forwards that from registers/etc. to later insns.

Comment 7 Jacob Lifshay 2023-08-10 21:44:55 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> * hold off the write to the regfile until it is known
>   - if the exception has occurred (in which case "CANCEL WRITE" is raised)
>   - if there is NO CHANCE of the exception occurring
>     (in which case "DROP SHADOW" is raised)

I assumed that, i don't know how you got it into your head that I meant we should rip that out.

according to my plan, fp ops will shadow until they know the sticky bits haven't been changed and no fp traps are needed, at which point they drop the shadow. if either of those are needed, then standard trap procedures are followed, except that setting sticky bits is more like a branch misprediction in that it just restarts the following insns instead of a trap where it changes MSR and goes to e.g. PC=0x700.

Comment 8 Jacob Lifshay 2023-08-10 21:46:00 BST

(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > * hold off the write to the regfile until it is known
> >   - if the exception has occurred (in which case "CANCEL WRITE" is raised)
> >   - if there is NO CHANCE of the exception occurring
> >     (in which case "DROP SHADOW" is raised)
> 
> I assumed that, i don't know how you got it into your head that I meant we
> should rip that out.

to clarify, this is the plan for much later, for now FPSCR will be dependency-tracked like any other register.

> 
> according to my plan, fp ops will shadow until they know the sticky bits
> haven't been changed and no fp traps are needed, at which point they drop
> the shadow. if either of those are needed, then standard trap procedures are
> followed, except that setting sticky bits is more like a branch
> misprediction in that it just restarts the following insns instead of a trap
> where it changes MSR and goes to e.g. PC=0x700.

Comment 9 Jacob Lifshay 2023-08-10 21:56:59 BST

(In reply to Luke Kenneth Casson Leighton from comment #2)
> XER regfile:
> 
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;
> h=5ef301a8bf
> 
> it should be obvious that a corresponding FPSCR regfile is needed,
> and that the bits in each "register" should be of equal length.

really we only need 1 register for each of the 3 different FPSCR parts, those parts are not equal length:
volatile part: 7 bits
sticky part: like 10-15 bits
control part: the rest, like 10-20 bits.

though if we add register renaming at some point (I don't think we're planning on that) we'll need more than 1 of each.

if the pipeline code can only handle 2-bit registers, then it needs improvement.

Comment 10 Jacob Lifshay 2023-08-10 21:58:08 BST

(In reply to Jacob Lifshay from comment #9)
> really we only need 1 register for each of the 3 different FPSCR parts,

so 3 registers total.
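
A rough sketch of what a three-way split could look like as bit masks. This is not the code from ieee754fpu: the names are hypothetical, the LSB0 bit positions are my reading of the Power ISA FPSCR layout and should be checked against the spec, and which bits count as "sticky" is an assumption (the thread only says "all the sticky exception bits"):

```python
# LSB0 bit positions; assumed from the Power ISA FPSCR layout, verify before use.
VOLATILE_MASK = 0x7F << 12                            # FR, FI, FPRF (7 bits)
STICKY_MASK = (1 << 31) | (0x3FF << 19) | (0x7 << 8)  # FX, OX..VXVC, VXSOFT..VXCVI
CONTROL_MASK = ~(VOLATILE_MASK | STICKY_MASK) & ((1 << 64) - 1)  # all the other bits


def split_fpscr(fpscr: int):
    """Split a 64-bit FPSCR value into (volatile, sticky, control) parts."""
    return fpscr & VOLATILE_MASK, fpscr & STICKY_MASK, fpscr & CONTROL_MASK


def join_fpscr(volatile: int, sticky: int, control: int) -> int:
    """Recombine the three parts; the masks are disjoint so OR is lossless."""
    return volatile | sticky | control
```

Because the masks are disjoint and together cover all 64 bits, splitting and rejoining returns the original value.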

Comment 11 Jacob Lifshay 2023-08-10 22:06:35 BST

so, can i start on this task, which is *only* to add the FPSCR class like in openpower-isa and split it into those three parts and add a rounding mode enum?
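
For illustration, a rounding mode enum might look like the sketch below. The class and member names are hypothetical and are not the ones used in ieee754fpu or openpower-isa; the values follow the 2-bit FPSCR RN field encoding as I understand it from the Power ISA, with RN assumed to be the two least-significant bits:

```python
import enum


class RoundingMode(enum.IntEnum):
    """Hypothetical names; values follow the 2-bit FPSCR RN field encoding."""
    RoundToNearest = 0b00
    RoundTowardZero = 0b01
    RoundTowardPosInfinity = 0b10
    RoundTowardNegInfinity = 0b11


def rounding_mode_of(fpscr: int) -> RoundingMode:
    # RN is assumed to occupy the two least-significant bits of FPSCR
    return RoundingMode(fpscr & 0b11)
```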

Comment 12 Luke Kenneth Casson Leighton 2023-08-11 13:05:49 BST

(In reply to Jacob Lifshay from comment #6)

> will take at least *192 cycles*, no matter *how wide the SIMD units

not a problem at all for TestIssuer, which only issues one element at a time

TestIssuer is the sole primary focus of this task.


> basically every other OoO cpu has the sticky bits handled specially for
> exactly the reason that I explained above.

at the regfile an ORing may be performed instead of a "set".
this is NOT your problem to immediately deal with.

i repeat again: we need a WORKING implementation NOT repeat NOT
i repeat again NOT a FAST implementation.


> can be done for many insns per clock, the slow part is *reading* from that
> special accumulation hardware, because the cpu has to wait for *all* prior
> fp ops to complete first, often doing a full cpu flush.

again please LISTEN. other CPUs are NOT IMPORTANT. you keep doing
this.  it is TOO COMPLEX for the size of the Grant work to do
both "first implementation" **AND** "FAST OPTIMAL implementation".

please LISTEN, follow my directions and guidance, and do not argue.

i have told you many many times that you are unable to properly do time
and budget scoping, last time was *literally* 36 hours ago and you
are literally within 36 hours attempting to massively expand
the scope far beyond what the available budget can handle.

STOP IT!!!


> that's exactly what we need much later, though for now we can just use
> dependency chains and just have really slow fp.

correct.  get results, get paid, make NLnet look good, EU is happy to
give them more money, we apply and get it.


> that doesn't really work

it works perfectly: yet again however you are attempting to change
the scope from "A Working First Implementation" to "A massively
complex heavily-optimised implementation".

*please stop doing that*

> because, unlike OE=1 which is usually switched off,
> *all* fp computation ops *always* generate sticky bits outputs, that need to
> be or-ed into FPSCR.

that can be done LATER.  **NOT NOW**. it is a complex task on its own
and if we do not have WORKING code first it is TOO MUCH.

> >   - however if set then you pass through the copy of the FPSCR bits
> >     right the way through all pipeline stages.
> 
> I'm planning on just passing the FPSCR parts through the pipeline stages,
> modifying the parts as needed.

perfect.

please REMOVE all and ANY mention of "speculative execution" from
comment #0.

that is a FOLLOWON task that will require its own special extremely
LARGE budget.


> I'm simplifying slightly since I don't want to write 10 pages of text:
> SO/CA[32]/OV[32] are passed as inputs from registers/dependency-tracking to
> all relevant ALUs, those ALUs check OE=1, which if set, then they or-in
> their overflow output and signal that SO/OV[32] need to be written.

yes. i think you missed that every output is a "Data Record" which has
a data member and an ok member....

> dependency tracking then checks if the output is set as written 

the "ok" flag, yes. you didn't miss it, awesome.

> and if so
> delays until the output is computed, then writes that output to the
> registers and/or other insn inputs as necessary. if the output is not set as
> written (computable at decode time,

it isn't.  it's computable *whether* it *could* be set.

> but i think we delay for some insns),
> then the dependency tracking uses the old SO/OV[32]/CA[32] and forwards that
> from registers/etc. to later insns.

that's way into the future.

TestIssuer simply goes, when the result pops out (NextControl ready flag)
"erm was the ok flag set, if so i'll just request a write to the regfile"
and if there are no outstanding writes left, TestIssuer is ONLY THEN
permitted to even FETCH the next instruction, let alone decode it.

and that gets us the money and it was a lot less work.

----

(future work ONLY, thoroughly out of scope for this task):

one trick you missed above for FPSCR "sticky" bits (and i have not had
time to put this into XER.SO yet, either): if the XER.SO flag is
already set YOU DO NOT NEED TO WRITE IT.

therefore you can REMOVE that Write-Hazard entirely.

the "problem" you describe about how sticky bits would slow things
down is *only* the case if that bit is clear at the time of issue,
and some ORing at periodic intervals (defined in part by the maximum
size of the Shadow Matrix) takes care of the other cases.

but again i repeat again i repeat AGAIN:

under NO CIRCUMSTANCES attempt to implement that right now.

get everything working under TestIssuer only, and please remove
all mention of "speculation" from this task.  it is however
good that you understand the problem and the future direction.

Comment 13 Luke Kenneth Casson Leighton 2023-08-11 13:16:05 BST

(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > * hold off the write to the regfile until it is known
> >   - if the exception has occurred (in which case "CANCEL WRITE" is raised)
> >   - if there is NO CHANCE of the exception occurring
> >     (in which case "DROP SHADOW" is raised)
> 
> I assumed that, i don't know how you got it into your head that I meant we
> should rip that out.

no: you listed *as this task* that a FULL BLOWN OUT OF ORDER CORE
would be written, in order that Shadowing could even be used!

that means that *under this task* you have to get the Shadow Matrices
in, write the unit tests, do EVERYTHING that *I* know is tens of
thousands of Euros worth of work!


> according to my plan, fp ops will shadow until they know the sticky bits
> haven't been changed and no fp traps are needed, at which point they drop
> the shadow.

to do that requires an OoO core as a prerequisite. you're *literally*
designing an ENTIRE Out-of-Order Core for the scant budget of
EUR 2500 or less...

AS PART OF THIS TASK!!

you see what i mean that you are unable to perform budget scoping and
time management?


>  if either of those are needed, then standard trap procedures are
> followed, except that setting sticky bits is more like a branch
> misprediction in that it just restarts the following insns instead of a trap
> where it changes MSR and goes to e.g. PC=0x700.

yep. now you know why TestIssuer only does one instruction at a time.

see OP_TRAP and in particular notice how "sc" is treated as
"a type of trap".

there is absolutely no way in hell that a tiny budget of this size
could possibly cover the FULL writing of an OoO core necessary to
do what you are proposing.

lesson hopefully taken on board: do things INCREMENTALLY, in small
chunks, leveraging what *already exists*.

we'll get more grants, don't worry. but we need to demonstrate an
ability to complete what's in front of us.

Comment 14 Luke Kenneth Casson Leighton 2023-08-11 13:21:55 BST

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/trap/main_stage.py;h=8127e2#l364

OP_SC is treated as a non-optional "branch that happens to swap
MSR and PC with SRR0/1".

OP_TRAP *again* would have Shadow-Hold in an OoO core but for
both TestIssuer and SimpleInOrderCore there is flat-out no chance.
stall is the only "safe" option.

now you know why i said "do everything as FSMs" because there *is*
no "high performance" here.  only one FP operation must be allowed
at a time anyway.

Comment 15 Jacob Lifshay 2023-08-11 15:13:29 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #6)
> TestIssuer is the sole primary focus of this task.

no, ieee754fpu is.

> at the regfile an ORing may be performed instead of a "set".

that doesn't solve the problem that the special accumulator does, which is that future ops are delayed by waiting to see if FPSCR is updated.

> this is NOT your problem to immediately deal with.

yes, as I stated multiple times, this task only splits FPSCR into 3 parts and has *no other work* related to speculative execution.
> 
> i repeat again: we need a WORKING implementation NOT repeat NOT
> i repeat again NOT a FAST implementation.

this task will take an additional 15min or so beyond that, for splitting FPSCR into three parts, which I think is an acceptable amount of time to use on planning ahead. Our CPU is in no way obligated to implement any of the speculative speedups we discussed, those are all for much later.

> 
> > can be done for many insns per clock, the slow part is *reading* from that
> > special accumulation hardware, because the cpu has to wait for *all* prior
> > fp ops to complete first, often doing a full cpu flush.
> 
> again please LISTEN. other CPUs are NOT IMPORTANT. you keep doing
> this.  it is TOO COMPLEX for the size of the Grant work to do
> both "first implementation" **AND** "FAST OPTIMAL implementation".

please listen, I said several times that that's merely why I think FPSCR should be split into 3 parts, other than that, all fast fp methods discussed here *are not part of this task or any other follow on tasks*.
> 
> STOP IT!!!

I am *not expanding the scope* more than the amount of time it takes to type this reply, which is imho a trivial amount of time. I've said explicitly from the beginning:

> the above is all that will be implemented as part of this task, all of the below is just explaining why we should split FPSCR into parts:

this means that implementing any of the cpu optimizations beyond merely splitting FPSCR *IS OUT OF SCOPE*.

> 
> 
> > that's exactly what we need much later, though for now we can just use
> > dependency chains and just have really slow fp.
> 
> correct.  get results, get paid, make NLnet look good, EU is happy to
> give them more money, we apply and get it.
> 
> 
> > that doesn't really work
> 
> it works perfectly: yet again however you are attempting to change
> the scope from "A Working First Implementation" to "A massively
> complex heavily-optimised implementation".
> 
> *please stop doing that*
> 
> > because, unlike OE=1 which is usually switched off,
> > *all* fp computation ops *always* generate sticky bits outputs, that need to
> > be or-ed into FPSCR.
> 
> that can be done LATER.  **NOT NOW**. it is a complex task on its own
> and if we do not have WORKING code first it is TOO MUCH.
> 
> > >   - however if set then you pass through the copy of the FPSCR bits
> > >     right the way through all pipeline stages.
> > 
> > I'm planning on just passing the FPSCR parts through the pipeline stages,
> > modifying the parts as needed.
> 
> perfect.
> 
> please REMOVE all and ANY mention of "speculative execution" from
> comment #0.

no, they are explicitly labeled as out-of-scope in comment #0 (which you somehow missed all this time) and documenting why splitting should occur at all.
> 
> that is a FOLLOWON task that will require its own special extremely
> LARGE budget.

yes, that follow-on task will be much later, not part of this grant. though I wouldn't expect it to be hugely complex to implement since we just reuse branch misprediction machinery which we'll need anyway.

we can plan exactly what we want to do then.
> 
> 
> > I'm simplifying slightly since I don't want to write 10 pages of text:
> > SO/CA[32]/OV[32] are passed as inputs from registers/dependency-tracking to
> > all relevant ALUs, those ALUs check OE=1, which if set, then they or-in
> > their overflow output and signal that SO/OV[32] need to be written.
> 
> yes. i think you missed that every output is a "Data Record" which has
> a data member and an ok member....
> 
> > dependency tracking then checks if the output is set as written 
> 
> the "ok" flag, yes. you didn't miss it, awesome.
> 
> > and if so
> > delays until the output is computed, then writes that output to the
> > registers and/or other insn inputs as necessary. if the output is not set as
> > written (computable at decode time,
> 
> it isn't.  it's computable *whether* it *could* be set.

for CA[32]/OV[32] it should be computable at decode time even if we don't currently.

> 
> > but i think we delay for some insns),
> > then the dependency tracking uses the old SO/OV[32]/CA[32] and forwards that
> > from registers/etc. to later insns.
> 
> that's way into the future.

yes, but that's irrelevant for how the ALUs are designed, as far as they're concerned they produce their outputs and the cpu takes care of them, how it does that is irrelevant.

> one trick you missed above for FPSCR "sticky" bits (and i have not had
> time to put this into XER.SO yet, either): if the XER.SO flag is
> already set YOU DO NOT NEED TO WRITE IT.

yes, but unlike SO there are many sticky bits and there are usually a few that are still zeros (because linux initializes them to zero and some fp exceptions are extremely rare/impossible in important programs, e.g. if you never use sqrt, the invalid sqrt flag can never be set), so relying only on that optimization is unwise.

> under NO CIRCUMSTANCES attempt to implement that right now.

I'm not (beyond merely splitting FPSCR) and was never planning on that and stated that many times.
> 
> get everything working under TestIssuer only, and please remove
> all mention of "speculation" from this task.  it is however
> good that you understand the problem and the future direction.

i'm not removing "speculation" from the description, it's there as documentation of why we're doing what we're doing, it is explicitly labeled *OUT-OF-SCOPE* for this task.

Comment 16 Jacob Lifshay 2023-08-11 15:20:12 BST

(In reply to Jacob Lifshay from comment #15)
> i'm not removing "speculation" from the description, it's there as
> documentation of why we're doing what we're doing, it is explicitly labeled
> *OUT-OF-SCOPE* for this task.

I edited the description to make it much more clearly labeled as *OUT-OF-SCOPE*.

Comment 17 Luke Kenneth Casson Leighton 2023-08-11 16:35:11 BST

(In reply to Jacob Lifshay from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > (In reply to Jacob Lifshay from comment #6)
> > TestIssuer is the sole primary focus of this task.
> 
> no, ieee754fpu is.

Jacob: i used the wrong words.  it should be obvious
that *using* TestIssuer is required to test the FPU,
as if it cannot be integrated into TestIssuer it's
100% useless.

Comment 18 Jacob Lifshay 2023-08-11 17:23:08 BST

(In reply to Luke Kenneth Casson Leighton from comment #17)
> Jacob: i used the wrong words.  it should be obvious
> that *using* TestIssuer is required to test the FPU,

it isn't, none of the current ieee754fpu tests rely on it.

> as if it cannot be integrated into TestIssuer it's
> 100% useless.

yes, more or less.

for this task specifically, my plan is to manually check it against the spec pdf or openpower-isa's fpscr since it's just datatypes.

Comment 19 Luke Kenneth Casson Leighton 2023-08-11 22:38:40 BST

(In reply to Jacob Lifshay from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #17)
> > Jacob: i used the wrong words.  it should be obvious
> > that *using* TestIssuer is required to test the FPU,
> 
> it isn't, none of the current ieee754fpu tests rely on it.

the top level bug funding this is bug #1025. unit tests are
always needed, but so are assembler instruction unit tests
(test_caller_fp.py)

> > as if it cannot be integrated into TestIssuer it's
> > 100% useless.
> 
> yes, more or less.

not "more or less" - just "yes". please don't force me to extract
a simple acknowledgment of a logical progressive chain of tasks.

bug #1025 states:

  The first NLnet grant allowed us to create IEEE754 FP pipelines,
  which now need integration into the Libre-SOC Core, and suitable
  unit tests created. 


> for this task specifically, my plan is to manually check it against the spec
> pdf or openpower-isa's fpscr since it's just datatypes.

ok. then be *really careful* not to do more work than bug #1025 can
handle. there's only EUR 8,000 available and approx. EUR 1,500 of that needs
to go to me.

Comment 20 Jacob Lifshay 2023-08-12 01:59:03 BST

Added FPSCR and FPSCR part classes and some unit tests.

https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=bcb19b219d2b21c3086ded1ec8fa6cf2048020b8

commit bcb19b219d2b21c3086ded1ec8fa6cf2048020b8
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Fri Aug 11 17:52:34 2023 -0700

    add FPSCR with parts
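
For orientation, here is a minimal sketch of the three-way FPSCR split from the description. It is not the code from the commit above. Every name in it (msb0_mask, VOLATILE_MASK, STICKY_MASK, CONTROL_MASK, split_fpscr, join_fpscr) is hypothetical. The bit positions assume the Power ISA MSB0 layout of the 64-bit FPSCR. Leaving the FEX and VX summary bits in the control part as "all the other bits" is likewise an assumption and may differ from what the real classes do.

```python
# Hypothetical sketch only: names and bit assignments are assumptions,
# not the API of the ieee754fpu FPSCR classes.
FPSCR_WIDTH = 64
FULL_MASK = (1 << FPSCR_WIDTH) - 1


def msb0_mask(start, end=None):
    """Mask for MSB0-numbered bits start..end (inclusive) of a 64-bit value."""
    if end is None:
        end = start
    width = end - start + 1
    return ((1 << width) - 1) << (FPSCR_WIDTH - 1 - end)


# volatile part: FR (45), FI (46), FPRF (47:51)
VOLATILE_MASK = msb0_mask(45, 51)

# sticky part: FX (32), OX/UX/ZX/XX (35:38), VXSNAN..VXVC (39:44),
# VXSOFT/VXSQRT/VXCVI (53:55)
STICKY_MASK = msb0_mask(32) | msb0_mask(35, 44) | msb0_mask(53, 55)

# control part: all the other bits (enables, NI, RN, DRN, ...)
CONTROL_MASK = FULL_MASK & ~(VOLATILE_MASK | STICKY_MASK)


def split_fpscr(value):
    """Split a full FPSCR value into (volatile, sticky, control) parts."""
    value &= FULL_MASK
    return (value & VOLATILE_MASK,
            value & STICKY_MASK,
            value & CONTROL_MASK)


def join_fpscr(volatile, sticky, control):
    """Recombine the three parts into a full FPSCR value."""
    return ((volatile & VOLATILE_MASK)
            | (sticky & STICKY_MASK)
            | (control & CONTROL_MASK))
```

Because the three masks are disjoint and together cover all 64 bits, splitting and then joining returns the original value.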

Since this task is pretty small, and NLnet didn't want the hassle of small tasks, I think it would be best to make its budget part of the fminmax task, since that one is somewhat small too.

Comment 21 Jacob Lifshay 2023-08-12 02:29:12 BST

(In reply to Luke Kenneth Casson Leighton from comment #19)
> (In reply to Jacob Lifshay from comment #18)
> > (In reply to Luke Kenneth Casson Leighton from comment #17)
> > > as if it cannot be integrated into TestIssuer it's
> > > 100% useless.
> > 
> > yes, more or less.
> 
> not "more or less" - just "yes".

I said more or less because ieee754fpu is useful to others independently of whether or not they want to use it in soc.git. e.g. if someone wanted to build a RISC-V cpu, they could use ieee754fpu to do the FP for them, with some minor adapting.

Comment 22 Luke Kenneth Casson Leighton 2023-08-12 06:08:02 BST

(In reply to Jacob Lifshay from comment #21)

> I said more or less because ieee754fpu is useful to others 

screw others.  we need money NOW. *WE* need money NOW. tasks
need to be completed NOW.

> independently of
> whether or not they want to use it in soc.git. e.g. if someone wanted to
> build a RISC-V cpu, they could use ieee754fpu to do the FP for them, with
> some minor adapting.

RISC-V and other CPUs can take a hike.

you're wasting my time being excessively pedantic and forcing me into
petty "micro-managing" to get you to focus, which is demeaning for
both of us.

FOCUS ON GETTING THE JOB DONE so that money comes in, because
tasks completed NOW get US money LATER. other people's benefit
is an accidental side-effect but i can pretty much guarantee they
won't bother paying us.

therefore FORGET "other people", they are utterly unimportant.


(In reply to Jacob Lifshay from comment #20)

> Since this task is pretty small, and nlnet didn't want the hassle of small
> tasks, I think it would be best to have its budget be part of the fminmax
> task, since that's somewhat small too.

don't waste time thinking "all tasks shall have budget therefore start
modifying dependencies or depends/blocks".

not all tasks are required to even have a budget, it is too much
detail for NLnet.

also you missed the task "link ieee754fpu into TestIssuer and write
unit tests for the new FPU CompUnit(s)" which is the one that
*actually* gets money assigned to it.

at EUR 8,000 you (not me) need to think in terms of three to five
sub-tasks under bug #1025 that actually get money assigned.

all others under that (with a zero budget, and using blocks/depends
NOT "parent budget allocation") are for *you* to manage and make clear
to others (as part of Audit and Review) what you are doing.

they are also to help YOU keep track of what you are doing,
as there are a lot of small pieces to get done here.