Bug 1184 - Proposal for fixing XLEN: XLEN always is type-len, add FTYPE and FSTYPE for FP
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on:
Blocks:
 
Reported: 2023-10-11 17:36 BST by Jacob Lifshay
Modified: 2023-10-12 02:20 BST
See Also:
NLnet milestone: Future
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2023-10-11 17:36:35 BST
I have an idea for how to handle XLEN for FP:
Currently, for FP, XLEN is poorly defined for BFP16 and BFloat16: XLEN has to both match the element size and distinguish BFP16 from BFloat16, which are conflicting requirements since both are 16-bit types.

Therefore, I think we should introduce two new variables, similar to XLEN, that cleanly specify which types are used for FP and FP Single instructions, allowing XLEN to convey only the type size. FTYPE specifies which format is used for FP instructions, as well as the in-register format for FP Single instructions. FSTYPE specifies which format is used for computations in FP Single instructions.

| SVP64 elwid | Int XLEN | FP XLEN | FTYPE    | FSTYPE | Notes          |
|-------------|----------|---------|----------|--------|----------------|
| 00          | 64       | 64      | BFP64    | BFP32  | DEFAULT values |
| 01          | 32       | 32      | BFP32    | BFP16  |                |
| 10          | 16       | 16      | BFP16    | -      |                |
| 11          | 8        | 16      | BFloat16 | -      |                |
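
As a rough illustration of the table above, the mapping could be modeled as a simple lookup. This is only a sketch: the names `ElwidEntry`, `ELWID_TABLE` and `lookup` are hypothetical and do not come from the spec or the simulator, and `None` stands in for the `-` entries.

```python
from typing import NamedTuple, Optional


class ElwidEntry(NamedTuple):
    """One row of the proposed table (hypothetical helper type)."""
    int_xlen: int
    fp_xlen: int
    ftype: str
    fstype: Optional[str]  # None where the table has "-"


# Keys are the 2-bit SVP64 elwid encodings from the table.
ELWID_TABLE = {
    0b00: ElwidEntry(64, 64, "BFP64", "BFP32"),  # default values
    0b01: ElwidEntry(32, 32, "BFP32", "BFP16"),
    0b10: ElwidEntry(16, 16, "BFP16", None),
    0b11: ElwidEntry(8, 16, "BFloat16", None),
}


def lookup(elwid: int) -> ElwidEntry:
    """Return the XLEN/FTYPE/FSTYPE settings for a given elwid."""
    return ELWID_TABLE[elwid]
```

Note that with this layout, elwid 10 and 11 share the same FP XLEN of 16 and are told apart only by FTYPE.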

Additionally, pseudocode will, instead of using bfp64_* or bfp32_* pseudocode functions directly, use new f_* and fs_* pseudocode functions that switch on F[S]TYPE and call the appropriate functions.

e.g.:

fatan2s pseudocode becomes:
FRT <- DOUBLE(fs_ATAN2(SINGLE(FRA), SINGLE(FRB)))

fatan2 pseudocode becomes:
FRT <- f_ATAN2(FRA, FRB)
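
A minimal Python sketch of how an `f_*` function might switch on FTYPE is given here. It is an assumption-laden illustration, not the proposed pseudocode: `ROUNDERS` and `_round_via_struct` are hypothetical helpers, the result is computed in the host's float64 and then rounded to the target format (which is not a correctly rounded ATAN2), and BFloat16 is left unimplemented.

```python
import math
import struct


def _round_via_struct(fmt):
    """Build a function that rounds a Python float to a narrower IEEE format
    by packing and unpacking it with the given struct format code."""
    def rounder(x):
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    return rounder


# Hypothetical per-format rounding steps, keyed by FTYPE name.
ROUNDERS = {
    "BFP64": lambda x: x,              # Python floats are already binary64
    "BFP32": _round_via_struct("<f"),  # IEEE binary32
    "BFP16": _round_via_struct("<e"),  # IEEE binary16
}


def f_ATAN2(ftype, a, b):
    """Dispatch on FTYPE, standing in for calls to bfp64_*/bfp32_*/..."""
    try:
        rounder = ROUNDERS[ftype]
    except KeyError:
        raise NotImplementedError(f"FTYPE {ftype!r} not modeled in this sketch")
    return rounder(math.atan2(a, b))
```

An `fs_ATAN2` would look the same but switch on FSTYPE instead.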


This still leaves the issue of what to set XLEN to for instructions like ctfpr that are both integer and FP operations. I had proposed having FLEN for FP and XLEN for integer, but that was rejected (maybe reconsider?).
Comment 1 Luke Kenneth Casson Leighton 2023-10-11 18:33:47 BST
(In reply to Jacob Lifshay from comment #0)
> I have an idea for how to handle XLEN for FP:
> currently, for FP, XLEN is poorly-defined for BFP16 and BFloat16, since we
> currently need XLEN to both match element size and allow distinguishing
> between BFP16 and BFloat16, which are conflicting requirements since they're
> both 16-bit types.

a function helps there. or, a global variable inserted into the
namespace (and spec)


> This still leaves the issue of what to set XLEN to for instructions like
> ctfpr that are both integer and fp operations, I had proposed having FLEN
> for FP and XLEN for integer, but that was rejected (maybe reconsider?).

it's the one crossover point that makes the different elwidths a bit
hairy.  INT ops can get away with overcalculating then truncating
(dropping bits) but FP ops can't.

i really wanted to avoid two XLENs. fp-int converts i think make them
unavoidable but i am sure there is a workaround.

good to record: can we leave detailed discussions until much later.
Comment 2 Jacob Lifshay 2023-10-11 18:37:41 BST
another closely related issue: fp ops with different src and dest elwidth end up double-rounding according to the current spec, which is bad.
Comment 3 Luke Kenneth Casson Leighton 2023-10-11 20:57:23 BST
(In reply to Jacob Lifshay from comment #2)
> another closely related issue, fp ops with different src and dest elwidth
> end up double-rounding according to the current spec, which is bad.

good reason for programmers to avoid doing that by not using
different widths, then, isn't it?

we are not here to "nanny" people [making hardware more complex
in order to "protect" them from shooting themselves in the foot]

one to think through in the future. not now.
Comment 4 Jacob Lifshay 2023-10-12 02:20:03 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > another closely related issue, fp ops with different src and dest elwidth
> > end up double-rounding according to the current spec, which is bad.
> 
> good reason for programmers to avoid doing that by not using
> different widths, then, isn't it?
> 
> we are not here to "nanny" people [making hardware more complex
> in order to "protect" them from shooting themselves in the foot]

well, now that I think of it, we may be making the hardware more complex by *not* avoiding double-rounding. e.g.:

sv.fadds/sw=f64/dw=f32 has to do:

convert f64-in-f32 sources to internal format
add sources
round result to f32 (as expensive as converting to f32 due to denormals)
convert f32 to internal format
round to f16 (as expensive as converting to f16)
convert f16 to f32

if we avoided double rounding, it would be:
convert f64-in-f32 sources to internal format
add sources
round result to f16 (as expensive as converting to f16)
convert f16 to f32

Note that when the inputs are the same type or strictly smaller than the outputs, then there isn't a problem, because the extra conversions on the inputs are exact and so we can just convert straight to the internal format instead of doing two conversions.
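
To make the double-rounding concern concrete, here is a small Python sketch using only the standard `struct` module. It rounds conversions rather than the result of an add, so it illustrates the effect rather than the exact instruction sequence above. The helper names are hypothetical, and it relies on CPython rounding to nearest-even when packing `"e"` (binary16) and `"f"` (binary32).

```python
import struct


def round_to_f32(x: float) -> float:
    """Round a float64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_f16(x: float) -> float:
    """Round a float64 value to the nearest binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Just above the midpoint between the f16 neighbours 1.0 and 1.0 + 2**-10.
x = 1.0 + 2.0**-11 + 2.0**-30

# Single rounding: x is above the midpoint, so it rounds up.
direct = round_to_f16(x)

# Double rounding: the f32 step discards the 2**-30 term, leaving an exact
# tie, which the f16 step then resolves to the even neighbour, 1.0.
double_rounded = round_to_f16(round_to_f32(x))

# Widening is exact: an f16 value survives f16 -> f32 -> f64 unchanged.
h = round_to_f16(0.1)
widened = round_to_f32(h)
```

Here `direct` and `double_rounded` differ by one f16 ulp, while `widened` equals `h`, matching the observation that conversions to a larger type are exact.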

So, what I think we should do about it:
I think we should define all FP operations where the output type is not the same as the intermediate type, or where the input conversion is not exact, as either undefined behavior or a trap. This leaves us free to define better semantics later as another ISA extension without that being a breaking change for SW.

e.g.:
* sv.fadd/sw=f32/dw=f64
  is defined since both the output and intermediate
  types are f64.
* sv.fadd/sw=f64/dw=f32
  is UB or trap since the output type (f32) isn't
  the intermediate type (f64).
* sv.ctfpr/sw=64/dw=f16
  is defined since the output defines the intermediate
  type since the input isn't FP.
* sv.fadds/sw=f32/dw=f64
  is defined since both the output and intermediate
  types are f64.
* sv.fadd/sw=f64/dw=f16
  is UB or trap since the output type (f16) isn't
  the intermediate type (f64).
* sv.fadd/sw=f16/dw=bf16
  is UB or trap since for the intermediate type being:
  * f16 -- the output type doesn't match the intermediate type
  * bf16 -- the input conversion isn't exact (f16 has more mantissa bits)
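
One possible reading of this rule can be sketched in Python as a small checker. This is an interpretation, not spec text: it assumes the intermediate type is taken to be the output type, so an operation is defined exactly when every value of the source format is representable in the destination format. `FORMATS`, `converts_exactly` and `is_defined` are hypothetical names, and a source of `None` stands for a non-FP (integer) input such as in ctfpr.

```python
from typing import Optional

# (exponent bits, explicit mantissa bits) for each format.
FORMATS = {
    "f16": (5, 10),
    "bf16": (8, 7),
    "f32": (8, 23),
    "f64": (11, 52),
}


def converts_exactly(src: str, dst: str) -> bool:
    """True if every src value is exactly representable in dst, i.e. dst has
    at least as many exponent bits and at least as many mantissa bits."""
    src_exp, src_mant = FORMATS[src]
    dst_exp, dst_mant = FORMATS[dst]
    return dst_exp >= src_exp and dst_mant >= src_mant


def is_defined(src: Optional[str], dst: str) -> bool:
    """Apply the proposed rule with the intermediate type equal to dst.
    Integer sources (src=None) are always defined, since the output alone
    then determines the intermediate type."""
    if src is None:
        return True
    return converts_exactly(src, dst)
```

For example, `is_defined("f32", "f64")` is true, while `is_defined("f64", "f32")` and `is_defined("f16", "bf16")` are false, in line with the cases listed above.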