in https://bugs.libre-soc.org/show_bug.cgi?id=708#c4 the idea was raised to allow the length of signals within each partition to be *less* than the full length that each partition is actually capable of, and for the length limit to be dynamically specifiable depending on the runtime partition settings. for example:

* a PartitionedSignal of size 32
* four partitions allowed (8-8-8-8)
* when the partition lengths are set to e.g. 16-16 the actual signals are restricted to 8-8, with the upper 8 bits of each 16-16 being ignored
* when the partitions are set to e.g. 24-8 the actual signals are restricted to 16-4

these alternative lengths are to be *simultaneously* supported at runtime. whilst in many cases "workarounds" can be done which mitigate the need for this as a primary feature, certain circumstances, such as IEEE754 FP exponent and mantissa being strange non-partition-aligning values depending on the partition settings, make this challenging to work around. this therefore needs a proper evaluation, including estimating the time taken to design, write unit tests and Formal Correctness Proofs, as well as evaluating whether it is possible to cleanly adapt the existing PartitionedSignal. Documented here: https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
i gave this some thought and i believe it should be reasonably straightforward to adapt PartitionedSignal and the 15 or so specialist modules behind it to cope with adaptive Signal truncation.

the more recent ones (Cat, Assign) should be easy to enhance because they loop through all combinations available to the mask and create a switch statement. truncating and zeroing the output from each lane, in each of those switch statements, is therefore pretty straightforward: it can be handled by a function which performs the truncation. the required truncation width is very straightforward to look up on a per-switch basis. concatenation of the post-truncated partial results is also simple, exactly as is done already.

the earlier ones (eq gt le shift) which have already been optimised and are based around a gate-level design principle will take a bit more thought and effort.

based on an approximate average of:

* half a day per code-change (to add to the specialised module)
* half a day for the unit test
* half a day for the Formal Proof

and there being around 20 Type I (Ast) "operators":

* shift add sub neg
* or not and xor
* eq gt lt
* Assign Cat Part Switch Repl
* bool all some any xor

the time estimated for this task is around 35 days. some of these will take less (and, or, not), some a lot longer (shift/rot) due to their already being optimised, and being particularly complex (3 weeks) to originally analyse.
in the existing (regularly-allocated, fully-allocated) design the partitions do not need special treatment. within a given partition the partial result is always computed, and from that an optimisation has been possible to deploy at the gate level which a naive "switch" on the mask could not possibly hope to replicate. a variable length per partition breaks that... *unless* a pre-analysis phase is carried out which splits out the high-level "required" partitioning from a *lower-level* partitioning.

settings:

* all 32 and 16-bit values are actually to be truncated to 11 bit
* all 8-bit values to 5-bit

from these we can write out:

            31        16 15       8 7      0
    32bit   |10 ....                     0 |
    16bit   |26 ... 16 |  |10 ....      0 |
    8bit    | 28.24 | 20.16 | 12..8 | 4.. 0 |

thus, we deduce, we *actually* need breakpoints at:

    28 26 24 16 12 8 4 0

and when 32bit is set:

* 10 is closed
* 11-31 "masked to zero" (clock gated, in future)

when 16bit:

* 10 is closed
* 16 is closed
* 26 is closed
* 10-15 is zeromasked / clock-gated
* 27-31 is zeromasked / clock-gated

when 8bit:

* 4 is closed
* 8 is closed
* 12 is closed
* 16 is closed
* 24 is closed
* 28 is closed

which is a lot more partitions, and a lot more work, but provides an implementation *in terms of existing code* with very little disruption. it's also very straightforward. reconstruction of results is also easily done by leveraging the existing optimised code: it is just that the partition sizes are non-uniform lengths. a "map" function can easily take care of converting the incoming spec (required lengths) to actual partition sizes and an "active" mask, which, interestingly, can hook directly into the planned "Assign" conditional (predicated) assignment.

some care needs to be taken to ensure that the inputs are not too heavily MUXed, creating long gate delays. at the moment all inputs are unconditional, and the selection is done on the output. Cat will need some more thought, to see if the irregular size can be dealt with.
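the pre-analysis step above can be sketched in plain python. this is a hypothetical helper (the name layout_breakpoints and the exact API are made up for illustration, not existing code), and it follows the comment's convention of naming each "used-field" breakpoint by the MSB of the truncated value (e.g. 10 for an 11-bit field at bits 0..10):

```python
def layout_breakpoints(total_width, spec):
    """For each {partition_width: used_width} entry, collect the
    breakpoints that width needs: each aligned lane boundary, plus
    the MSB of the truncated (used) field inside each lane.  The
    union across all widths is the lower-level partitioning the
    existing gate-level code must be given."""
    per_width = {}
    for pwidth, used in spec.items():
        points = set()
        for start in range(0, total_width, pwidth):
            if start:                       # aligned lane boundary
                points.add(start)
            if used < pwidth:               # MSB of the used field
                points.add(start + used - 1)
        per_width[pwidth] = points
    union = sorted(set().union(*per_width.values()))
    return per_width, union
```

running it on the settings above (32-bit signal, 32/16-bit values truncated to 11 bits, 8-bit to 5 bits) reproduces the per-width "closed" points listed in the comment; the union is simply their merge.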
in particular, construction of the "size spec" for the output may be a bit hairy. each possible length combination under each partition mask value will have to be calculated.
PartitionedSignal.shape() is a good function to use as the definition of the lengths. question: is a PartitionedShape necessary, and what are the implications if it is derived from ast.Shape()?
insight: initially thought, "why have gaps at all?" why not have the alternative-length parts back-to-back. Spec:

* 32bit -> 11 bit
* 16bit -> 2x 10 bit
* 8bit -> 4x 8 bit

Actual layout:

* 32 bit: 10..0
* 16 bit: 19..10 plus 9..0
* 8 bit: 31..24 23..16 15..8 7..0

no. it has to be:

* 32 bit: 10..0
* 16 bit: 25..16 9..0
* 8 bit: 31..24 23..16 15..8 7..0

because of alignment when "interacting" as the input/output involving other (regularly-sized) partitioned signals. without the alignment to regular boundaries you actually need special conditional partition-aware routing, which is very awkward. alignment cleans that up, but has to have "gaps" as a result.
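the difference between the two layouts can be shown with a tiny illustrative helper (hypothetical name aligned_lanes, not existing code): it returns the (lsb, msb) bit-range of each lane's used field, first with lanes aligned to regular lane boundaries (leaving padding gaps), then packed back-to-back:

```python
def aligned_lanes(total_width, lane_width, used_width):
    """Compare aligned (gapped) vs back-to-back (packed) placement
    of used_width-bit fields inside lane_width-bit lanes."""
    # aligned: each used field starts on a regular lane boundary
    aligned = [(s, s + used_width - 1)
               for s in range(0, total_width, lane_width)]
    # packed: fields are butted up against each other, no gaps
    packed = [(i * used_width, (i + 1) * used_width - 1)
              for i in range(total_width // lane_width)]
    return aligned, packed
```

for the 16-bit case above (two 10-bit values in a 32-bit signal) this gives 25..16 and 9..0 aligned, versus 19..10 and 9..0 packed, exactly the two layouts compared in the comment.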
to make sure this thought is written down: i realised a few days ago, some of the Case/Switch submodules for PartitionedSignal, particularly Cat, Part and Assign, are approaching the gate density of full crossbars. that isn't going to fly. a way to reduce the number of required entries is needed: the option to specify what combinations of mask are supported.
I already wrote code to fully determine the layout needed for partitioned signals including padding (SimdLayout), as well as separating out the actual signals determining which partitions are enabled across a whole SIMD circuit (SimdPartMode) -- making it easy to convert layouts and connect signals with different layouts but that are still partitioned in the same abstract way -- essentially creating a specific Python class whose instances represent compatible categories of layouts. It also has code for representing the different ways padding can be filled (SimdPaddingKind) -- it probably should be merged into SimdLayout.

https://salsa.debian.org/Kazan-team/simd_signal

See TestSimdPartMode and TestSimdLayout in test_simd_signal.py for some examples. All that is needed is to adapt code so SimdLayout and SimdPartMode work with PartitionedSignal.
(In reply to Luke Kenneth Casson Leighton from comment #5)
> to make sure this thought is written down: i realised a few days ago,
> some of the Case/Switch submodules for PartitionedSignal, particularly
> Cat, Part and Assign, are approaching the gate density of full crossbars.
> that isn't going to fly.
>
> a way to reduce the number of required entries is needed: the option
> to specify what combinations of mask are supported.

SimdPartMode partially solves this problem by explicitly reducing the set of supported layouts to those that you can produce by a buddy allocator. this removes the less useful lane sizes (that we realistically weren't going to ever use anyway) and retains the more useful ones. example table of supported and excluded layouts:
https://salsa.debian.org/Kazan-team/simd_signal/-/blob/b77b188c51c91c3ddfdd6f41c9e22e0bfa5d41e3/src/tests/test_simd_signal.py#L19
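the buddy-allocator restriction can be sketched as a small recursive enumeration. this is an assumed model of the idea (the name buddy_layouts and the tuple representation are made up for illustration, not the SimdPartMode implementation): a region of 2^n parts is either one whole lane, or two independently-subdivided halves, so layouts like 8-16-8 that straddle the midpoint are excluded:

```python
def buddy_layouts(parts):
    """Enumerate lane layouts (as tuples of lane sizes, in parts)
    that a buddy allocator can produce over `parts` equal parts
    (parts must be a power of 2)."""
    if parts == 1:
        return [(1,)]
    halves = buddy_layouts(parts // 2)
    results = [(parts,)]            # the whole region as one lane
    for left in halves:             # or: split, each half recursing
        for right in halves:
            results.append(left + right)
    return results
```

for 4 parts this yields 5 layouts, and a "middle-straddling" layout such as (1, 2, 1) is correctly absent, which is the point of the restriction: fewer mask combinations, hence fewer switch cases.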
(In reply to Jacob Lifshay from comment #6)
> I already wrote code to fully determine the layout needed for partitioned
> signals including padding (SimdLayout),

ok good, because this moves out of the area originally for which PS was designed.

> All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> PartitionedSignal.

excellent, that would be great.

i am having a bit of a hard time understanding SimdLayout, etc., can you cut the type information, all "casting", they are extraneous lines, and add some docstrings explaining the purpose of the main functions?

my feeling is, it should be possible to return an object from the standard override of Value.shape() and use that. in most cases shape() is used to determine the signedness. if that doesn't pan out it can be adapted.

a way to fit with PartitionPoints is needed: an adapter that creates the dict that goes straight into PartitionPoints yet can also be used to return the list from which the case statements can be constructed. will find an example... there is too much code to do intrusive rewrites btw. ok here we go:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cat/cat.py;hb=HEAD#l100

that's dirt simple. the switch is on the mask (extracted from PPoints), and it enumerates all possible values. *without* being massively complex, we need something that *does not* REPLACE PartitionPoints but feeds input *TO* PartitionPoints, then when enumerated, instead of creating a full set of all combinations of cases, only creates those actually needed. or, adds a simple function *to* PartitionPoints which returns the options.

bear in mind that some of the code cannot be adapted to *not* generate full options: it doesn't work by switch/case, it's gate-level. these submodules all take the same mask bits, hence why it is necessary to fit with straight mask bits and keep PartitionPoints, not try ripping things apart. *careful* compatibility and *bare minimum* nondisruptive changes.
but first please can you take out type info and add docstrings so i can see what is going on with SimdLane
going back to what is needed:

* elwidth as input (or an enum, or string)
* for each elwidth, an actual width (power of 2) (or the number of subdivisions)
* and a (non-power-of-2) "use this many bits" width

this suggests a dict as the spec. mantissa:

    { 0b00 : (64, 53), # FP64
      0b01 : (32, 23), # FP32x2
      0b10 : (16, 10), # FP16x4
      0b11 : (16, 5),  # BF16x4
    }

no types, no classes. turning that into PartitionPoints, a simple function creates a dict and a mask Signal. this i would expect to be about... 30 lines. a second function, assuming an Enum Signal (or an elwidth Signal of 2 bits), would return a list of comb statements that:

* accept the enum as input
* perform a switch
* set the relevant mask bits

that would again be no more than about... 25 lines. could you do a quick hack prototype of the first of those functions? take the dict as input (and a width parameter) and return the partition points dictionary?
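a rough plain-python sketch of what that first function could compute (the name spec_to_partpoints is hypothetical, and sets of elwidths stand in for the mask Signal bits so the sketch runs without nmigen; the convention here is that partition point N sits between bits N-1 and N):

```python
def spec_to_partpoints(spec, width):
    """Take a {elwidth: (lane_width, used_width)} spec plus the
    overall signal width, and return {bit_position: set_of_elwidths}:
    for each partition-point position, which elwidth settings need
    that point "closed" (an aligned lane boundary, or the end of the
    truncated used field within a lane)."""
    points = {}
    for elwidth, (lane_width, used_width) in spec.items():
        for start in range(0, width, lane_width):
            if start:                     # aligned lane boundary
                points.setdefault(start, set()).add(elwidth)
            if used_width < lane_width:   # end of the used field
                points.setdefault(start + used_width, set()).add(elwidth)
    return dict(sorted(points.items()))
```

fed the mantissa spec above with width=64, point 32 is needed by the FP32x2, FP16x4 and BF16x4 settings, while point 53 is shared by FP64 (its 53-bit mantissa) and BF16x4 (48+5).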
(In reply to Luke Kenneth Casson Leighton from comment #9)
> going back to what is needed:
>
> * elwidth as input (or, an enum, or string)
> * for each elwidth, an actual width (power of 2)
>   (or, the number of subdivision)
> * and a (nonp2) "use this many bits" width
>
> this suggests a dict as the spec. mantissa:
>
> { 0b00 : (64, 53), # FP64
>   0b01 : (32, 23), # FP32x2
>   0b10 : (16, 10), # FP16
>   0b11 : (16, 5),  # BF16
> }
>
> no types, no classes.

Oh, what I have as types that can be SimdLayout.cast (like nmigen Shape.cast) are 1 of 3 options:

1. dict-like types that map lane-size-in-abstract-parts to Shape-castable values. This (after Shape.cast-ing all dict values) is the canonical internal representation of a layout (stored in self.lane_shapes). example:

    { # keys are always powers of 2
      1: 5,              # 1-part lanes are u5
      2: unsigned(3),    # 2-part lanes are u3
      4: range(3, 25),   # 4-part lanes are u5 since that fits 24
      8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
    }

or:

    { # keys are always powers of 2
      1: signed(1),         # 1-part lanes are i1
      2: signed(3),         # 2-part lanes are i3
      4: range(-30, 25),    # 4-part lanes are i6 since that fits -30
      8: MySignedBoolEnum,  # 8-part lanes are i1
      16: signed(0),        # 16-part lanes are i0, zero-bit shapes are supported
    }

2. callables that take lane-size-in-abstract-parts and return Shape-castable values. Useful for making non-uniform layouts quickly. example:

    lambda lane_size: signed(lane_size * 8)

dict equivalent (for part_count == 8):

    { 1: signed(8),
      2: signed(16),
      4: signed(32),
      8: signed(64),
    }

3. a Shape-castable value that is used for all lane-sizes. Useful for making uniform layouts quickly.
example:

    5

dict equivalent (for part_count == 8):

    { 1: unsigned(5),
      2: unsigned(5),
      4: unsigned(5),
      8: unsigned(5),
    }

From that dict-equivalent (stored as self.lane_shapes), it computes the minimum number of bits needed for each abstract part required to pack those chosen lane shapes into their abstract parts -- the bits needed per abstract part is self.padded_bits_per_part (same value for all parts). From that, member functions can calculate the total required signal width in bits (self.width), and the unpadded/padded slices for each lane (self.lane_bits_slice(lane, include_padding=False/True)).

> turning that into PartitionPoints, a simple
> function creates a dict and a mask Signal.

PartitionPoints needs to be revamped to carry all the necessary information -- that's what SimdLayout is. The mask signal is wrapped by SimdPartMode, which provides all the necessary introspection and utility functions to make it easy to use.
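the three input kinds can be modelled in a few lines of plain python. this is a simplified stand-in, not the actual SimdLayout code: Shape here is a namedtuple imitating nmigen's Shape, and shape_cast only handles ints and ranges (nmigen's real Shape.cast additionally handles Enums):

```python
from collections import namedtuple

# minimal stand-in for nmigen's Shape, just for illustration
Shape = namedtuple("Shape", "width signed")

def _bits(v, signed):
    # bits needed for v, unsigned or two's-complement signed
    if not signed:
        return v.bit_length()
    return v.bit_length() + 1 if v >= 0 else (-v - 1).bit_length() + 1

def shape_cast(obj):
    # simplified model of nmigen Shape.cast (ints and ranges only)
    if isinstance(obj, Shape):
        return obj
    if isinstance(obj, range):
        lo, hi = min(obj), max(obj)
        signed = lo < 0
        return Shape(max(_bits(lo, signed), _bits(hi, signed)), signed)
    return Shape(obj, False)      # a plain int is an unsigned width

def simd_layout_cast(arg, part_count=8):
    """The 3 accepted input kinds described above: a dict mapping
    lane-size-in-parts to castable values, a callable from lane
    size to a castable value, or one castable used for all sizes."""
    sizes = [1 << i for i in range(part_count.bit_length())]
    if isinstance(arg, dict):
        return {k: shape_cast(v) for k, v in arg.items()}
    if callable(arg):
        return {s: shape_cast(arg(s)) for s in sizes}
    return {s: shape_cast(arg) for s in sizes}
```

with this model, range(3, 25) casts to a u5 (fits 24) and range(-30, 25) to an i6 (fits -30), matching the worked comments in the examples above.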
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #6)
> > I already wrote code to fully determine the layout needed for partitioned
> > signals including padding (SimdLayout),
>
> ok good, because this moves out of the area originally for which PS
> was designed.
>
> > All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> > PartitionedSignal.
>
> excellent, that would be great.
>
> i am having a bit of a hard time understanding SimdLayout,
> etc., can you cut the type information, all "casting", they are
> extraneous lines, and add some docstrings explaining the
> purpose of the main functions?
>
> my feeling is, it should be possible to return an object
> from the standard override of Value.shape() and use that.
> in most cases shape() is used to determine the signedness.
> if that doesn't pan out it can be adapted.

I think we should use:

    class PartitionedSignal(...):
        ...
        def __init__(self, layout=1, *, ...):
            self.layout = SimdLayout.cast(layout)
        ...
        shape = None

since shape is limited by nmigen to just Shape(width, signedness) and nothing else.

> a way to fit with PartitionPoints is needed: an adapter
> that creates the dict that goes straight into Partition
> Points yet also can be used to return the list from
> which the case statements can be constructed, will find
> an example

well, unfortunately, PartitionPoints just can't really express the required padding info. assuming a layout of:

    SimdLayout({
        1: unsigned(10),
        2: unsigned(12),
        4: unsigned(15),
    })

which ends up with a 40-bit signal with lanes like so:

    bit #:       30v       20v       10v       0
           [---u10--][---u10--][---u10--][---u10--]
           [ppppppp----u12----][ppppppp----u12----]
           [pppppppppppppppppppppppp------u15-----]

There are 2 possible approaches for emulating PartitionPoints, neither good:

1.
if we have the padding as separate partitions, then the created circuits will be unnecessarily complex due to excessive partitions, with partition points at bits 10, 12, 15, 20, 30, and 32. This gets significantly worse for 8-part layouts.

2. if we have padding included in their partitions, which gives partition points of 10, 20, and 30, then many of the existing operators will calculate the wrong results, since they fail to account for padding.

> but first please can you take out type info and add docstrings
> so i can see what is going on with SimdLane

yeah, i'll add docs. the type info *is* part of the docs, just put in a standardized spot where tools can read it. after all, someone writing code needs to be able to quickly know what types a function expects, not have to look through 10(!) different files following a chain of nested function calls just to figure out how to call a particular function. That actually happened to me several times when I was trying to write the div pipe, whereas if they had docs that specified what types they expected, i would have saved a bunch of time (10s instead of 10min).
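the geometry behind the 40-bit example can be modelled in plain python. this is an assumed sketch, not the SimdLayout implementation (layout_geometry and lane_slice are invented names; self.padded_bits_per_part, self.width and self.lane_bits_slice are the real attributes the comment describes):

```python
def layout_geometry(lane_shapes, part_count):
    """From {lane_size_in_parts: bit_width}, compute the bits needed
    per abstract part (the max over ceil(width / lane_size)), the
    total padded signal width, and the unpadded (lsb, msb) slice of
    each lane at a given lane size."""
    # ceil-divide each lane's width over its parts; take the worst case
    bits_per_part = max(-(-w // size) for size, w in lane_shapes.items())
    width = bits_per_part * part_count
    def lane_slice(size, lane):
        # lane's used field starts at its aligned part boundary
        lsb = lane * size * bits_per_part
        return (lsb, lsb + lane_shapes[size] - 1)
    return bits_per_part, width, lane_slice
```

for the layout {1: 10, 2: 12, 4: 15} over 4 parts this gives 10 bits per part and a 40-bit signal, with the upper u12 lane occupying bits 31..20 (8 bits of padding above it), matching the diagram above.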
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > (In reply to Jacob Lifshay from comment #6)
> > > I already wrote code to fully determine the layout needed for partitioned
> > > signals including padding (SimdLayout),
> >
> > ok good, because this moves out of the area originally for which PS
> > was designed.
> >
> > > All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> > > PartitionedSignal.
> >
> > excellent, that would be great.
> >
> > i am having a bit of a hard time understanding SimdLayout,
> > etc., can you cut the type information, all "casting", they are
> > extraneous lines, and add some docstrings explaining the
> > purpose of the main functions?
> >
> > my feeling is, it should be possible to return an object
> > from the standard override of Value.shape() and use that.
> > in most cases shape() is used to determine the signedness.
> > if that doesn't pan out it can be adapted.
>
> I think we should use:
>
>     class PartitionedSignal(...):
>         ...
>         def __init__(self, layout=1, *, ...):
>             self.layout = SimdLayout.cast(layout)
>         ...
>         shape = None
>
> since shape is limited by nmigen to just Shape(width, signedness) and nothing
> else.
>
> > a way to fit with PartitionPoints is needed: an adapter
> > that creates the dict that goes straight into Partition
> > Points yet also can be used to return the list from
> > which the case statements can be constructed, will find
> > an example
>
> well, unfortunately, PartitionPoints just can't really express the required
> padding info. assuming a layout of:
>
>     SimdLayout({
>         1: unsigned(10),
>         2: unsigned(12),
>         4: unsigned(15),
>     })
>
> which ends up with a 40-bit signal with lanes like so:
>
>     bit #:       30v       20v       10v       0
>            [---u10--][---u10--][---u10--][---u10--]
>            [ppppppp----u12----][ppppppp----u12----]
>            [pppppppppppppppppppppppp------u15-----]
>
> There are 2 possible approaches for emulating PartitionPoints, neither good:
> 1.
> if we have the padding as separate partitions,

yes. this is the simplest option that does not involve throwing away tested code. it is an incremental, nondisruptive approach.

> then the created circuits
> will be unnecessarily complex due to excessive partitions with partition
> points at bits:
> 10, 12, 15, 20, 30, and 32. This gets significantly worse for 8-part layouts.

not if the combinations (assuming a switch) are limited, and not if an additional "assignment" mask is added, where, *incrementally and one at a time*, each of the submodules is evaluated and adapted. "inefficient but functional" is a crucial thing right now, because this needs to be done FAST: it is holding up further development (critical path). yet at the same time over-engineering is going to bite us.

> 2. if we have padding included in their partitions, which gives partition
> points of 10, 20, and 30, then many of the existing operators will calculate
> the wrong results since they failed to account for padding.

that will only happen if the starting points are not set, as a requirement, to be on an "aligned" boundary (exactly like byte/word alignment for software). i covered this in comment #2.

> > but first please can you take out type info and add docstrings
> > so i can see what is going on with SimdLane
>
> yeah, i'll add docs. the type info *is* part of the docs, just put in a
> standardized spot where tools can read it.

honestly they're a nuisance. massive over-engineering, resulting in so much cruft to "manage the existence of the type" that it at least triples the amount of code, with the end result *harder* to read, not easier.

the example there *actually* has 7 breakpoints. PartitionPoints can be used to do it, but it simply needs an "adapter" function (a pair of adapter functions, total about 60 lines). the adapter function takes just the 3 options shown as inputs (which will go into the switch) and outputs *only* 3 possible cases setting the right combination of 7 partition bits.
the "unused" bits in between, although they calculate results, those results are *ignored*, or not wired in or out; but even if not ignored they do no harm. *LATER* we zero them out, and even later we add individual clock gating on some of them if they're partly used. but we can't do clock gating yet because we don't have a clock-gating cell in the cell library.

it's fully compatible with the existing code, nondisruptive, and has the advantage that same-sized same-spec signals can be straight-wired: no messing about with partition-aware muxing just to connect a damn signal.

> after all, someone writing code
> needs to be able to quickly know what types a function expects, not have to
> look through 10(!) different files following a chain of nested function
> calls just to figure out how to call a particular function. That actually
> happened to me several times when I was trying to write the div pipe,
> whereas if they had docs that specified what types they expected, i would
> have saved a bunch of time (10s instead of 10min).

chains like that are pretty standard fare for a big project. tips: keep the files open, on-screen, in multiple terminals, persistently, as a working set. i use a 6x4 virtual desktop and move to a different view, leaving the persistent xterms open. last record was 300 days (power cut), 3,500 vim sessions in 100 xterms on 15 virtual desktops.
(In reply to Jacob Lifshay from comment #11)
> assuming a layout of:
>
>     SimdLayout({
>         1: unsigned(10),
>         2: unsigned(12),
>         4: unsigned(15),
>     })

noooo, different types of signedness down at the lane level is completely out of the question. that significantly changes the design and, worse, breaks a fundamental assumption: that the object can be treated "as if it was one thing" by nmigen Type 1 (ast.*) and Type 2 (dsl.Module) language constructs. this would become very bad and very complicated, very quickly, and i don't see any need for it. therefore the signedness can in fact be removed entirely, and the nmigen Shape signed/unsigned flag used.
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > this suggests a dict as the spec. mantissa:
> >
> > { 0b00 : (64, 53), # FP64
> >   0b01 : (32, 23), # FP32x2
> >   0b10 : (16, 10), # FP16
> >   0b11 : (16, 5),  # BF16
> > }
> >
> > no types, no classes.
>
> Oh, what I have as types that can be SimdLayout.cast (like nmigen
> Shape.cast) are 1 of 3 options:

3 options when one will do, and can be covered by a dict, is massive complication and overengineering. the *actual* needs come from a 2-bit elwidth, where at each elwidth i would have said even the power-2 is implicitly understood, if it wasn't for the fact that in FP we specify BF16 as one of the options.

int elwidths:

    0b00 64
    0b01 32
    0b10 16
    0b11 8

FP:

    0b00 FP64
    0b01 FP32
    0b10 FP16
    0b11 BF16 (16 bit, not 8)

that puts the "aligned power-2" width (or the number of power-2 partitions) in the requirements, and it can be the first of the tuple under each key. the second requirement is the *useful* width at each elwidth, non-power-2 sized. there are no other requirements, because supporting different signed/unsigned is out of the question. it's very much simpler than SimdKind.

> 1. dict-like types that map lane-size-in-abstract-parts to Shape-castable
> values:
> This (after Shape.cast-ing all dict values) is the canonical internal
> representation of a layout (stored in self.lane_shapes).
> example:
> { # keys are always powers of 2
>   1: 5,              # 1-part lanes are u5
>   2: unsigned(3),    # 2-part lanes are u3
>   4: range(3, 25),   # 4-part lanes are u5 since that fits 24
>   8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
> }

we need neither Enums nor range, and *definitely* signed/unsigned is out of the question. even if using this type of specification, how does it relate to elwidths?
> or:
> { # keys are always powers of 2
>   1: signed(1),         # 1-part lanes are i1
>   2: signed(3),         # 2-part lanes are i3
>   4: range(-30, 25),    # 4-part lanes are i6 since that fits -30
>   8: MySignedBoolEnum,  # 8-part lanes are i1
>   16: signed(0),        # 16-part lanes are i0, zero-bit shapes are supported
> }

no. range, enum and signed... arrgh, these are *subtypes*? no, absolutely not. no way. this is far too advanced, far too complicated.

> 2. callables that take lane-size-in-abstract-parts and return Shape-castable
> values. Useful for making non-uniform layouts quickly.
> example:
> lambda lane_size: signed(lane_size * 8)
>
> dict equivalent (for part_count == 8):
> { 1: signed(8),
>   2: signed(16),
>   4: signed(32),
>   8: signed(64),
> }

signed is out of the question.

> 3. a Shape-castable value that is used for all lane-sizes. Useful for making
> uniform layouts quickly.
> example:
> 5
>
> dict equivalent (for part_count == 8):
> { 1: unsigned(5),
>   2: unsigned(5),
>   4: unsigned(5),
>   8: unsigned(5),
> }

unsigned is out of the question.

> From that dict-equivalent (stored as self.lane_shapes), it computes the
> minimum number of bits needed for each abstract part required to pack those
> chosen lane shapes into their abstract parts -- the bits needed per abstract
> part is self.padded_bits_per_part (same value for all parts).
>
> From that, member functions can calculate the total required signal width in
> bits (self.width), and the unpadded/padded slices for each lane
> (self.lane_bits_slice(lane, include_padding=False/True)).
>
> > turning that into PartitionPoints, a simple
> > function creates a dict and a mask Signal.
>
> PartitionPoints needs to be re-vamped to carry all the necessary information
> -- that's what SimdLayout is.

above i have established that you are making fundamental assumptions (not discussed until now, and only just discovering what they are) which was part of the silence and rushing ahead creating code without first talking it through.
what you *believe* is necessary information is in fact unnecessary, and massive over-engineering, making for several days of wasted coding time, unfortunately. *if* it was actually practical to support massive numbers of disparate data types *then* SimdKind etc. would be necessary. but you didn't talk this through before going ahead, so you have gone down a blind alley.

from the *much* more basic pragmatic requirements we can incrementally leverage existing code. if during the actual deployment we end up finding use-cases that are not covered, *then* we can stop and reevaluate. however by keeping it SIMPLE from the very start and taking some short cuts we can get the job done FAST, and optimise later. practical pragmatic engineering.

so. with that in mind, can you please write a quick prototype pair of functions which use a dict as input, as in comment #2:

* the first returning a PartitionPoints as a result
* the second taking a 2-bit Signal, the same dict spec, a PartitionPoints instance and a module (m), where it uses the 2-bit Signal as input to a Switch, and the cases set the prerequisite mask bits embedded as values in the PP instance

each function should (he said) be self-contained, not require calling any external functions, and be an absolute maximum of 35 lines (without comments). *no types*, not even classes! throw them into the codebase near make_partition2 (above PartitionPoints) with some really quick demo tests; we can work out where to put them later.

this is an important and valuable lesson for you, in doing ultra-basic pragmatic programming that an HDL Engineer or a 20-year-old 2nd-year undergraduate student can look at and go, "oh, that's blindingly obvious".
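to illustrate what the second function's switch cases would compute, here is a hedged plain-python model (elwidth_mask_cases is an invented name; in the real function each returned integer would instead become comb assignments to the mask Signal bits inside an m.Switch on the 2-bit elwidth Signal, which is why this sketch deliberately avoids nmigen):

```python
def elwidth_mask_cases(spec, width, positions):
    """For each elwidth in a {elwidth: (lane_width, used_width)} spec,
    compute the mask value its switch case would set: one bit per
    entry of the sorted partition-point positions, 1 = that point is
    "closed" for this elwidth (a lane boundary or a used-field end)."""
    positions = sorted(positions)
    cases = {}
    for elwidth, (lane_width, used_width) in spec.items():
        needed = set()
        for start in range(0, width, lane_width):
            if start:                     # aligned lane boundary
                needed.add(start)
            if used_width < lane_width:   # end of the truncated field
                needed.add(start + used_width)
        cases[elwidth] = sum(1 << i for i, p in enumerate(positions)
                             if p in needed)
    return cases
```

the point of the exercise: the spec dict goes in once to build the PartitionPoints positions, and again here to yield only the 3 or 4 cases actually needed, rather than enumerating every combination of mask bits.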
(In reply to Luke Kenneth Casson Leighton from comment #13) > (In reply to Jacob Lifshay from comment #11) > > > assuming a layout of: > > SimdLayout({ > > 1: unsigned(10), > > 2: unsigned(12), > > 4: unsigned(15), > > }) > > noooo, different types of signedness down at the lane level is > completely out of the question. Actually, SimdLayout specifically asserts that all lanes have the same signedness, essentially it's just there to make passing in Shapes easy, and because, conceptually, it's the lanes who's shape we care about, the whole signal could be signed/unsigned/whatever it doesn't matter as long as the lanes behave as expected. > that significantly changes the design and, worse, breaks a fundamental > assumption that the object can be treated as "if it was one thing" > by nmigen Type 1 (ast.*) and Type 2 (dsl.Module) language constructs. As currently designed SimdLayout is *not* a subclass of Shape, so idk where you think it fits into nmigen, because it doesn't -- nmigen is never passed a SimdLayout instance.
(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Jacob Lifshay from comment #10)
> > (In reply to Luke Kenneth Casson Leighton from comment #9)
> >
> > > this suggests a dict as the spec. mantissa:
> > >
> > > { 0b00 : (64, 53), # FP64
> > >   0b01 : (32, 23), # FP32x2
> > >   0b10 : (16, 10), # FP16
> > >   0b11 : (16, 5),  # BF16
> > > }
> > >
> > > no types, no classes.
> >
> > Oh, what I have as types that can be SimdLayout.cast (like nmigen
> > Shape.cast) are 1 of 3 options:
>
> 3 options when one will do and can be covered by a dict
> is massive complication and overengineering.

internally, it always uses a dict mapping abstract-lane-sizes (pow2 integers) to nmigen Shape instances. the other options (only as inputs to SimdLayout.cast/.__init__) are there because it saves a bunch of code on the caller's side, making it much easier to read, and because accepting a Shape directly (option 3) instead of always requiring a dict makes it way easier to use all our existing non-simd code mostly unmodified.

> the *actual* needs come from a 2-bit elwidth, where at each
> elwidth i would have said even the power-2 is implicitly
> understood, if it wasn't for the fact that in FP we specify
> BF16 as one of the options.

elwidth ends up filling the SimdPartMode.part_starts bits (like the original PartitionPoints, except indexing abstract parts rather than bits). SimdLayout uses those Signals to select which possible lanes are enabled, based on the internal dict mapping abstract-lane-sizes to nmigen Shape instances. The current SimdPartMode/SimdLayout design still supports doing stuff like 4xu8 and 1xu32 all simultaneously, because, iirc, we are still planning on supporting packing ops into simd alus at the 32-bit level, and we could have a 4xu8 op followed by a 1xu32 op.
I initially thought about having an enum as the "mask" signal (like you're proposing here) instead of abstracted partition-points, but I rejected that idea because of the sheer number of enumerants needed to express the desired 4xu8-with-1xu32 combinations.

> int elwidths:
>
> 0b00 64
> 0b01 32
> 0b10 16
> 0b11 8
>
> FP:
>
> 0b00 FP64
> 0b01 FP32
> 0b10 FP16
> 0b11 BF16 16 bit not 8

FP16/BF16 -- didn't think of that, will require either signalling outside of SimdLayout or redesigning SimdLayout to cope.

> that puts the "aligned power 2" width (or the number of power2 partitions)
> on the requirements, and it can be the first of the tuple under each key.
>
> the second requirement is the *useful* width at each elwidth, nonpow2
> sized.
>
> there are no other requirements, because supporting different signed/unsigned
> is out of the question.

I partially disagree: uniform signed/unsigned is needed, and is supported by SimdLayout by having input lanes' types be nmigen Shape (or castable via Shape.cast). non-uniform is unnecessarily complicated and SimdLayout will raise AssertionError for that case.

> > 1. dict-like types that map lane-size-in-abstract-parts to Shape-castable
> > values:
> > This (after Shape.cast-ing all dict values) is the canonical internal
> > representation of a layout (stored in self.lane_shapes).
> > example:
> > { # keys are always powers of 2
> >   1: 5,              # 1-part lanes are u5
> >   2: unsigned(3),    # 2-part lanes are u3
> >   4: range(3, 25),   # 4-part lanes are u5 since that fits 24
> >   8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
> > }
>
> we need neither Enums nor range, and *definitely* signed/unsigned is
> out of the question.

all of those are converted to Shape instances by nmigen's Shape.cast (called by SimdLayout's constructor). once constructed, the internal fields use only nmigen Shape instances.
Signed/Unsigned is needed because we need to support signed/unsigned
multiply, signed/unsigned compare, signed/unsigned divide (as a SIMD ALU, not
as a PartitionedSignal op), signed/unsigned right shift, etc.

Oh, i just realized ALU-level signed/unsigned (separate from lane-level
signedness) is another thing that needs to go in the SimdPartMode key, along
with F16/BF16/etc. so the key would be like:

class MyIntKey(Enum):
    U8 = ...
    I8 = ...
    U16 = ...
    I16 = ...
    U32 = ...
    I32 = ...
    U64 = ...
    I64 = ...

INT_WIDTH_IN_PARTS = {
    MyIntKey.U8: 1,
    MyIntKey.I8: 1,
    MyIntKey.U16: 2,
    MyIntKey.I16: 2,
    MyIntKey.U32: 4,
    MyIntKey.I32: 4,
    MyIntKey.U64: 8,
    MyIntKey.I64: 8,
}

class MyFpKey(Enum):
    F16 = ...
    BF16 = ...
    F32 = ...
    F64 = ...

FP_WIDTH_IN_PARTS = {
    MyFpKey.F16: 1,
    MyFpKey.BF16: 1,
    MyFpKey.F32: 2,
    MyFpKey.F64: 4,
}

> even if using this type of specification, how does it relate to
> elwidths?

elwidths determine which lane-size-in-abstract-parts is used.

> > or:
> > { # keys are always powers of 2
> >     1: signed(1), # 1-part lanes are i1
> >     2: signed(3), # 2-part lanes are i3
> >     4: range(-30, 25), # 4-part lanes are i6 since that fits -30
> >     8: MySignedBoolEnum, # 8-part lanes are i1
> >     16: signed(0), # 16-part lanes are i0, zero-bit shapes are supported
> > }
>
> no. range enum and signed... arrgh, these are *subtypes*?

no, they're just types that Shape.cast supports converting to Shape.

> no, absolutely not. no way. this is far too advanced, far too complicated.

it's easy, we let Shape.cast do all the hard work, SimdLayout just passes the
inputs to nmigen Shape.cast.
(In reply to Jacob Lifshay from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > (In reply to Jacob Lifshay from comment #11)
> >
> > > assuming a layout of:
> > > SimdLayout({
> > >     1: unsigned(10),
> > >     2: unsigned(12),
> > >     4: unsigned(15),
> > > })
> >
> > noooo, different types of signedness down at the lane level is
> > completely out of the question.
>
> Actually, SimdLayout specifically asserts that all lanes have the same
> signedness, essentially it's just there to make passing in Shapes easy, and
> because, conceptually, it's the lanes whose shape we care about, the whole
> signal could be signed/unsigned/whatever it doesn't matter as long as the
> lanes behave as expected.

jacob: PartitionedSignal does not comprehend lane-based independent signs or
types. none of the submodules know about it. there is *one* internal Signal
which is given the shape at the max bitwidth required, and its sign, and
that's it. adding anything else on top of that is too much.

the requirements here are very very simple. this is an enhancement we are
discussing which cannot disrupt the existing code, and it has to be minimal
and ultra quick and simple to implement.

you could have written the functions by now: they are far shorter than the
text that you have typed *resisting* the path that i am trying to illustrate
to you and the important insight i would like to share. if i am forced to do
the code myself you will in no way have the same insight and understanding.

please: can you write the two functions, they should take you only about 20
minutes absolute max.
(In reply to Jacob Lifshay from comment #16)
> it's easy, we let Shape.cast do all the hard work, SimdLayout just passes
> the inputs to nmigen Shape.cast.

it's massively overengineered. the very fact that you wrote text 5x longer
than the simple functions i outlined should be telling you everything you
need to know, here. it also goes far ahead of what i can understand and
process, and is absorbing far more time than it should. you're rushing ahead
and solving problems that haven't yet been demonstrated to be needed.

i have *no problem at all* with *ending up* with a design that is as
complicated as SimdKind, however it needs to be demonstrated and driven by
*actual need* rather than "predicted non-demonstrated requirements".

can you please therefore write those two simple functions which will fit
into the existing code. this will then allow progress, and effort put into
fitting PartitionedSignal into the ALUs. *at that* point, if we *actually*
encounter *actual* needs for advanced features, *then and only then* do we
come back and revisit.

yes, i used to also overengineer. it didn't go well.
> Signed/Unsigned is needed because we need to support signed/unsigned
> multiply, signed/unsigned compare, signed/unsigned divide (as a SIMD ALU,
> not as a PartitionedSignal op), signed/unsigned right shift, etc.

yes. and this is already done. but NOT on a per lane basis. on a GLOBAL
basis. therefore sign should not even be part of SimdKind, because it is part
of the PartitionedSignal.

please can you simply write those two simple functions, which i have asked
for four times now, and it will become clear when they are integrated into
PartitionedSignal and PartitionPoints.
wip layout demo script:

from collections.abc import Mapping
from pprint import pprint

# stuff to let it run as stand-alone script
class Shape:
    @staticmethod
    def cast(v):
        if isinstance(v, Shape):
            return v
        assert isinstance(v, int)
        return Shape(v, False)

    def __init__(self, width=1, signed=False):
        self.width = width
        self.signed = signed

    def __repr__(self):
        if self.signed:
            return f"signed({self.width})"
        return f"unsigned({self.width})"

def signed(w):
    return Shape(w, True)

def unsigned(w):
    return Shape(w, False)

def PartitionPoints(pp):
    return pp

# main fn
def layout(elwid, part_counts, lane_shapes):
    if not isinstance(lane_shapes, Mapping):
        lane_shapes = {i: lane_shapes for i in part_counts}
    lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
    signed = lane_shapes[0].signed
    assert all(i.signed == signed for i in lane_shapes.values())
    part_wid = -min(-lane_shapes[i].width // c for i, c in
                    part_counts.items())
    part_count = max(part_counts.values())
    width = part_wid * part_count
    points = {}
    for i, c in part_counts.items():
        def add_p(p):
            points[p] = points.get(p, False) | (elwid == i)
        for start in range(0, part_count, c):
            add_p(start * part_wid)  # start of lane
            add_p(start * part_wid + lane_shapes[i].width)  # start of padding
    points.pop(0, None)
    points.pop(width, None)
    return (PartitionPoints(points), Shape(width, signed), lane_shapes,
            part_wid, part_count)

part_counts = {
    0: 1,
    1: 1,
    2: 2,
    3: 4,
}

for i in range(4):
    pprint((i, layout(i, part_counts, unsigned(3))))

for i in range(4):
    l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
    pprint((i, layout(i, part_counts, l)))
(In reply to Jacob Lifshay from comment #20)
> wip layout demo script:

fantastic, i suspected it would be that short. i'll test it tomorrow. thank
you, really, for trusting me.

the second function "setup" stuff will be a bit longer, but perhaps a
Simulation by doing m = Module() would evade needing to write an
Elaboratable class. the function needs to have as args:

* the dict spec
* a 2-bit Signal() (for now)
* the PartitionPoints that were created from the dict
* a module

it should:

* create an m.Switch on the 2bit signal which
* enumerates the dict key/values
* for each key, "enable" the mask bits that were linked to it in the
  layout() in the PartitionPoints

this function should likewise be ridiculously short.

if layout() needs to adapt/change so that there is a way to *get* the
partition setting on a per-spec-key basis, even as just a simple binary
value that is literally chucked at PP.mask.values(), feel free. so if the
dict spec was

    0b00: (64, 11)
    0b01: (...)
    ...
    ...

then the extra dict returned from layout() which sets mask values would be
something like

    0b00: 0b1111000, # 1s to cut off (empty) partitions
    0b01: 0b0010100

this rather than the actual numerical partition points, because
PP.mask.values() gives you the underlying mask: Cat() them together and
assignment from the dict is possible.

i can never remember which way round PPs specify masks, if it's set to close
or set to open a partition, apologies
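the "second function" described above can be modelled in plain Python like
this. a minimal sketch only: the dict values are invented for illustration,
and the real version would take a Module and build an m.Switch on the 2-bit
elwid Signal, with one m.Case per key assigning the constant to the Cat() of
the PartitionPoints mask bits.

```python
# hypothetical dict spec and the extra mask-value dict that layout()
# would return alongside the PartitionPoints (values invented purely
# for illustration, matching the comment above):
spec = {0b00: (64, 11), 0b01: (32, 8)}
mask_values = {0b00: 0b1111000, 0b01: 0b0010100}

def set_partition_mask(elwid, mask_values):
    """Plain-Python model of the proposed second function: select the
    integer value to drive onto the concatenated PartitionPoints mask
    bits for the current elwid setting.  In nmigen this becomes a
    Switch on the elwid Signal with one Case per dict key."""
    assert elwid in mask_values, "unknown elwidth setting"
    return mask_values[elwid]
```

usage: `set_partition_mask(0b00, mask_values)` picks out the mask constant
for elwidth 0b00, exactly the lookup the Switch/Case statement would
perform in hardware.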
(In reply to Jacob Lifshay from comment #20)

a quick readthrough:

> def layout(elwid, part_counts, lane_shapes):

using the FP example dict from comment #2, i think you might have
misunderstood: the idea is not to create a set of *separate*
PartitionPoints, one per elwidth, the idea is to create a *merged* set of
PartitionPoints (and associated Signal mask)

normally PartitionPoints has to take a mask as input, thus, we need a
function that calculates the length of that mask (see make_partition2 as an
example)

>     if not isinstance(lane_shapes, Mapping):
>         lane_shapes = {i: lane_shapes for i in part_counts}
>     lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
>     signed = lane_shapes[0].signed
>     assert all(i.signed == signed for i in lane_shapes.values())

by removing signed all of these lines can go. this reduces the function by
25%. lane_shapes can become a straight dict containing numbers, Shape can
deal with the signed argument.

>     part_wid = -min(-lane_shapes[i].width // c for i, c in
>                     part_counts.items())
>     part_count = max(part_counts.values())

unfortunately this lets the function determine the length of the underlying
Signal, which is where i think you are getting the mistaken impression that
it is impossible to use PartitionedSignal from. the overall width of the
*full* signal should be passed in as an argument.

>     width = part_wid * part_count

here. width needs to be an input parameter *to* layout(), not a computed
quantity, where part_wid is instead calculated by division. *then* Signals
will end up at uniform predetermined lengths and *then* straight assignment
(wire-only, no muxes) can occur.

if the underlying signal ends up varying by arbitrary lengths determined by
the pieces used, of *course* you can't wire them up safely, because they're
all different total sizes. fixing the length makes the power 2 alignment
sizes all the same, problem solved.
>     points = {}
>     for i, c in part_counts.items():
>         def add_p(p):
>             points[p] = points.get(p, False) | (elwid == i)

this needs to allocate the actual mask Signal, i'm not totally sure how,
with the length not yet being determined, hm.

>         for start in range(0, part_count, c):
>             add_p(start * part_wid)  # start of lane
>             add_p(start * part_wid + lane_shapes[i].width)  # start of padding
>     points.pop(0, None)
>     points.pop(width, None)

ok now the length is known, now a Signal(len(points)) can be created, then a
bit each dropped into the values in sequence order of its keys, end result
just like make_partition2

>     return (PartitionPoints(points), Shape(width, signed), lane_shapes,
>             part_wid, part_count)

Shape() is not needed because width is a given and signed is not layout()'s
responsibility. here also the extra dict needs to be returned which gives
the options, "if you want elwidth=0b00, this is the value you have to set in
the PartitionPoints mask to get that to happen"

> part_counts = {
>     0: 1,
>     1: 1,
>     2: 2,
>     3: 4,
> }

ok yeah, so you went with elwidth => num_of_partitions, cool.

> for i in range(4):
>     l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
>     pprint((i, layout(i, part_counts, l)))

and a straight dict for length of subsignals (excl padding). minus the
signed() it is real dead simple.

the only thing being, as i said, it produces different overall widths
depending on the input lane_shape, which is a showstopper. if overall width
(64 bit) is given as the input it fixes that.

does mean a safety check is needed: elwidth=0b11 means 4 partitions, and 4
partitions of 24 bit obviously won't fit in 64 bit!
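the width-as-input change suggested in the readthrough above can be
sketched as follows. a hedged, simplified model only (hypothetical variant,
not the actual codebase): lane_shapes is a straight dict of bit-lengths as
suggested, signedness is dropped, and PartitionPoints is modelled as a plain
dict mapping each bit-position to the list of elwid values that activate it,
rather than building nmigen (elwid == i) expressions.

```python
def layout_fixed_width(part_counts, lane_shapes, width=64):
    """Sketch of layout() with the overall width passed in as an input
    parameter: part_wid is derived by division instead of computed from
    the lane shapes, so every lane_shapes spec yields the same total
    width, allowing wire-only assignment between Signals."""
    part_count = max(part_counts.values())
    assert width % part_count == 0, "width must divide evenly into parts"
    part_wid = width // part_count
    points = {}
    for i, c in part_counts.items():
        # safety check: a lane spanning c parts must hold lane_shapes[i]
        # bits (e.g. 4 partitions of 24 bits can't fit in 64)
        assert lane_shapes[i] <= c * part_wid, \
            f"lane shape {lane_shapes[i]} too wide at elwid {i}"
        for start in range(0, part_count, c):
            points.setdefault(start * part_wid, []).append(i)  # lane start
            # start of padding within the lane
            points.setdefault(start * part_wid + lane_shapes[i], []).append(i)
    points.pop(0, None)      # the ends of the signal are not
    points.pop(width, None)  # partition points
    return points, part_wid, part_count
```

with the example dicts from the demo script (part_counts {0:1, 1:1, 2:2,
3:4}, lane widths {0:5, 1:6, 2:12, 3:24}) and width=64, part_wid comes out
as 16 regardless of the lane shapes, which is the point of the exercise.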
(In reply to Luke Kenneth Casson Leighton from comment #22)
> the only thing being as i said, it produces different
> overall widths depending on the input lane_shape which
> is a showstopper.
>
> if overall width (64 bit) is given as the input it fixes that.
>
> does mean a safety check is needed, elwidth=0b11 means
> 4 partitions, 4 partitions of 24 bit obviously won't fit
> in 64 bit!

why go through all that when you can just derive the signal width from the
lane shapes -- avoiding wasting extra bits or running out of bits, as well
as avoiding the need to manually specify size? Having the size be computed
doesn't really cause problems imho...
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #20)
>
> a quick readthrough:
>
> > def layout(elwid, part_counts, lane_shapes):
>
> using the FP example dict from comment #2, i think you might have
> misunderstood:
> the idea is not to create a set of *separate* PartitionPoints
> one per elwidth, the idea is to create a *merged* set of PartitionPoints
> (and associated Signal mask)

this is a misunderstanding, which i'm assuming by your later comments you
figured out... you should somehow mark/delete sections of your comment that
you know are incorrect before you click "send", so people don't have to
waste a bunch of time explaining something you already figured out. The code
creates *one* PartitionPoints based on an input elwidth signal/value.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #20)
> > if not isinstance(lane_shapes, Mapping):
> >     lane_shapes = {i: lane_shapes for i in part_counts}
> > lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
> > signed = lane_shapes[0].signed
> > assert all(i.signed == signed for i in lane_shapes.values())
>
> by removing signed all of these lines can go. this reduces the function
> by 25%. lane_shapes can become a straight dict containing
> numbers, Shape can deal with the signed argument.

I think we should keep those lines, since it allows us to completely specify
all aspects of a PartitionedSignal's shape/layout/etc. by giving it one
Python value, lane_shapes. elwid and part_counts are retrieved from a
global, they're the equivalent of SimdPartMode. The idea is that code will
look kinda like this:

    self.elwid = Signal(ElWid)
    with set_global_in_scope(self.elwid, ElWid.PART_COUNTS):
        ...
        self.a = PartitionedSignal(3)  # all lanes are unsigned(3)
        self.b = PartitionedSignal(signed(5))  # all lanes are signed(5)
        # all lanes are whatever shape StateEnum is
        self.state = PartitionedSignal(StateEnum)
        # same shape as self.state, could use .like() or StateEnum directly
        self.state2 = PartitionedSignal(self.state.layout)
        self.mantissa = PartitionedSignal({
            ElWid.F16: 5,  # from memory, probably off
            ElWid.BF16: 8,
            ElWid.F32: 8,
            ElWid.F64: 11,
        })
        ...

In particular, notice that you can use nmigen Shapes without needing to
write a dict everywhere, greatly reducing the modifications needed to our
existing ALUs/FUs/etc.
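the input-normalization being argued over here can be modelled in a few
lines of plain Python. a hedged sketch only (hypothetical helper name, not
the real SimdLayout API): widths are plain ints, whereas the real code would
additionally run each value through nmigen's Shape.cast.

```python
from collections.abc import Mapping

def cast_lane_shapes(lane_shapes, part_counts):
    """Plain-Python model of the lane_shapes normalization under
    discussion: a single width, or a dict of per-lane-kind widths, is
    expanded so that every lane kind in part_counts has an entry."""
    if not isinstance(lane_shapes, Mapping):
        # one value given: every lane kind gets the same shape
        lane_shapes = {i: lane_shapes for i in part_counts}
    # keep only (and require) the lane kinds that part_counts knows about
    return {i: lane_shapes[i] for i in part_counts}
```

usage: `cast_lane_shapes(3, part_counts)` broadcasts a single width to all
lane kinds, while passing a dict through unchanged, which is the
convenience the quoted lines provide.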
(In reply to Jacob Lifshay from comment #25)
>     self.mantissa = PartitionedSignal({
>         ElWid.F16: 5,  # from memory, probably off
>         ElWid.BF16: 8,
>         ElWid.F32: 8,
>         ElWid.F64: 11,
>     })

oops, that was waay off... i wrote mantissa but meant to write exponent
(In reply to Jacob Lifshay from comment #23)
> why go through all that when you can just derive the signal width from the
> lane shapes -- avoiding wasting extra bits or running out of bits, as well
> as avoiding the need to manually specify size? Having the size be computed
> doesn't really cause problems imho...

because the idea is to do a literal global/search/replace "Signal" with
"PartitionedSignal" right throughout the entirety of the ALU codebase. and
the input argument on all current-Signals is not an elwidth, it's an
"overall width".

(actually, have a PSpec which dynamically determines at run/compile time
*which* class to use: Signal or PartitionedSignal).

when i said "there will be NO fundamental changing of the ALU code" i REALLY
meant it. this is a hard critical requirement. we do not have time to waste
on total disruptive redesigns of the existing ALU codebase, tens of
thousands of lines of work and just as many unit tests.

this type of "minimal code-changes" is something i learned the hard way 24
years ago. when transitioning to python programming it became even more
important, because i witnessed massive python projects, 350,000 to a million
lines, literally falling apart because of low-level fundamental changes
where unit test coverage was either inadequate or, in one particular
pretty-much-Corporate-suicidal case, non-existent.

that one was a bundle of fun: the CTO, it turned out, was sub-consciously
but actively sabotaging efforts to stabilise the codebase, because he
himself hadn't been able to stabilise it in 5 years, therefore he didn't see
why anyone else could either... oh, and therefore made *certain* that was
true by changing low-level base classes and committing them to the active
development branch.

bottom line is: i am responsible for ensuring that the codebase is stable,
continuously-operational, and that changes are made with the absolute bare
minimum disruption and in an incremental non-disruptive manner.
if we want a more complex design it *has* to be done in an *evolutionary*
fashion through a chained-series of *SMALL* incremental patches, one after
the other. every time you say "but this feature would be great" i have to do
a MASSIVE re-evaluation of the knock-on implications down that chain, which
i am currently maintaining in my head.

this is why it is critically important that, because you haven't been
helping with the HDL for over 18 months, and aren't familiar with it, to get
back into the flow you help do the task that is needed, which *i* define
[i am quite happy to admit that i am slightly embarrassed to have to put it
like that]. if you had helped with the HDL codebase full-time, when i made
the announcement that TestIssuer was to begin development, you would now
have the working knowledge needed.

(In reply to Jacob Lifshay from comment #24)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > (In reply to Jacob Lifshay from comment #20)
> >
> > a quick readthrough:
> >
> > > def layout(elwid, part_counts, lane_shapes):
> >
> > using the FP example dict from comment #2, i think you might have
> > misunderstood:
> > the idea is not to create a set of *separate* PartitionPoints
> > one per elwidth, the idea is to create a *merged* set of PartitionPoints
> > (and associated Signal mask)
>
> this is a misunderstanding, which i'm assuming by your later comments you
> figured out...

i hadn't.

> The code creates *one* PartitionPoints based on an input elwidth
> signal/value.

the usage for-loop confused me from the examples, when it set that elwidth
to zero width:

    for i in range(4):
        pprint((i, layout(i, part_counts, unsigned(3))))

if i am reading you correctly, that's a zero-width Signal. if the example
had been:

    for i in range(4):
        i = Signal(i)  # Signal of width between 0 and 3
        pprint((i, layout(i, part_counts, unsigned(3))))

i would have understood. maybe. and also wondered why a Signal of width zero
is being passed in.
(In reply to Jacob Lifshay from comment #25)
> I think we should keep those lines, since it allows us to completely specify
> all aspects of a PartitionedSignal's shape/layout/etc.

this defeats the object of the exercise, which is to show you how to do
incremental pragmatic development, and to keep "feet-on-the-ground".

> by giving it one
> Python value, lane_shapes. elwid and part_counts are retrieved from a
> global, they're the equivalent of SimdPartMode. The idea is that code will
> look kinda like this:
> self.elwid = Signal(ElWid)

for practical reasons this is always going to be Elwid=2. i get that it
would be "nice to have" support for other data types... .... but... they...
are... not... critical. if they need to be added, they can be added...
*aaaafterr* the first patch lands which adds support for setting Elwidth=2
as a hard-coded parameter.

that would be a second incremental patch, and that's fine. but it is NOT
fine if it's done even before it's been demonstrated to even be necessary.
understand?

> with set_global_in_scope(self.elwid, ElWid.PART_COUNTS):
>     ...
>     self.a = PartitionedSignal(3)  # all lanes are unsigned(3)
>     self.b = PartitionedSignal(signed(5))  # all lanes are signed(5)
>     # all lanes are whatever shape StateEnum is
>     self.state = PartitionedSignal(StateEnum)

this is the general idea (but self.state *also* hidden behind the
ContextManager). it needs to be more along the lines of:

    with set_global_in_scope(pspec, self.elwid, ElWid.PART_COUNTS) as ps:
        self.a = ps.PSpecContextSensitiveSignalType(3)  # all lanes unsigned(3)

where from pspec the *actual* type of Signal to be deployed is picked up.
that can be **EITHER** a PartitionedSignal (compile-time option 1) **OR** it
can be Signal (compile-time option 2)

hence why the arguments need to be exactly the same, and why the dsl.Module
needs to be exactly the same, and why ast.* need to behave exactly the same.
remember this has 18 months of planning behind it.
there *will* be *zero* code-modifications to turn the entire ALU set to
SIMD. (see below: if that turns out not to be true, we *make* it true. i
suspect there are some places where ALU changes might indeed be needed,
sigh, i'm *hoping* this turns out not to be the case).

>     self.mantissa = PartitionedSignal({
>         ElWid.F16: 5,  # from memory, probably off

:)

>         ElWid.BF16: 8,
>         ElWid.F32: 8,
>         ElWid.F64: 11,

and yes, this could be one of the exceptions, although i am also hoping that
this can be "derived" from the existing (one-line?) dictionary that sets the
*scalar* FP pipeline types. i forget exactly where it is.

> In particular, notice that you can use nmigen Shapes without needing to
> write a dict everywhere, greatly reducing the modifications needed to our
> existing ALUs/FUs/etc.

except that standard nmigen Signal doesn't *support* those advanced
features, neither does the existing codebase use them, therefore it is
completely unnecessary to add this as a feature *which we're not going to
use*: it is a waste of time.

even in the remote possibility that my 18+ month experience of having been
the person involved in the development of every single pipeline was not
correct, invalidating that assessment, there are two possibilities:

1) we re-evaluate, and we re-evaluate *at the time that it is needed*
   (which saves wasting time developing something that's not needed, which
   will only confuse other people: "why is this here when it's not needed
   or used? it's complicated and unnecessary")

2) we make the tough decision to break the rule "no changes to any
   pipeline code" and AVOID the need entirely. (making sure in the process
   that at every single step of those changes the code still operates when
   PSpec "mode==scalar, please use Signal not PartitionedSignal throughout
   the entire ALU codebase" is set)

bottom line: this is all about making the bare minimum code-changes, across
what is already an extremely complex set of inter-connected parts.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> 2) we make the tough decision to break the rule "no changes to any
>    pipeline code" and AVOID the need entirely.

Well, we'll need to break that rule anyway for FP pipelines, which isn't
much of an additional problem since they still all need modification anyway
to add F16/BF16 and to handle flags/exceptions.

Idea: have a XLEN global that is an instance of a new class SimdMap (a
generalization of SimdLayout's dict handling code to arbitrary values per
lane-kind -- in this case arbitrary integers) and override *, +, -, etc. so
it can do arithmetic like is done in the spec pseudo-code, ending up with a
dict wrapped by SimdMap that can be passed into SimdLayout, bit-slicing,
shifting, constants, etc. so code could look like this:

# XLEN global constant definition
XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

# example definition for addg6s, basically directly
# translating pseudo-code to nmigen+simd.
# intentionally not using standard ALU interface, for ease of exposition:
class AddG6s(Elaboratable):
    def __init__(self):
        with simd_scope(self, IntElWid, make_elwid_attr=True):
            self.RA = PartitionedSignal(XLEN)
            self.RB = PartitionedSignal(XLEN)
            self.RT = PartitionedSignal(XLEN)

    def elaborate(self, platform):
        m = Module()
        with simd_scope(self, IntElWid, m=m):
            wide_RA = PartitionedSignal(unsigned(4 + XLEN))
            wide_RB = PartitionedSignal(unsigned(4 + XLEN))
            sum = PartitionedSignal(unsigned(4 + XLEN))
            carries = PartitionedSignal(unsigned(4 + XLEN))
            ones = PartitionedSignal(XLEN)
            nibbles_need_sixes = PartitionedSignal(XLEN)
            z4 = Const(0, 4)
            m.d.comb += [
                wide_RA.eq(Cat(self.RA, z4)),
                wide_RB.eq(Cat(self.RB, z4)),
                sum.eq(wide_RA + wide_RB),
                carries.eq(sum ^ wide_RA ^ wide_RB),
                ones.eq(Repl(Const(1, 4), XLEN // 4)),
                nibbles_need_sixes.eq(~carries[0:XLEN-1] & ones),
                self.RT.eq(nibbles_need_sixes * 6),
            ]
        return m

> > (making sure in the process that at every single step of those
> > changes that
> > the code still operates when PSpec "mode==scalar,
> > please use Signal not PartitionedSignal throughout the entire
> > ALU codebase" is set)

well, when pspec==scalar, all that needs to happen is part_sizes is set to
{FpElWid.F64: 1} or {IntElWid.I64: 1}, then only the 64-bit entry in the
dicts is used and we produce something that should be 100% the same circuit
as scalar (except for some names and module boundaries). no need to have
everything replace itself with Signal.

> bottom line this is all about making the bare minimum code-changes,
> across what is already an extremely complex set of inter-connected
> parts.
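the SimdMap idea proposed above can be sketched minimally in plain Python.
a hedged sketch under stated assumptions: the class name matches the
proposal but the implementation is invented here, and string keys stand in
for the ElWid enum for simplicity. the point is just that arithmetic is
applied elementwise per lane kind, so pseudo-code like `XLEN // 4` or
`4 + XLEN` produces another per-lane-kind dict.

```python
import operator

class SimdMap:
    """Minimal sketch of the proposed SimdMap: wraps a dict mapping
    lane-kind keys to per-lane integer values, with elementwise
    arithmetic so spec pseudo-code translates directly."""
    def __init__(self, values):
        self.values = dict(values)

    def _binop(self, other, op):
        # support SimdMap-op-SimdMap and SimdMap-op-plain-int
        if isinstance(other, SimdMap):
            assert self.values.keys() == other.values.keys()
            return SimdMap({k: op(v, other.values[k])
                            for k, v in self.values.items()})
        return SimdMap({k: op(v, other) for k, v in self.values.items()})

    def __add__(self, other): return self._binop(other, operator.add)
    def __sub__(self, other): return self._binop(other, operator.sub)
    def __mul__(self, other): return self._binop(other, operator.mul)
    def __floordiv__(self, other): return self._binop(other, operator.floordiv)
    # addition/multiplication commute, so int-op-SimdMap works too
    # (needed for expressions like unsigned(4 + XLEN) above)
    __radd__ = __add__
    __rmul__ = __mul__

# string keys stand in for the ElWid enum in this sketch:
XLEN = SimdMap({"I8": 8, "I16": 16, "I32": 32, "I64": 64})
```

with this, `(XLEN // 4).values` gives `{"I8": 2, "I16": 4, "I32": 8,
"I64": 16}`, i.e. the per-elwidth nibble counts used by the addg6s example.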
(In reply to Jacob Lifshay from comment #28)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > 2) we make the tough decision to break the rule "no changes to any
> >    pipeline code" and AVOID the need entirely.
>
> Well, we'll need to break that rule anyway for FP pipelines, which isn't
> much of an additional problem since they still all need modification anyway
> to add F16/BF16

FP16 was already done, it was possible to do full coverage using it, and
found bugs which, through parameterisation, meant that FP32 and FP64 didn't
have them either. BF16 is a one-liner in the dictionary i mentioned (and
can't remember where it is). unit tests are

> and to handle flags/exceptions.

that's orthogonal and a separate series of patches that have no bearing on
this.

> Idea: have a XLEN global that is an instance of a new class SimdMap (a
> generalization of SimdLayout's dict handling code to arbitrary values

jacob: again, that's *another* patch - a seriously big one. it can be
considered.... *AFTER* the completion of the first (minimalist) set of
changes.

the other idea would mean that the ALUs become critically dependent on
PartitionedSignal, which may *or may not* have the full capabilities of
Signal (Array), and has *not* been designed to be a "scalar Signal
replacement". it always creates sub-modules for example. none of those
modules have been tested with zero-length (no) partitioning. if used as a
single-scalar it would result in significantly extra gates and may not
actually work with zero partitions (which is what a scalar is).

that would therefore be extra work and, to remind you for about the fiftieth
time, we are under time and budgetary pressure.

please, please: *focus*. the more time is spent discussing possible future
designs and enhancements instead of doing the work that's needed, the more i
have to cut drastically back on what can be done. we are running out of
time. please....
*focus* on the task that's needed, and listen to the time-saving risk-reducing techniques that took me over 20 years to learn, at considerable cost, in my career.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jacob Lifshay from comment #23)
> > this is a misunderstanding, which i'm assuming by your later comments you
> > figured out...
>
> i hadn't.

oh, you still misunderstand...

> > The code creates *one* PartitionPoints based on an input elwidth
> > signal/value.
>
> the usage for-loop confused me from the examples, when it set that
> elwidth to zero width:
>
> for i in range(4):
>     pprint((i, layout(i, part_counts, unsigned(3))))
>
> if i am reading you correctly, that's a zero-width Signal.

there aren't actually any signals there... if there were a signal, it'd be
like:

    for i in range(4):
        m = Module()
        elwid = Signal(2)
        pp, b, c, d, e = layout(elwid, part_counts, unsigned(3))
        def process():
            yield elwid.eq(i)
            pp = yield from pp.get_values()  # get nmigen to evaluate pp
            pprint((i, (pp, b, c, d, e)))
        sim = Simulator(m)
        sim.add_process(process)
        sim.run()

the elwid input parameter is a nmigen Value, not a Shape. This is what type
annotations would have told you... though it does also need docs... i got it
to work right before our meeting and ran out of time.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jacob Lifshay from comment #23)
> > why go through all that when you can just derive the signal width from the
> > lane shapes -- avoiding wasting extra bits or running out of bits, as well
> > as avoiding the need to manually specify size? Having the size be computed
> > doesn't really cause problems imho...
>
> because the idea is to do a literal global/search/replace "Signal"
> with "PartitionedSignal" right throughout the entirety of the ALU
> codebase.
>
> and the input argument on all current-Signals is not an elwidth,
> it's an "overall width".

I'm 99% sure you're mis-identifying that... the input argument for all
current Signals *is* the bit-width of the current lane (aka. elwidth or
XLEN), except that our code currently is specialized for the specific case
of elwidth=64. This is exactly how all SIMT works (which is exactly what
we're trying to do with transparent vectorization). The types and sizes are
the type/size of a *single* lane, not all-lanes-mushed-together.

> (actually, have a PSpec which dynamically determines at run/compile
> time *which* class to use: Signal or PartitionedSignal).

That's just like how SIMT programs can be run un-vectorized.

> when i said "there will be NO fundamental changing of the ALU code"
> i REALLY meant it.

which is why we need to go all-in on SIMT, since that is the *one
vectorization paradigm* that requires minimal modifications to our ALU code.
(In reply to Jacob Lifshay from comment #30)
> there aren't actually any signals there... if there were a signal, it'd be
> like:
> for i in range(4):
>     m = Module()
>     elwid = Signal(2)
>     pp, b, c, d, e = layout(elwid, part_counts, unsigned(3))
>     def process():
>         yield elwid.eq(i)
>         pp = yield from pp.get_values()  # get nmigen to evaluate pp
>         pprint((i, (pp, b, c, d, e)))
>     sim = Simulator(m)
>     sim.add_process(process)
>     sim.run()

ah *ha*! ok. now i get it. ok, so what i envisaged would be two separate
functions (one which allocates an actual pp mask after determining the
length, followed by a *second* one that would involve a Switch statement on
the elwidths); you came up with an elegant solution that merged the two. the
key function around which that hinges is add_p(). nice.

now. what are the implications of that being a nmigen AST code-fragment
rather than a plain slice on a Signal... i'll have to think that through.

> the elwid input parameter is a nmigen Value,

yep, totally got it. which means if the elwidth is an Enum it still works,
and if it's a 3 bit or 4 bit Signal it *still* works. like it.
(In reply to Jacob Lifshay from comment #31)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > (In reply to Jacob Lifshay from comment #23)
> >
> > > why go through all that when you can just derive the signal width from
> > > the lane shapes -- avoiding wasting extra bits or running out of bits,
> > > as well as avoiding the need to manually specify size? Having the size
> > > be computed doesn't really cause problems imho...
> >
> > because the idea is to do a literal global/search/replace "Signal"
> > with "PartitionedSignal" right throughout the entirety of the ALU
> > codebase.
> >
> > and the input argument on all current-Signals is not an elwidth,
> > it's an "overall width".
>
> I'm 99% sure you're mis-identifying that...

yes i was. i've got it now.

> the input argument for all
> current Signals *is* the bit-width of the current lane (aka. elwidth or
> XLEN) except that our code currently is specialized for the specific case of
> elwidth=64.

yes. in the discussions with Paul and Toshaan i seriously considered an XLEN
parameter in the HDL which would propagate from runtime through a PSpec (see
test_issuer.py for an example) and would allow us to test a scalar 32 bit
Power ISA core, see how many fewer gates are needed. and, just for laughs,
to try an XLEN=16 core. but... time being what it is...

> This is exactly how all SIMT works (which is exactly what we're
> trying to do with transparent vectorization). The types and sizes are the
> type/size of a *single* lane, not all-lanes-mushed-together.

there is a lot of misinformation about SIMT. SIMT is standard cores (which
may or may not have Packed SIMD ALUs) where they are "normal" cores in every
respect *except* that they share one single PC, one single L1 I-cache, one
single fetch and decoder, that *broadcasts* in a synchronous fashion that
one instruction to *all* cores.
the upshot of such a design is that whilst it allows for a much higher
computational gate density (only one L1 cache per dozens of cores) it is
an absolute bitch and a half to program. not only that, but if you want
to run standard general-purpose POSIX applications, you can't: you have
to *disable* all but one of the cores on the broadcast bus, which of
course means absolute rubbish performance.

i am not seeing how such an architecture would help, here, and for the
above *general purpose* performance and compiler-hassle reasons i would
be very much against pursuing this design as a very first processor.

> which is why we need to go all-in on SIMT, since that is the *one
> vectorization paradigm* that requires minimal modifications to our ALU code.

only if it is acceptable to have the punishment that comes with SIMT, and
to then have the exact same XLEN problem unless going with Packed SIMD
ALUs, at which point we are back to the exact same issue (with the added
complication of degraded general performance and a hell of a lot of
compiler work); or, having special separate scalar XLEN=8, XLEN=16 and
XLEN=32 cores which are disabled/idle when not given suitable work, which
is a variant on the entire reason why PartitionedSignal was created in
the first place: to avoid exactly that kind of duplication.
(In reply to Luke Kenneth Casson Leighton from comment #32)
> what are the implications of that being a nmigen AST
> code-fragment rather than a plain slice on a Signal...
> i'll have to think that through.

got it. it wasn't actually whether it's a code-fragment or not at all
(which comes with inherent risks of AST Expression duplication which
yosys can *in no way* properly handle and optimise away), it was the
submodules. example:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/assign.py;h=00cebc6dab33a25f93d4e790dd7883fc859bae9e;hb=HEAD#l88

that Switch, to support elwidths, would become Switch(self.elwidth)

the next problem is: how do you now reduce the number of Case() statements
needed to four, or, more to the point, on doing so, which bits of data do
you copy from which partition, without the prerequisite information?

turns out that *if* you have the binary values (integer values) associated
with the respective elwidth value, that is exactly what is needed. which
is why i asked for it (as the 2nd function).

now, i believe it may be possible to adapt layout() very easily, by making
it a two-stage process.

stage 1: add_p becomes:

    points.setdefault(p, []).append(i)

stage 2: a post-processing phase that goes through each list:

    points[p] = map(lambda i: elwidth == i, points[p])
    points[p] = reduce(operator.or_, points[p])

(or better, use nmutil.treereduce, there)

stage 3: a second postprocess on a copy of points which does a bitwise
shift-and-or:

    points[p] = map(lambda i: 1 << i, points[p])
    points[p] = reduce(operator.or_, points[p])

which i *think* should create the binary values needed which represent
the settings of when elwidth == i

urrr... maybe not. no. drat. it's the order, the bit-position, depending
on the length of p. something more like:

    plist = list(points.keys())
    for i in range(num_elwidths):  # all possible elwidth values
        bitp[i] = 0
        for p, elwidths in points.items():
            if i in elwidths:
                bitpos = plist.index(p)
                bitp[i] |= 1 << bitpos

that's more like it.
that gives a binary value which should match exactly with the PPs that light up should an elwidth be set to any of its four values.
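as a sanity check of the reasoning in comment #34, the final stage can be
sketched in plain Python: given a dict mapping each partition-point
position to the list of elwidth values for which that point is active, it
produces the per-elwidth binary PartPoints mask. the function name and
dict shape here are illustrative, not the actual layout() API.

```python
def elwidth_masks(points):
    """points: {point_position: [elwidths for which that point is active]}
    returns {elwidth: mask}, one bit per partition point (in sorted
    point order), set when that point "lights up" for that elwidth."""
    plist = sorted(points.keys())
    all_elwidths = set()
    for elwidths in points.values():
        all_elwidths.update(elwidths)
    bitp = {}
    for ew in sorted(all_elwidths):
        mask = 0
        for p, elwidths in points.items():
            if ew in elwidths:
                mask |= 1 << plist.index(p)
        bitp[ew] = mask
    return bitp
```

for a 32-bit signal with points at 8/16/24, if elwidth 0 (8-bit lanes)
activates all three points and elwidth 1 (16-bit lanes) only the middle
one, the masks come out as 0b111 and 0b010 respectively.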
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #31)
>
> > This is exactly how all SIMT works (which is exactly what we're
> > trying to do with transparent vectorization). The types and sizes are the
> > type/size of a *single* lane, not all-lanes-mushed-together.
>
> there is a lot of misinformation about SIMT. SIMT is standard cores
> (which may or may not have Packed SIMD ALUs) where they are "normal"
> cores in every respect *except* that they share one single PC,
> one single L1 I-cache, one single fetch and decoder, that *broadcasts*
> in a synchronous fashion that one instruction to *all* cores.

I don't care how SIMT is traditionally implemented in a GPU, that's totally
irrelevant and not what I intended. What I meant is that our HDL would be
written like how SIMT is used from a game programmer's perspective -- where
if a game programmer writes:

    float a, b;
    int c, d;
    ...
    a = a > b ? c : d;

the gpu actually runs (vectorized with 64 lanes):

    f32x64 a, b;
    i32x64 c, d, muxed;
    boolx64 cond;
    ...
    cond = a > b; // lane-wise compare
    muxed = mux(cond, c, d);
    a = convert<f32x64>(muxed);

and if a programmer were to write (generalizing a bit to support dynamic
XLEN):

    xlen_int_t a, b, c, d;
    ...
    a = b + c * d;

what would actually run is:

    vec_with_xlen_lanes_t a, b, c, d;
    ...
    for(lane_t lane : currently_active_lanes()) { // C++11 for-each
        xlen_int_t p = c[lane] * d[lane];
        a[lane] = b[lane] + p;
    }

> > the input argument for all
> > current Signals *is* the bit-width of the current lane (aka. elwidth or
> > XLEN) except that our code currently is specialized for the specific case of
> > elwidth=64.
>
> yes. in the discussions with Paul and Toshaan i seriously considered
> an XLEN parameter in the HDL which would propagate from runtime through
> a PSpec (see test_issuer.py for an example) and would allow us to test
> a scalar 32 bit Power ISA core, see how many less gates are needed.
> and, just for laughs, to try an XLEN=16 core.
>
> but... time being what it is...

If we want any chance of matching the spec pseudo-code, we will need
xlen-ification of our HDL in some form, since currently it's hardwired for
only xlen=64. Why not make it look more like the pseudo-code with
arithmetic on an XLEN constant? I could write the class needed for XLEN in
a day, it's not that complicated:

    class SimdMap:
        def __init__(self, values, *, convert=True):
            if convert:
                # convert values to a dict by letting map do all the hard work
                values = SimdMap.map(lambda v: v, values).values
            self.values = values

        @staticmethod
        def map(f, *args):
            # TODO: fix bad wording, like builtin map, but for
            # SimdMap instead of iterables
            """apply a function `f` to arguments which are one of:
            * Mapping with ElWid-typed keys
            * SimdMap instance
            * or scalar
            return a SimdMap of the results
            """
            retval = {}
            for i in ElWid:
                mapped_args = []
                for arg in args:
                    if isinstance(arg, SimdMap):
                        arg = arg.values[i]
                    elif isinstance(arg, Mapping):
                        arg = arg[i]
                    mapped_args.append(arg)
                retval[i] = f(*mapped_args)
            return SimdMap(retval, convert=False)

        def __add__(self, other):
            return SimdMap.map(operator.add, self, other)

        def __radd__(self, other):
            return SimdMap.map(operator.add, other, self)

        def __sub__(self, other):
            return SimdMap.map(operator.sub, self, other)

        def __rsub__(self, other):
            return SimdMap.map(operator.sub, other, self)

        def __mul__(self, other):
            return SimdMap.map(operator.mul, self, other)

        def __rmul__(self, other):
            return SimdMap.map(operator.mul, other, self)

        def __floordiv__(self, other):
            return SimdMap.map(operator.floordiv, self, other)

        def __rfloordiv__(self, other):
            return SimdMap.map(operator.floordiv, other, self)

        ...

    XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

    # layouts really should be a subclass or wrapper over SimdMap
    # with Shapes as values, but lkcl insisted...
    def layout(elwid, part_counts, lane_shapes):
        lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values
        signed = lane_shapes[ElWid.I64].signed
        # rest unmodified...
        assert all(i.signed == signed for i in lane_shapes.values())
        part_wid = -min(-lane_shapes[i].width // c
                        for i, c in part_counts.items())
        ...

    ...

    # now the following works, because PartitionedSignal uses SimdMap on
    # inputs for shapes, slicing, etc.

    # example definition for addg6s, basically directly
    # translating pseudo-code to nmigen+simd.
    # intentionally not using standard ALU interface, for ease of exposition:
    class AddG6s(Elaboratable):
        def __init__(self):
            with simd_scope(self, IntElWid, make_elwid_attr=True):
                self.RA = PartitionedSignal(XLEN)
                self.RB = PartitionedSignal(XLEN)
                self.RT = PartitionedSignal(XLEN)

        def elaborate(self, platform):
            m = Module()
            with simd_scope(self, IntElWid, m=m):
                wide_RA = PartitionedSignal(unsigned(4 + XLEN))
                wide_RB = PartitionedSignal(unsigned(4 + XLEN))
                sum = PartitionedSignal(unsigned(4 + XLEN))
                carries = PartitionedSignal(unsigned(4 + XLEN))
                ones = PartitionedSignal(XLEN)
                nibbles_need_sixes = PartitionedSignal(XLEN)
                z4 = Const(0, 4)
                m.d.comb += [
                    wide_RA.eq(Cat(self.RA, z4)),
                    wide_RB.eq(Cat(self.RB, z4)),
                    sum.eq(wide_RA + wide_RB),
                    carries.eq(sum ^ wide_RA ^ wide_RB),
                    ones.eq(Repl(Const(1, 4), XLEN // 4)),
                    nibbles_need_sixes.eq(~carries[0:XLEN-1] & ones),
                    self.RT.eq(nibbles_need_sixes * 6),
                ]
            return m
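to make the per-elwidth arithmetic concrete, here is a stripped-down,
runnable sketch of the SimdMap idea above (only the operators needed to
demonstrate XLEN arithmetic; the full class would cover the whole operator
set and Mapping inputs, and the ElWid enum here is a stand-in, not the
real one):

```python
import operator
from enum import Enum

class ElWid(Enum):  # stand-in for the real elwidth enum
    I8 = 8
    I16 = 16
    I32 = 32
    I64 = 64

class SimdMap:
    """a per-elwidth map supporting elementwise arithmetic"""
    def __init__(self, values):
        self.values = dict(values)

    @staticmethod
    def map(f, *args):
        # apply f across every elwidth; plain scalars are broadcast
        def get(a, ew):
            return a.values[ew] if isinstance(a, SimdMap) else a
        return SimdMap({ew: f(*(get(a, ew) for a in args)) for ew in ElWid})

    def __add__(self, other):
        return SimdMap.map(operator.add, self, other)

    def __radd__(self, other):
        return SimdMap.map(operator.add, other, self)

    def __floordiv__(self, other):
        return SimdMap.map(operator.floordiv, self, other)

XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
```

with this, expressions like `4 + XLEN` and `XLEN // 4` from the AddG6s
example evaluate to a distinct value per elwidth: `(4 + XLEN)` gives 20
for ElWid.I16, and `(XLEN // 4)` gives 16 for ElWid.I64.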
(In reply to Luke Kenneth Casson Leighton from comment #34)
> (In reply to Luke Kenneth Casson Leighton from comment #32)
>
> > what are the implications of that being a nmigen AST
> > code-fragment rather than a plain slice on a Signal...
> > i'll have to think that through.
>
> got it.
>
> it wasn't actually whether it's a code-fragment or not at all
> (which comes with inherent risks of AST Expression duplication
> which yosys can *in no way* properly handle and optimise away)

only if you never run `opt`...expression deduplication is a pretty trivial
compiler optimization. In my experience, yosys deduplicates built-in
expressions just fine.

> it was the submodules.

we can deduplicate expressions ourselves pretty easily:

    def deduped(f):
        map = {}
        def wrapper(*args, **kwargs):
            key0 = []
            for arg in args:
                key0.append(id(arg))
            key1 = []
            for k, v in kwargs.items():
                key1.append((id(k), id(v)))
            key = tuple(key0), tuple(key1)
            if key in map:
                return map[key][0]
            retval = f(*args, **kwargs)
            # keep reference to args and kwargs to avoid ids
            # getting reused for something else. we can't use weakref,
            # too many things don't work with it such as `list` and `dict`
            map[key] = retval, args, kwargs
            return retval
        return wrapper

    class PartitionedSignal:
        ...
        @deduped
        def __add__(self, other):
            ...
        @deduped
        def __sub__(self, other):
            ...
        ...

> example:
>
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/
> assign.py;h=00cebc6dab33a25f93d4e790dd7883fc859bae9e;hb=HEAD#l88
>
> that Switch, to support elwidths, would become Switch(self.elwidth)
>
> the next problem is: how do you now reduce the number of Case() statements
> needed to four, or, more to the point, on doing so which bits of data do
> you copy from which partition, without the prerequisite information?
easy, each `part_wid` wide bit slice in any PartitionedSignal can be
independently muxed -- as long as the condition is first converted to the
layout:

    {
        ElWid.I8: 1,
        ElWid.I16: 1,
        ElWid.I32: 1,
        ElWid.I64: 1,
    }

with padding being a sign-extended version of each lane... Aka. the result
of PartitionedSignal.bool(). (or what *should* be the result if design were
left to me).

lanes for elwid=16:

    0         1         2         3         4         5         6         7
    [-u1-pppp][-u1-pppp][-u1-pppp][-u1-pppp]

example value for that layout for elwid==16 and lanes of False, True,
False, True:

    0 1 2 3 4 5 6 7
    0 0 1 1 0 0 1 1

m.Case simply is an If/Elif that uses the results of
PartitionedSignal.matches() instead of .bool().

    a = PartitionedSignal(XLEN - 2)
    b = PartitionedSignal(3)
    with m.If(a):
        with m.Switch(b):
            with m.Case('1-0'):
                m.d.comb += c.eq(d)
    with m.Elif(b == 1):
        m.d.sync += e.eq(f)

compiles to (full trace, with unnecessary assignments omitted):

    a = PartitionedSignal(XLEN - 2)
    b = PartitionedSignal(3)
    # If.enter
    if_cond = cond_stack[-1] & a.bool()
    cond_stack.append(if_cond)
    # If body
    # Switch.enter
    switch_value_stack.append(b)
    # Switch body
    # Case.enter
    case_cond = cond_stack[-1] & switch_value_stack[-1].matches('1-0')
    cond_stack.append(case_cond)
    # Case body
    # assign c.eq(d)
    rhs = d.cast_to(c.layout)
    for i in range(part_count):
        with m.If(cond_stack[-1].sig[i]):  # cond_stack all have part_wid == 1
            # slice computation should just be c.layout.part_slice(i)
            start = c.layout.part_wid * i
            s = slice(start, start + c.layout.part_wid)
            m.d.comb += c.sig[s].eq(rhs.sig[s])
    # Case.exit
    cond_stack.pop()
    # Switch.exit
    switch_value_stack.pop()
    # If.exit
    else_cond = ~cond_stack.pop()
    else_cond &= cond_stack[-1]
    # Elif.enter
    if_cond = else_cond & (b == 1).bool()
    cond_stack.append(if_cond)
    # Elif body
    # assign e.eq(f)
    rhs = f.cast_to(e.layout)
    for i in range(part_count):
        with m.If(cond_stack[-1].sig[i]):  # cond_stack all have part_wid == 1
            # slice computation should just be e.layout.part_slice(i)
            start = e.layout.part_wid * i
            s = slice(start, start + e.layout.part_wid)
            m.d.sync += e.sig[s].eq(rhs.sig[s])
    # Elif.exit
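the `deduped` decorator from comment #36 is self-contained enough to
exercise on its own. a sketch with a toy expression class (a trivial
`Expr` standing in for PartitionedSignal) shows the identity-based
caching returning the *same* object when the same expression is built
twice:

```python
def deduped(f):
    """cache results keyed on argument identity, so building the
    same expression twice returns the identical object"""
    cache = {}
    def wrapper(*args, **kwargs):
        key = (tuple(id(a) for a in args),
               tuple((k, id(v)) for k, v in sorted(kwargs.items())))
        if key in cache:
            return cache[key][0]
        retval = f(*args, **kwargs)
        # keep references to args/kwargs so their ids can't be recycled
        # for different objects while the cache entry is alive
        cache[key] = retval, args, kwargs
        return retval
    return wrapper

class Expr:  # toy stand-in for PartitionedSignal
    @deduped
    def __add__(self, other):
        return ("add", self, other)  # pretend this builds an AST node
```

usage: given `a = Expr(); b = Expr()`, evaluating `a + b` twice yields
the identical object (`is`-equal), while `a + Expr()` with a fresh
operand builds a new node.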
(In reply to Jacob Lifshay from comment #36) > # assign e.eq(f) > rhs = f.cast_to(e.layout) > for i in range(part_count): > with m.If(cond_stack[-1].sig[i]): # cond_stack all have part_wid == 1 > # slice computation should just be e.layout.part_slice(i) > start = e.layout.part_wid * i > s = slice(start, start + e.layout.part_wid) > m.d.sync += e.sig[s].eq(rhs.sig[s]) > # Elif.exit This works because, for each lane that is enabled, all parts are assigned due to all bits in the condition for that lane being identical due to being a sign-extended bool. padding being assigned along with their corresponding lanes isn't a problem.
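the part-wise assignment described in the two comments above can be
modelled with plain ints, one condition bit per `part_wid`-wide part
(the function name is hypothetical; ints here simulate the signals):

```python
def masked_part_assign(lhs, rhs, cond_bits, part_wid, part_count):
    """model of the per-part conditional assignment: each part of lhs is
    replaced by the matching part of rhs when its condition bit is 1.
    because a sign-extended lane condition sets *all* part-bits of an
    enabled lane, whole lanes (padding included) get assigned together."""
    result = lhs
    for i in range(part_count):
        if (cond_bits >> i) & 1:
            # mask selecting part i (bits i*part_wid .. (i+1)*part_wid-1)
            mask = ((1 << part_wid) - 1) << (i * part_wid)
            result = (result & ~mask) | (rhs & mask)
    return result
```

for elwid=16 lanes of False, True, False, True, the condition value from
comment #36 sets parts 2, 3, 6 and 7 (0b11001100), so both parts of each
enabled 16-bit lane are copied as a unit.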
(In reply to Jacob Lifshay from comment #35)
> I don't care how SIMT is traditionally implemented in a GPU, that's totally
> irrelevant and not what I intended.

yeah i understood that. i think you likely meant "Packed SIMD"

> What I meant is that our HDL would be
> written like how SIMT is used from a game programmer's perspective

ok. yes. this is more like SIMD programming.

> what would actually run is:
>     vec_with_xlen_lanes_t a, b, c, d;
>     ...
>     for(lane_t lane : currently_active_lanes()) { // C++11 for-each
>         xlen_int_t p = c[lane] * d[lane];
>         a[lane] = b[lane] + p;
>     }

yep. definitely SIMD (at the backend), which is what we're discussing:
the low-level hardware which is behind "lanes".

> If we want any chance of matching the spec pseudo-code, we will need
> xlen-ification of our HDL in some form, since currently it's hardwired for
> only xlen=64.

right. ok. so here, temporarily (i quite like the self.XLEN idea but it
is a lot of work to deploy), to *not* have to do that deployment
immediately, the trick that can be played is: set the SIMD ALU width to
*exactly* the same width as the current scalar width, i.e. 64. ta-daa,
problem goes away.

is it a cheat? yes. will it work? yes. will it save time? yes. will it
minimise code-changes? yes. will it do absolutely everything? mmmm
probably not, but that can be evaluated case by case (incremental patches).

in other words: by defining the SIMD ALU width exactly equal to the
current scalar ALU width, this makes what you would like to see (and
erroneously called SIMT) exactly what i have been planning for 18 months.

originally, if you recall (going back 2+ years), the idea was to *split*
the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic.
but that was actually more about the regfiles, a HI32 regfile and a LO32
regfile. this melted my brain and for practical time reasons it's all
64 bit.

> Why not make it look more like the pseudo-code with arithmetic on an XLEN
> constant?
> I could write the class needed for XLEN in a day, it's not that
> complicated:

yehyeh. my feeling is, this class should derive from Shape. in fact it
may have to (i looked at ast.py and there are assumptions that the first
argument of e.g. Signal is a Shape() or will be converted to one).

by deriving *from* Shape() a self.XLEN can do the division etc. the
self.XLEN//2 etc. that we saw were needed in the pseudocode will almost
certainly also be needed in the HDL as well, but they would create or
carry an appropriate elwidth set as well.

the calculation in each of the operator overloads then is based on
self.width which comes from Shape.width.

i am not entirely certain of all the details here so kinda would like to
defer it as long as needed, and go with an "on-demand" approach here.

> XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

i like it: you are however forgetting that the ALU width also has to be
one of the parameters. this is a hard requirement. i explained it in
terms of the layout() function returning different widths based on its
elwidth parameters, but you haven't yet taken on board the significance
of the (long) chain-of-logic that goes back to PartitionedSignal
internals.

if the width varies it creates a massive cost in gate terms by having to
Mux e.g. 22 bit PartitionedSignals onto 64 bit PartitionedSignals. if
the width does not vary then those are straight wires. yes there will be
blank unused partitions but i worked out how to deal with that.

by analysing the dict of elwidth->binary mask values and working out
those unused partition pieces, you can actually get all of the submodules
behind the various operators to not even bother allocating gates for
that piece. example:

* PartitionedEq works by creating a set of eqs per piece then *combining*
  those pieces together depending on the PPoints
* if we KNOW that some of those pieces will never be used then just don't
  do the sub-eqs for those pieces. ta-daa, fewer gates all round.
but for this to work the layout() function *has* to have the width
parameter as input, hence the assertions, and that's perfectly fine. we
have to prioritise gate count, here, not programmer convenience.

> # layouts really should be a subclass or wrapper over SimdMap
> # with Shapes as values, but lkcl insisted...
> def layout(elwid, part_counts, lane_shapes):
>     lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values

still doesn't have width, which is a hard requirement.

> # now the following works, because PartitionedSignal uses SimdMap on
> # inputs for shapes, slicing, etc.
>
> # example definition for addg6s, basically directly
> # translating pseudo-code to nmigen+simd.
> # intentionally not using standard ALU interface, for ease of exposition:
> class AddG6s(Elaboratable):
>     def __init__(self):
>         with simd_scope(self, IntElWid, make_elwid_attr=True):
>             self.RA = PartitionedSignal(XLEN)
>             self.RB = PartitionedSignal(XLEN)
>             self.RT = PartitionedSignal(XLEN)

it will have to be

    with simdscope(self.pspec, m, self, ....) as ss:
        self.RA = ss.Signal(pspec.XLEN)

which allows simdscope to "pick up" the dynamic runtime compile switch
between scalar and simd, but yes.

my only concern is that even flipping all the HDL over to substitute XLEN
for 64 throughout all 12 pipelines is a big frickin job, at least 10-14
days. it's not something to be taken lightly, and *nothing* else can take
place during that time.

this is really important. once committed to big regular patterned changes
like that, they *have* to be seen through to the end and VERY strictly
under no circumstances mixed with any other code changes. the individual
pipeline tests as well as test_issuer will all have to be kept up to date.

hence why it is a lot of work: we are talking *TWENTY FIVE OR MORE* unit
tests, 12 pipelines, and 5 to 8 ancillary files associated with the core
and with regfiles.
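the "don't allocate gates for pieces that can never be used" idea can be
sketched from the elwidth->mask dict: a partition point that is set in
*every* elwidth's mask never combines its neighbouring pieces (so the
combiner can be dropped), and one that is clear in every mask always
combines them (so the per-point mux can be dropped). an illustrative
helper, not the real submodule API:

```python
def constant_points(masks, num_points):
    """given {elwidth: partition-point mask}, return (always_set,
    always_clear): the point indices that are identical across all
    elwidth settings and so need no dynamic gating at all."""
    always_set = (1 << num_points) - 1
    always_clear = (1 << num_points) - 1
    for mask in masks.values():
        always_set &= mask        # stays 1 only if set in every mask
        always_clear &= ~mask     # stays 1 only if clear in every mask
    set_idx = {i for i in range(num_points) if (always_set >> i) & 1}
    clear_idx = {i for i in range(num_points) if (always_clear >> i) & 1}
    return set_idx, clear_idx
```

for example with masks {0: 0b110, 1: 0b010} over three points, point 1
is always partitioned and point 0 never is; only point 2 needs a real
runtime mux.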
jacob i am catching up, i am recovering from chronic pain yesterday and had to do the OpenPOWER Course video today.
(In reply to Jacob Lifshay from comment #36)
> only if you never run `opt`...expression deduplication is a pretty trivial
> compiler optimization. In my experience, yosys deduplicates built-in
> expressions just fine.

not on TestIssuer or other complex designs it doesn't

> > it was the submodules.
>
> we can deduplicate expressions ourselves pretty easily:

ooo niiiice, that's really valuable and exciting. i bet a similar trick
would be possible to deploy for autocreating intermediary temporary
Signals. can you drop this into nmutil?

> > the next problem is: how do you now reduce the number of Case() statements
> > needed to four, or, more to the point, on doing so which bits of data do
> > you copy from which partition, without the prerequisite information?
>
> easy,

it was a rhetorical-ish question as part of the reasoning chain, but hey :)
let's go through it.

> each `part_wid` wide bit slice in any PartitionedSignal can be
> independently muxed -- as long as the condition is first converted to the
> layout:
>     {
>         ElWid.I8: 1,
>         ElWid.I16: 1,
>         ElWid.I32: 1,
>         ElWid.I64: 1,
>     }
> with padding being a sign-extended version of each lane... Aka. the result
> of PartitionedSignal.bool(). (or what *should* be the result if design were
> left to me).

ah. ok, so with so much code already having been written, and it taking 5
months, and you weren't contributing at the time, we have to be pragmatic
and fit with the existing design (and, it's simply good engineering
practice to finish something first rather than discard it 90% of the way
to completion).

redesigning the entire suite of submodules (12 and counting) to fit
elwidth is both unrealistic and undesirable, but there *is* a way to
retrofit by having an adapter - layout() - create the values that the
PartPoints mask *should* be set to for any given elwidth. in this way we
do *not* have to discard 5 months of work.
the example you gave, i apologise i couldn't follow it, it appears to be a different design paradigm which is not the focus of the discussion. tomorrow i will drop the layout() function into the mix and i think i have enough of a handle on things to convert to elwidth. however because it would involve a massive duplication of the test suite, or a disruptive alteration of the unit tests, i'll do it as a separate class that inherits from PartitionedSignal, and think about how to go from there
(In reply to Jacob Lifshay from comment #37) > This works because, for each lane that is enabled, all parts are assigned > due to all bits in the condition for that lane being identical due to being > a sign-extended bool. padding being assigned along with their corresponding > lanes isn't a problem. as a general rule of software engineering, "elegant" tricks either need to be explicitly documented, or simply avoided altogether. the reason is real simple: they're unmaintainable by the average engineer, who doesn't dare touch it for fear of breaking something.
(In reply to Luke Kenneth Casson Leighton from comment #38) > (In reply to Jacob Lifshay from comment #35) > > > I don't care how SIMT is traditionally implemented in a GPU, that's totally > > irrelevant and not what I intended. > > yeah i understood that. i think you likely meant "Packed SIMD" no, I meant SIMT -- explicitly where the programmer writes scalar code and the code is auto-vectorized to a GPU-driver-chosen lane count behind-the-scenes, where there's one instance of the scalar code for each vector lane -- like posix threads in that you can have one instance of the scalar code per thread, and they all run in parallel. In our case, that lane count is determined by the scoped globals. "Packed SIMD" the programmer still has to choose explicit SIMD register widths, explicit masking/predication, explicit SIMD types, etc. > > What I meant is that our HDL would be > > written like how SIMT is used from a game programmer's perspective > > ok. yes. this is more like SIMD programming. not really, it's just somewhat confusing because game programmers often use explicit short vectors too -- kinda orthogonally to the SIMT vectors. > > > what would actually run is: > > vec_with_xlen_lanes_t a, b, c, d; > > ... > > for(lane_t lane : currently_active_lanes()) { // C++11 for-each > > xlen_int_t p = c[lane] * d[lane]; > > a[lane] = b[lane] + p; > > } > > yep. definitely SIMD (at the backend) which is what we're discussing: > the low-level hardware which is behind "lanes". yes, but critically, the vectorization into SIMD step can be skipped (either by setting lane counts to always be one (by setting part_counts={ElWid.I64: 1}), or by bypassing our SIMD operators completely and generating nmigen operators instead) and you end up with the original scalar code where, hey, it uses XLEN everywhere...just tell it to just use the component of XLEN where elwid=64. 
This allows us to convert everything one module at a time, since SIMT-ified code can easily be used by non-SIMT code by having the globals indicate elwid=64 and part_counts={ElWid.I64: 1}. > > > > If we want any chance of matching the spec pseudo-code, we will need > > xlen-ification of our HDL in some form, since currently it's hardwired for > > only xlen=64. > > right. ok. so here, temporarily (i quite like the self.XLEN idea but it > is a lot of work to deploy), to *not* have to do that deployment immediately, > the trick that can be played is: > > set the SIMD ALU width to *exactly* the same width as the current scalar > width i.e. 64. so, basically you want all code to magically replace 64 with XLEN, and you try to justify that by saying 64 is the SIMD width... I say, that justification is additional unnecessary complication and mental burden, if we take that approach we should still have 64-replaced-with-XLEN be the *lane* width, not the SIMD width. I am taking SIMT seriously and we should write our code such that the SIMD width should essentially not affect how any of our ALU HDL is written. If it turns out that the circuits are smaller if we choose a fixed SIMD width, then that can be handled by the layout algorithm (where that kind of decision belongs) and passed in through another scoped global, *not* by conflating lane-width=64 with SIMD-width=64. That allows us to: 1. set simd-width=128 if we decide 64-bit is too small. 2. set simd-width=32 and limit to max elwid of 32 for a small embedded cpu. 3. set simd-width to some more complex formula based on requested lane sizes so we don't outright fail where our code needs lanes with more than 64-bits, such as int/fp multiplication, or addg6s. I think the magic replace 64 with XLEN *could* work, but just cuz it can work doesn't mean it's a good idea, it makes our code much harder to understand, and it causes issues where we *really need* lane size = 64 not XLEN (such as in the bcd<->dpd instructions). 
Textually replacing 64 with XLEN when we replace Signal with PartitionedSignal isn't much harder, just one more search-replace-and-check-all-replacements -- that works just as well as the magically replace 64 could possibly work, and it leaves the code waay more understandable. > will it save time? yes. yes, but not much. > will it minimise code-changes? yes. yes, but imho code understandability is faar more important than tiny git diffs. think of it as just a refactoring, big diffs don't mean it's that hard to understand. > will it do absolutely everything? mmmm probably not but that > can be evaluated case by case (incremental patches) > > in other words: by defining the SIMD ALU width exactly equal to the > current scalar ALU width this makes what you would like to see (and > called erroneously SIMT) exactly what i have been planning for 18 months. > > originally if you recall (going back 2+ years) the idea was to *split* > the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic. > but that was actually more about the regfiles, a HI32 regfile and a LO32 > regfile. > > this melted my brain and for practical time reasons it's all 64 bit. > > > > Why not make it look more like the pseudo-code with arithmetic on an XLEN > > constant? I could write the class needed for XLEN in a day, it's not that > > complicated: > > yehyeh. my feeling is, this class should derive from Shape. in fact it > may have to (i looked at ast.py and there are assumptions that the > first argument of e.g. Signal is a Shape() or will be converted to one. I think it should *specifically not* derive from Shape, because it's more general than just Shapes -- it can have values that are: * ints -- how it's used in XLEN * patterns -- for .matches() and Case() * Values -- for when we're converting from scalars to SIMD and which scalar we use depends on elwid. * Shapes -- input to layout() * slices -- bit slicing * more > by deriving *from* Shape() a self.XLEN can do the division etc. 
that has nothing to do with deriving from Shape. > self.XLEN//2 > etc that we saw were needed in the pseudocode, they will almost certainly > also be needed in the HDL as well, but they would create or carry an > appropriate > elwidth set as well. Well, how i envisioned it is SimdMap has a separate value for each possible elwid, allowing us to calculate a different value for each possible elwid -- implemented internally using a dict, hence the Map in SimdMap. The way you're thinking of likely won't actually work since elwid is a Signal, and you can't use a Signal as a Shape. If you base it on .width, then you've locked yourself into only being able to calculate the value for one specific elwid value, the one that matches .width. > > the calculation in each of the operator overloads then is based on self.width > which comes from Shape.width. > > i am not entirely certain of all the details here so kinda would like to > defer it as long as needed, and go with an "on-demand" approach here. > > > XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64}) > > i like it: you are however forgetting that the ALU width also has to be one > of the parameters. that's passed in via the scope globals, not via a SimdMap. > this is a hard requirement. i can see how varying SIMD-width could need more gates, i think that should be handled in layout via scope globals if necessary, not via messing up SimdMap. SimdMap is purely a map from elwidths to Python values. The type that *would* have SIMD-width, is SimdLayout, which layout() should be morphed into the __init__ for. SimdLayout could be a Shape, but I think it's likely better to have a .simd_shape() or just .shape() method... not deriving from Shape avoids possible confusion between the full-SIMD-width and the lane width. There is a helper dataclass: SimdLane. it contains an ElWid enumerant and also a lane index or part index or something. 
The idea is that a SimdLane instance specifies a particular lane in a SimdSignal/SimdLayout. SimdLayout would have fields, properties, or whatever, for all of: * lane_shapes -- either a dict or a SimdMap, tbd. * padding_kind -- is the padding a sign-extended version of the corresponding lane? is it zeros? ones? Repl(lane_bits, ...)? other? Defaults to other. * bits_per_part -- each lane is made of 1 or more parts (determined by part_counts), all parts are always the same bit-width. * signed -- signedness of all lanes * simd_shape -- could always be unsigned, since signedness doesn't really matter here. * part_counts -- number of parts per lane for each elwid, determines which elwid are supported. This, along with all the other __init__ parameters (excluding lane_shapes and padding_kind) should really be tossed into one globals object (it's SimdPartMode in simd_signal) that is the scoped global's value. * other parameters that determine layout. methods: * get_lane_start(lane) -- returns the starting bit index of `lane` in the SIMD Signal. lane is a SimdLane. * get_lane_width(lane, include_padding=False) -- returns the bit width of `lane` in the SIMD Signal. lane is a SimdLane. * get_lane_slice(lane, include_padding=False) -- returns a slice() for the bits of `lane` in the SIMD Signal. lane is a SimdLane. * lanes() -- gets all `SimdLane`s for this layout. * lanes_for_elwidth(elwid) -- gets all `SimdLane`s for this layout for a particular ElWid enumerant. * elwidths() -- gets all ElWid values supported by this layout ... essentially part_counts.keys() * more... > > # example definition for addg6s, basically directly > > # translating pseudo-code to nmigen+simd. 
> > # intentionally not using standard ALU interface, for ease of exposition: > > class AddG6s(Elaboratable): > > def __init__(self): > > with simd_scope(self, IntElWid, make_elwid_attr=True): > > self.RA = PartitionedSignal(XLEN) > > self.RB = PartitionedSignal(XLEN) > > self.RT = PartitionedSignal(XLEN) > > it will have to be > > with simdscope(self.pspec, m, self, ....) as ss: > self.RA = ss.Signal(pspec.XLEN) ooh, lookie, simdscope gets self as an argument ... it can read self.pspec itself! XLEN should be a global constant imho. > which allows simdscope to "pick up" the dynamic runtime compile switch > between scalar and simd, but yes. > > my only concern is that even flipping all the HDL over to substitute XLEN > for 64 throughout all 12 pipelines is a big frickin job, at least 10-14 days. it happens at the same time as our replacing Signal with PartitionedSignal, so it won't be much more work, imho. > it's not something to be taken lightly, true. > and *nothing* else can take place > during that time. false...luckily, if the globals set part_counts={ElWid.I64: 1}, then our modules should be nearly 100% compatible with not-yet-SIMT-ified modules, meaning that we can do it one module at a time. > the individual pipeline tests as well as test_issuer will all have to be > kept uptodate. > > hence why it is a lot of work because we are talking *TWENTY FIVE OR MORE* > unit tests, 12 pipelines, and 5 to 8 ancillary files associated with the > core and with regfiles. yes, I've done similarly-sized refactors on my own code before. they took several days.
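the lane-slice arithmetic from the SimdLayout method list above reduces
to a small formula once every part is `bits_per_part` wide and
part_counts gives the parts-per-lane per elwid. a sketch of the
proposed-but-unwritten API (names follow the discussion, nothing here is
implemented yet):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimdLane:
    """identifies one lane: which elwid setting, and the lane's index"""
    elwid: int
    lane_index: int

def get_lane_slice(lane, part_counts, bits_per_part):
    """bit-slice of `lane` in the SIMD signal, padding included,
    assuming each lane at a given elwid spans
    part_counts[elwid] * bits_per_part contiguous bits"""
    width = part_counts[lane.elwid] * bits_per_part
    start = lane.lane_index * width
    return slice(start, start + width)
```

e.g. with bits_per_part=8 and part_counts={16: 2} (the 16-bit elwid uses
two parts per lane), lane 1 occupies bits 16..31; get_lane_start and
get_lane_width fall out as the slice's start and length.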
(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #36)
>
> > only if you never run `opt`...expression deduplication is a pretty trivial
> > compiler optimization. In my experience, yosys deduplicates built-in
> > expressions just fine.
>
> not on TestIssuer or other complex designs it doesn't

hmm, it probably only works within each module, not across modules.

> > > it was the submodules.
> >
> > we can deduplicate expressions ourselves pretty easily:
>
> ooo niiiice, that's really valuable and exciting.
>
> i bet a similar trick would be possible to deploy for autocreating
> intermediary temporary Signals.
>
> can you drop this into nmutil?

sure! Python's GC just had a heart-attack at the sight of it, but oh well. Good thing we don't have long-running programs that need PartitionedSignal.

> > > the next problem is: how do you now reduce the number of Case() statements
> > > needed to four, or, more to the point, on doing so which bits of data do
> > > you copy from which partition, without the prerequisite information?
> >
> > easy,
>
> it was a rhetorical-ish question as part of the reasoning chain, but hey :)
> let's go through it.
>
> > each `part_wid` wide bit slice in any PartitionedSignal can be
> > independently muxed -- as long as the condition is first converted to the
> > layout:
> > {
> >     ElWid.I8: 1,
> >     ElWid.I16: 1,
> >     ElWid.I32: 1,
> >     ElWid.I64: 1,
> > }
> > with padding being a sign-extended version of each lane... Aka. the result
> > of PartitionedSignal.bool(). (or what *should* be the result if design were
> > left to me).
>
> ah.
> ok, so with so much code already having been written, and it taking 5
> months, and you weren't contributing at the time, we have to be pragmatic
> and fit with the existing design (and, it's simply good engineering
> practice to finish something first rather than discard it 90% of the
> way to completion)

Luckily, the exact bit layout resulting from bool matches the layout I wanted; all that needs to happen is to declare all but each lane's lsb as sign-extended padding bits by replacing the result's layout. no additional gates needed.

> redesigning the entire suite of submodules (12 and counting) to fit elwidth
> is both unrealistic and undesirable, but there *is* a way to retrofit by
> having an adapter - layout() - create the values that the PartPoints
> mask *should* be set to for any given elwidth.

oh, yeah, part_points is another member needed for SimdLayout. oops, forgot that.

> in this way we do *not* have to discard 5 months of work.
>
> the example you gave, i apologise i couldn't follow it, it appears
> to be a different design paradigm which is not the focus of the
> discussion.

oh, it is how to handle m.If/Switch/Case with part-wise assignment. it uses the approach of translating all but the final assignment to just data-flow ahead-of-time, letting yosys handle the final assignment translation.

> tomorrow i will drop the layout() function into the mix and i think
> i have enough of a handle on things to convert to elwidth.

oh, i can do that...i'll convert PartitionedSignal + unit tests to use SimdLayout. I'd guess it'll take me the rest of today. I'll be busy all weekend, so hopefully that won't be too much of a problem...
> however because it would involve a massive duplication of the
> test suite, or a disruptive alteration of the unit tests, i'll do it
> as a separate class that inherits from PartitionedSignal, and
> think about how to go from there

iirc nothing actually uses PartitionedSignal yet, so an API break to using SimdLayout shouldn't be as big a problem.
(In reply to Jacob Lifshay from comment #43) > (In reply to Luke Kenneth Casson Leighton from comment #40) > > tomorrow i will drop the layout() function into the mix and i think > > i have enough of a handle on things to convert to elwidth. > > oh, i can do that...i'll convert PartitionedSignal + unit tests to use > SimdLayout. I'd guess it'll take me the rest of today. I'll be busy all > weekend, so hopefully that won't be too much of a problem... I'll push it to a separate branch to avoid breaking master if I don't finish before the weekend and to allow more review.
(In reply to Jacob Lifshay from comment #43)

> oh, i can do that...i'll convert PartitionedSignal + unit tests to use
> SimdLayout.

whoawhoa hang on, only do that if you're happy that SimdLayout should derive from ast.Shape, that layout() should take a dict *only* (not have signed/unsigned because, obviously, by deriving from ast.Shape() SimdLayout *already has* a signed member), and *only* if you can guarantee to have it 100% completed within one day.

the reason is because if not i have to completely stop all ongoing work on PartitionedSignal because it becomes too much to merge new submodules into the altered code. (i am currently investigating how to add __getattr__, Part, and Slice, all of which are interrelated)

welcome to planning on large complex projects, sigh.
(In reply to Jacob Lifshay from comment #44)

> I'll push it to a separate branch to avoid breaking master if I don't finish
> before the weekend and to allow more review.

can i suggest first please do an ElwidthPartitionedSignal that derives from PartitionedSignal, and use the layout() function as an adapter between the two. perhaps write a simple *small* proof-of-concept / demo unit test to confirm functionality

this will be a hell of a lot less disruptive to ongoing work on PartitionedSignal and will allow some scheduling shenanigans to take place
hit send too soon, doh. from comment #34 the function at the end effectively does this (actually this can be its unit test):

    ppoints, pdict = layout(64, elwidth, ...)
    for ew in range(4):
        yield elwidth.eq(ew)
        maskvals = yield Cat(ppoints.mask.values())
        assert maskvals == pdict[ew]

this is quite crucial
(In reply to Jacob Lifshay from comment #43)

> oh, it is how to handle m.If/Switch/Case with part-wise assignment. it uses
> the approach of translating all but the final assignment to just data-flow
> ahead-of-time, letting yosys handle the final assignment translation.

ok i suspected you might have jumped context, thank you for confirming. i was referring to an internal use of a Type 2 (dsl.Module) Switch for a submodule, *not* to the implementation *of* PartitionedSignal.__Switch__

the strategy there for PartitionedSignal.__Switch__ is, well, quite mindbending, and also out of scope for this bugreport. for completeness however here is the strategy, which should be discussed there, not here https://bugs.libre-soc.org/show_bug.cgi?id=458#c10

> iirc nothing actually uses PartitionedSignal yet, so a API break to using
> SimdLayout shouldn't be as big a problem.

except that there may be good reasons why other people (other customers whom i cannot reveal or discuss because it is commercially confidential) might wish to use PartitionedSignal *as it is*, i.e. who may not wish to use Elwidths, or who may wish to define their own adapter or class on *top* of PartitionedSignal, or who may wish to have *all* the mask points available, not the subset as dictated by Elwidths.

basically i am saying it is too early to go destroying the existing PartitionedSignal API and wholesale converting all of the unit tests to ElwidthPartitionedSignal. one *small* step at a time.
(In reply to Jacob Lifshay from comment #43)
> (In reply to Luke Kenneth Casson Leighton from comment #40)
> > (In reply to Jacob Lifshay from comment #36)
> > > we can deduplicate expressions ourselves pretty easily:
> >
> > ooo niiiice, that's really valuable and exciting.
> >
> > i bet a similar trick would be possible to deploy for autocreating
> > intermediary temporary Signals.
> >
> > can you drop this into nmutil?
>
> sure! Python's GC just had a heart-attack at the sight of it, but oh well.
> Good thing we don't have long-running programs that need PartitionedSignal.

Added: https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55

I also added support for having globals go into the key, since we need to include the scoped globals there:

    my_global = 42

    @deduped(global_keys=[lambda: my_global])
    def fn_with_global(a, *, b=1):
        ...

so we could use it like:

    class PartitionedSignal(...):
        ...
        @deduped(global_keys=[get_simd_scope_values])
        def __add__(self, rhs):
            ...
        ...

though to improve deduping, we may want to do:

    class PartitionedSignal(...):
        ...
        @staticmethod
        @deduped(global_keys=[get_simd_scope_values])
        def _do_add_impl(*args):
            # put __add__ and __radd__'s guts here

        @staticmethod
        def _do_add(*args):
            args = list(args)
            # consistent argument order to handle commutativity
            args.sort(key=id)
            return PartitionedSignal._do_add_impl(*args)

        def __add__(self, other):
            return PartitionedSignal._do_add(self, other)

        def __radd__(self, other):
            return PartitionedSignal._do_add(other, self)
        ...

I also added support for opportunistically using weakref.
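To make the "globals go into the key" idea concrete, here is a minimal memoisation sketch of that mechanism. This is *not* the actual nmutil `@deduped` implementation (which also handles weakrefs and more); it only demonstrates how `global_keys` callables fold scoped-global values into the cache key.

```python
# Minimal sketch of the idea behind @deduped(global_keys=...), NOT the
# real nmutil implementation: the cache key includes the current values
# returned by each global_keys callable, so results are only shared
# within one scoped-globals setting.

def deduped(global_keys=()):
    def decorator(fn):
        cache = {}
        def wrapper(*args, **kwargs):
            # args/kwargs plus a snapshot of the scoped globals form the key
            key = (args, tuple(sorted(kwargs.items())),
                   tuple(k() for k in global_keys))
            if key not in cache:
                cache[key] = fn(*args, **kwargs)
            return cache[key]
        return wrapper
    return decorator


my_global = 42
calls = []  # record actual (non-cached) invocations

@deduped(global_keys=[lambda: my_global])
def fn_with_global(a, *, b=1):
    calls.append((a, b))
    return a + b + my_global

assert fn_with_global(1) == fn_with_global(1)  # second call hits the cache
assert len(calls) == 1
```

In the real decorator a change to the scoped global (e.g. a different part_counts) produces a different key tuple, so expressions built under different SIMD scopes are never wrongly shared.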
(In reply to Jacob Lifshay from comment #49) > Added: > https://git.libre-soc.org/?p=nmutil.git;a=commitdiff; > h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55 nice. that's really well documented. can you raise a separate bugreport about it and vf here and also 721(?) so they both can get a budget?
(In reply to Luke Kenneth Casson Leighton from comment #48)
> (In reply to Jacob Lifshay from comment #43)
> > iirc nothing actually uses PartitionedSignal yet, so a API break to using
> > SimdLayout shouldn't be as big a problem.
>
> except that there may be good reasons why other people (other customers
> whom i cannot reveal or discuss because it is commercially confidential)

can you tell me privately? i was not aware of anyone using it... I'm kinda inclined to say "you decided to use unstable not even released software, you reap the consequences of said software changing", but I recognize that that's probably not helpful...

> who might wish to use PartitionedSignal *as it is* i.e. who may not
> wish to use Elwidths or who may wish to define their own
> adapter or class on *top* of PartitionedSignal, or whom may
> wish to have *all* the mask points available not the subset as
> fictated by Elwidths.
>
> basically i am saying it is too early to go destroying the
> existing PartitionedSignal API and wholesale converting all
> of the unit tests to ElwidthPartitionedSignal.

Well, if we do work in a separate branch, we can see if the design will pan out. if not, we can use pieces of the code in PartitionedSignal 3.0 (since simd_signal is kinda PartitionedSignal 2.0.0.alpha).

I think copying the existing PartitionedSignal and naming the copy ElwidthSimdSignal is better than having it be a subclass...subclassing will make a bunch of messy unnecessary interactions. Also, naming it Simd is waay shorter than Partitioned.

I did kinda run out of time today with getting @deduped to work, so will put off writing SimdLayout till next week. Since I'm the one pushing for it, I'll handle the work needed to integrate it.
(In reply to Luke Kenneth Casson Leighton from comment #50) > (In reply to Jacob Lifshay from comment #49) > > > Added: > > https://git.libre-soc.org/?p=nmutil.git;a=commitdiff; > > h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55 > > nice. that's really well documented. > > can you raise a separate bugreport about it and vf here and also 721(?) > so they both can get a budget? vf??? idk what that means... I'll let you add the appropriate budget, see alsos, etc. @deduped bug: https://bugs.libre-soc.org/show_bug.cgi?id=722
(In reply to Jacob Lifshay from comment #51)

> I think copying the existing PartitionedSignal and naming the copy
> ElwidthSimdSignal is better than having it be a subclass...subclassing will
> make a bunch of messy unnecessary interactions.

it's surprisingly small (and regular), needing only 3 functions/properties

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_repl/repl.py;h=364b7721c9ec896e9e5d0cbf133404f3631ad059;hb=HEAD#l90

1) the thing to switch on (line 90)
2) the list of values to create cases from (line 92)
3) (as an optimisation) a mask identifying padding-only lanes, stopping partial computations in padding lanes from taking place.

note 1 and 2 are *not* PartitionedSignal.__Switch__, they are actual (scalar) Module m.Switch m.Case.

lines 90-92 in repl.py would become something like:

    with m.Switch(self.something.get_selector()):
        for pbit in self.something.get_allowed_cases():

that is *literally* the extent of the modifications required to allow PartitionedSignal submodules/subfunctions to adapt.

> Also, naming it Simd is waay
> shorter than Partitioned.

yeah i'll do a (big) rename in the morning. SimdSignal?
(In reply to Jacob Lifshay from comment #52)

> vf??? idk what that means...

it means it was late, i was tired, and hit totally the wrong letters :) hilariously i think it was supposed to be the word "by"

(In reply to Jacob Lifshay from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #48)
> > except that there may be good reasons why other people (other customers
> > whom i cannot reveal or discuss because it is commercially confidential)
>
> can you tell me privately? i was not aware of anyone using it...

done. for the archives (and if there is an Audit), we should be able to make an announcement soon.

> I'm kinda inclined to say "you decided to use unstable not even released
> software, you reap the consequences of said software changing", but I
> recognize that that's probably not helpful...

fortunately the partsig nmigen branch is a ridiculously small 400 line diff. forward-merging indefinitely is a minor task if we are forced into that situation by intransigent behaviour.

whitequark's objections here were based on lack of knowledge of confidential business discussions underway at the time, combined with input from "community members" who have no experience in Supercomputing Architecture stating "i have absolutely no need for that in *my* code". it allowed her to justify stating "there are clearly no use-cases beyond Libre-SOC. case dismissed", which is as shocking a judgement as it sounds, when summarised.

she also neglected to inform us of the results of a private discussion which reneged on an agreement to follow an RFC process, only informing us months late, right when this is critical path. what looks to her like "trying to force our code onto another project" is down to being backed into a corner and left with no other options to meet our contractual obligations to NGI POINTER in time and within budget. we have to document these things publicly unfortunately because of the transparency obligations we're under.
there was a slashdot article only a couple days ago with the results of a study on the effects of power: power does genuinely corrupt, causing measurable neuro-degenerative effects! the lack of empathy resulting from constantly making autocratic decisions apparently quite literally causes brain damage over a long enough period. on reading the article, it definitely made sense of a lot of behavioural patterns i've seen in so many community projects. quite an eye-opener.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/ i'm writing out notes based on conversations and examples here. it's got big enough, needs some clarity.
(In reply to Luke Kenneth Casson Leighton from comment #53)

> yeah i'll do a (big) rename in the morning. SimdSignal?

realised it's best to wait for you to respond before acting. i am still mulling over how and whether Par...^W SimdSignal can be adapted to support elwidths but also as a default option create full power-2 partitions and cover all permutations currently covered by PartitionedSignal. if so then yes, PartitionedSignal conversion can go ahead.

the decision revolves around whether the (small!) adapter described in comment #53 can "emulate" existing PartitionedSignal behaviour, and i believe the answer is yes by:

1) returning Cat(PartitionPoints.mask.values()) as the switcher
2) returning range(len(PartitionPoints)) as the caser
3) returning *zero* as the "blanker" mask to indicate that there are no dead (padding) partitions.

[edit: the answer's yes]
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=5a1c91609c64eace13bbb46902dba277e318aa9e
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=4bbf1d04380bd2f15f8fc4cfb7537b9144a23f81
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=5a81a6e7a0ae48119097928f15d4ab7a44192afa
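The three rules above can be sketched in pure Python (no nmigen). All the names here (`FullPartAdapter`, `get_switch`, `get_cases`, `blanking_mask`) are illustrative, not the actual adapter API, and the caser here enumerates every combination of the mask bits, which is what covering the full PartitionedSignal permutation set requires.

```python
# Pure-Python sketch (hypothetical names) of an adapter emulating the
# classic PartitionedSignal behaviour per the three rules above.

class FullPartAdapter:
    def __init__(self, mask_bits):
        # mask_bits stands in for PartitionPoints.mask.values();
        # in real HDL these would be 1-bit nmigen Signals
        self.mask_bits = list(mask_bits)

    def get_switch(self):
        # 1) switch on the concatenated partition-point mask bits
        #    (in nmigen: Cat(*mask_bits))
        return self.mask_bits

    def get_cases(self):
        # 2) one case per possible combination of the mask bits
        return range(1 << len(self.mask_bits))

    @property
    def blanking_mask(self):
        # 3) zero: in the classic mode no partition is ever padding-only
        return 0


adapter = FullPartAdapter(["p8", "p16", "p24"])  # 32-bit, 8-bit partitions
assert list(adapter.get_cases()) == list(range(8))
assert adapter.blanking_mask == 0
```

The Elwidth adapter from comment #53 differs only in what these three methods return: the elwid Signal as the switch, the per-elwidth mask values as the cases, and a nonzero blanking mask for padding-only lanes.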
(In reply to Luke Kenneth Casson Leighton from comment #54) > (In reply to Jacob Lifshay from comment #52) > > I'm kinda inclined to say "you decided to use unstable not even released > > software, you reap the consequences of said software changing", but I > > recognize that that's probably not helpful... > > fortunately the partsig nmigen branch is a ridiculously small 400 line diff. I had meant that PartitionedSignal wasn't stable nor released, I wasn't saying anything about nmigen. Next time, if possible, please at least email me, since I probably would have said PartitionedSignal probably isn't ready yet.
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Luke Kenneth Casson Leighton from comment #53)
>
> > yeah i'll do a (big) rename in the morning. SimdSignal?

go ahead and rename. I'll figure stuff out on monday.

> realised it's best to wait for you to respond before acting.

:) thx.

> i am still mulling over how and whether Par...^W SimdSignal
> can be adapted to support elwidths but also as a default
> option create full power-2 partitions and cover all
> permutations currently covered by PartitionedSignal.

oh, well all permutations currently covered by PartitionedSignal (at least add/sub/logic-ops) are waay more than just the aligned, power-of-2 lanes.

I have an idea where there's a signal like SimdPartMode (which is: Simd split into abstract "parts", which are combined to create lanes -- 1-bit signal per part determines if a lane starts there), except each part_starts signal is an enum:

    class FpPartStartKind(Enum):
        NotAStart = 4  # only one with bit set, hopefully simplifying logic
        F16 = 1  # adjust to match elwid's definition
        BF16 = 2
        F32 = 3
        F64 = 0

this allows defining lanes like so (128-bit Simd so I can demo more combinations):

    part_starts = [
        FpPartStartKind.F16,        # 1-part F16 lane in part 0
        FpPartStartKind.BF16,       # 1-part BF16 lane in part 1
        FpPartStartKind.F32,        # 2-part F32 lane in parts 2 and 3
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.F64,        # 4-part F64 lane in parts 4, 5, 6 and 7
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.NotAStart,  # same lane as above
    ]

This may take a little more work to decode, and be more wires for part_starts, but it covers everything that both the simd_signal code I wrote and the layout() in comment #20 can do.
it covers everything PartitionPoints can do, too, though we'll probably still want to restrict lanes to be aligned, with power-of-two part counts (since the part-width can be non-power-of-2, it can still save space that way -- e.g. a lane can be 4 parts wide, since 4 is a power-of-2, and parts can be 3-bits wide, allowing the lane to be 12-bits wide since 12=4*3). This layout can be converted to PartitionPoints, by just setting the partition bit i * part_width for each part_starts[i] != NotAStart. simd_signal.SimdPartMode.is_valid can be adapted to create a circuit used to feed into an Assume to get only valid part_starts combinations (checks that lanes are sized correctly, and are aligned correctly).
(In reply to Jacob Lifshay from comment #58) > This layout can be converted to PartitionPoints, by just setting the > partition bit i * part_width for each part_starts[i] != NotAStart. more logic will be required to add partition points to split padding out, the above algorithm produces partitions that include their corresponding padding.
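The part_starts-to-PartitionPoints conversion described above can be sketched as a plain-Python bit-position calculation. As the correction above notes, this keeps padding inside its lane's partition; splitting padding out needs further logic. The enum values are repeated from the earlier comment for self-containment, and the helper name is illustrative.

```python
# Sketch of the conversion: set a partition point at bit i * part_width
# wherever part_starts[i] marks a lane start (position 0 excluded, since
# the signal's lower boundary is not a partition point). Padding stays
# inside its lane's partition -- splitting it out needs more logic.

from enum import Enum

class FpPartStartKind(Enum):  # values as in the comment above
    NotAStart = 4
    F16 = 1
    BF16 = 2
    F32 = 3
    F64 = 0

def part_starts_to_points(part_starts, part_width):
    points = set()
    for i, kind in enumerate(part_starts):
        if i != 0 and kind != FpPartStartKind.NotAStart:
            points.add(i * part_width)  # a new lane begins at this bit
    return sorted(points)

# the 128-bit example from the comment above: F16 | BF16 | F32 | F64
K = FpPartStartKind
starts = [K.F16, K.BF16, K.F32, K.NotAStart,
          K.F64, K.NotAStart, K.NotAStart, K.NotAStart]
assert part_starts_to_points(starts, 16) == [16, 32, 64]
```

In the real design each point would be a 1-bit Signal driven from a comparison of the part_starts enum wires, not a precomputed integer set.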
(In reply to Jacob Lifshay from comment #58)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > i am still mulling over how and whether Par...^W SimdSignal
> > can be adapted to support elwidths but also as a default
> > option create full power-2 partitions and cover all
> > permutations currently covered by PartitionedSignal.
>
> oh, well all permutations currently covered by PartitionedSignal (at least
> add/sub/logic-ops) are waay more than just the aligned, power-of-2 lanes.

yes, i know. that's covered in https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/ and it is covered by the enhancements to the layout() adapter function i describe in comment #53. the example in the (new) wiki page is a cleanup of the example in comment #2

there are actually NINE (nine!) underlying partition points created by that innocuous-looking example. so it looks like we need a completely insane 2^9 (512) permutations, right?

wrong. there are only three in that example.

* elwidth=32
* elwidth=16
* elwidth=8

and that's where the adapter comes into play. it defines what the switch is, it defines the cases.

you DO NOT need a switch taking a 9-bit input
you DO NOT need a massive batch of 512 case statements

i wrote an adapter which covers the existing PartitionedSignal behaviour: please take a look at the patches from comment #56 to understand it. the "Elwidth" adapter is conceptually finished as well, based on the code in comment #53.

> This may take a little more work to decode, and be more wires for
> part_starts, but it covers everything that both the simd_signal code I wrote
> and the layout() in comment #20 can do. it covers everything PartitionPoints
> can do, too,

jacooob: we're *not* throwing away code, we're *adapting* and slowly morphing *existing* code, one incremental step at a time. this is the project's base-line operating protocol

this means that we go from one known-tested to another known-tested without scaring ourselves by taking on too much.
it does mean however that you need to have a full working knowledge of the *existing* design.

> though we'll probably still want to restrict lanes to be
> aligned,

the Elwidth adapter from comment #53 is predicated on the fundamental assumption that all elements start on power-two boundaries, yes.

> simd_signal.SimdPartMode.is_valid can be adapted to create a circuit used to
> feed into an Assume to get only valid part_starts combinations (checks that
> lanes are sized correctly, and are aligned correctly).

you're confusing me with a new design when my entire mindset is focussed on the existing design. if you describe things to me in terms of how the EXISTING design is code-morphed *INTO* the proposed new design - not replaced by - CHANGED INTO - i will be able to understand. feet on the ground, jacob.

rename is done, btw.

    commit df93b8f37bea53917d67e87ce2190981bc2ef667 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Sun Oct 10 11:35:20 2021 +0100

        big rename PartitionedSignal to SimdSignal (shorter)
(In reply to Jacob Lifshay from comment #59) > (In reply to Jacob Lifshay from comment #58) > > This layout can be converted to PartitionPoints, by just setting the > > partition bit i * part_width for each part_starts[i] != NotAStart. > > more logic will be required to add partition points to split padding out, > the above algorithm produces partitions that include their corresponding > padding. that's what the code in comment #53 is there to identify, and the "blanking" mask is the OR of all of the unused (padding) partitions under each case (elwidth=0b00/0b01/0b10/0b11). (actually probably the AND, but hey). the adapter is actually real simple.
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=d90d9ca5a88c3f6adf8fb8d10d369fc89d803382 currently going through picking up bits of code from the layout() experiment and dropping them in here https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;hb=HEAD it's evolving but i'm trying to keep the order the same as the conversation
this adds the phase 3 code from comment #34 https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=f91e3bcfecaa7d980327ff531e559e09513b524a phase 1 and 2 was here https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=95a5db6ed94749ca2079fa572030040b6a0235ee and checking it, from comment #47: https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=01f99ad3388d10f41c784bbc0a2bd4bf1eec777b added *option* to fix the layout width, that's from comment #22: https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=9d2cacebdf7ccaef9307903e9cba9780d766cf5f going well - one incremental change at a time, morphing the code one commit at a time, to incorporate each idea one at a time.
(In reply to Jacob Lifshay from comment #42) > > and *nothing* else can take place > > during that time. > > false...luckily, if the globals sorry, re-reading, you misunderstood: i was talking about a workflow / procedure, not code. during massive regularised code-changes no *other work* can take place (everything grinds to a halt) because to try to mix in other work in the middle of a massive regular systematic batch of code-changes results in code-merges that are too insane to manage. big regular systematic renames are an example.
okaay, so *now* we have something that can act as an adapter (and more like a derivative of Shape())

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=4f185874f40228471434679d1c2ffa1b23f9f3ca;hb=7e61aab66f2e444e83fd564e85dc957072ca1be4

it's also got code-comments explaining what's going on. jacob: i *deduced* that you added the ability to specify the same width for all partitions, after the fact:

    32     if not isinstance(lane_shapes, Mapping):
    33         lane_shapes = {i: lane_shapes for i in part_counts}

i was not able when initially reading the code from comment #20 to work that out. i therefore added three comment lines:

    29     # identify if the lane_shapes is a mapping (dict, etc.)
    30     # if not, then assume that it is an integer (width) that
    31     # needs to be requested across all partitions

you can see where it's going, how the adapter fits:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/partsig.py;h=bbad089360a070304dc2c5f1d38dfb29e6c56a6c;hb=7e61aab66f2e444e83fd564e85dc957072ca1be4#l84

in the ElwidthPartType adapter, the switch is on the elwidth and the cases is the bitp dictionary. still TODO in layout() is to merge all of the bitp dictionary values to determine which parts of the partition are padding at *all* times.
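The two-line normalisation quoted above (a bare int broadcast to every elwid, a mapping used as-is) can be shown standalone; `normalise_lane_shapes` is an illustrative wrapper name, not a function in layout_experiment.py.

```python
# Standalone demo of lines 32-33 quoted above: if lane_shapes is not a
# mapping, treat it as a single width requested across all partitions.

from collections.abc import Mapping

def normalise_lane_shapes(lane_shapes, part_counts):
    # identify if the lane_shapes is a mapping (dict, etc.)
    # if not, assume it is an integer (width) to be used for all partitions
    if not isinstance(lane_shapes, Mapping):
        lane_shapes = {i: lane_shapes for i in part_counts}
    return lane_shapes

part_counts = {0: 1, 1: 1, 2: 2, 3: 4}  # elwid -> partition count
assert normalise_lane_shapes(24, part_counts) == {0: 24, 1: 24, 2: 24, 3: 24}
assert normalise_lane_shapes({0: 24, 1: 12}, part_counts) == {0: 24, 1: 12}
```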
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=48f1e8c97dd4adc9d9161e362f389fbf70506033

    + # fourth stage: determine which partitions are 100% unused.
    + # these can then be "blanked out"
    + bmask = (1<<len(plist))-1
    + for p in bitp.values():
    +     bmask &= ~p

done, but ironically (because the elwidth=0b11 partition is 4x24-bit) there *are* no unused portions in the example. next stage will be to add an example which does have blank unused padding parts through all elwidths
ah! i realised a couple of things:

1) setting a fixed width needs to be *optional* not mandatory. we need *both* the option to specify that the width is to be per *element* (lane) as a fixed quantity *and* the option to set the overall width

2) and another option to say that, based on the fixed width and based on the partition counts, the element widths should be automatically calculated

    def layout(elwid, part_counts, lane_shapes=None, fixed_width=None):
        if lane_shapes is None:
            lane_shapes = {}
            for p, c in part_counts.items():
                # calculate width of each partition
                lane_shapes[p] = fixed_width // c
            # or would this work?
            # lane_shapes = {i: fixed_width // part_counts[i]
            #                for i in part_counts}
        if not isinstance(lane_shapes, Mapping):
            lane_shapes = {i: lane_shapes for i in part_counts}

of course an assert if both lane_shapes and fixed_width are None.

basically, the actual registers are 64 bit so we *have* to have a way to declare explicit subdivisions of exactly 64 bit. but, for intermediary signals you want to say (such as for shift) with absolute certainty that what you are using to shift is a fixed width for all elwidths.
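Option 2 above (auto-calculating element widths from the fixed overall width and the partition counts) reduces to one dict comprehension; `auto_lane_shapes` is an illustrative name for just that step, extracted from the layout() sketch.

```python
# Sketch of option 2 above: derive each elwidth's element width from
# fixed_width and the partition counts alone (no explicit lane_shapes).

def auto_lane_shapes(part_counts, fixed_width):
    # divide the fixed overall width by the partition count for each elwid
    return {ew: fixed_width // c for ew, c in part_counts.items()}

# e.g. a 64-bit register subdivided per elwid
part_counts = {0: 1, 1: 2, 2: 4, 3: 8}
assert auto_lane_shapes(part_counts, 64) == {0: 64, 1: 32, 2: 16, 3: 8}
```

This makes "exactly 64-bit overall" the invariant, with per-elwidth element widths falling out of it, which is exactly what the register-file case above requires.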
looking through the code, you got what part_counts is mixed up:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=1f0b761a019fae0fd0005a52649c5f5ad080bea2;hb=48f1e8c97dd4adc9d9161e362f389fbf70506033#l94

You wrote:

> 94  # for each element-width (elwidth 0-3) the number of partitions is given
> 95  # elwidth=0b00 QTY 1 partitions: | ? |
> 96  # elwidth=0b01 QTY 1 partitions: | ? |
> 97  # elwidth=0b10 QTY 2 partitions: | ? | ? |
> 98  # elwidth=0b11 QTY 4 partitions: | ? | ? | ? | ? |
> 99  # actual widths of Signals *within* those partitions is given separately
> 100 part_counts = {
> 101     0: 1,
> 102     1: 1,
> 103     2: 2,
> 104     3: 4,
> 105 }

part_counts is actually a dict from elwid values to the number of parts *in a single lane*, *not* the number of parts in the whole SIMD signal.

So, for the ALU's inputs/outputs (for the integer ALUs), part_counts' values are the number of bytes (since a byte is the smallest integer elwid) in each elwid type:

    part_counts = {
        IntElWid.I8: 1,   # i8 takes 1 byte
        IntElWid.I16: 2,  # i16 takes 2 bytes
        IntElWid.I32: 4,  # i32 takes 4 bytes
        IntElWid.I64: 8,  # i64 takes 8 bytes
    }

For the FP ALUs' inputs/outputs, part_counts' values are the number of 16-bit words, since the smallest FP elwid is 16-bits:

    part_counts = {
        FpElWid.F16: 1,   # f16 takes 1 16-bit word
        FpElWid.BF16: 1,  # bf16 takes 1 16-bit word
        FpElWid.F32: 2,   # f32 takes 2 16-bit words
        FpElWid.F64: 4,   # f64 takes 4 16-bit words
    }
reading through what you wrote on the wiki:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/

You appear to want to keep forcing SimdShape into being a Shape, and that causes problems with SimdShape's current design...

SimdShape can't be an instance of Shape as you are trying to do, because Shape.width is the width of a nmigen scalar Signal/Value; when vectorizing scalar code, Shape.width ends up being the lane width (excluding padding). The problem arises because lanes have different widths for different elwid values, so there *isn't* a single consistent integer value for Shape.width.

Shape.width *is not* the same as SimdShape.width, since SimdShape.width is the width of the whole SIMD signal, which is, from the perspective of user code in an ALU, an arbitrary multiple of Shape.width. That multiplication factor is determined by the calling code and can arbitrarily vary (e.g. we could decide that we want 256-bit wide SIMD ALUs, or we could decide that we want to make a tiny 32-bit cpu and we want 32-bit wide SIMD ALUs (which have 0 64-bit lanes, 1 32-bit lane, 2 16-bit lanes, or 4 8-bit lanes)).

A possible solution that keeps SimdShape as a subclass of Shape is to rename the current SimdShape.width to something like .simd_width, allowing us to instead have SimdShape.width be a SimdMapInt (a subclass of SimdMap and int) of lane widths (the .width components of lane_shapes), that would meet the semantic requirements of Shape. e.g.:

    class SimdShape(Shape):  # aka. SimdLayout
        def __init__(self, lane_shapes):
            if isinstance(lane_shapes, SimdShape):
                self.lane_shapes = lane_shapes.lane_shapes
            else:
                # convert to SimdMap with Shapes for values, handles
                # input dicts, SimdMaps, Shapes, ints, Enums, etc.
                self.lane_shapes = SimdMap.map(Shape.cast, lane_shapes)
            signed = None
            for elwid in global_scope().lane_counts.keys():
                lane_shape = self.lane_shapes.values[elwid]
                if signed is None:
                    signed = lane_shape.signed
                else:
                    assert signed == lane_shape.signed
                assert not isinstance(lane_shape, SimdShape), \
                    "can't have a SimdShape for a lane's shape"
            # Shape.__init__ doesn't like non-int widths,
            # so pass in fake width for now
            super().__init__(1, signed)
            # width is a SimdMap with each lane's bit width for values
            width = SimdMapInt(SimdMap.map(lambda s: s.width,
                                           self.lane_shapes))
            # finish calculating layout...
            ...

    # needed since Shape requires width to be an int
    class SimdMapInt(SimdMap, int):
        def __init__(self, values):
            # convert every value to an int
            values = SimdMap.map(int, values)
            super().__init__(values)

so, a SimdShape would end up with fields like so:

    part_counts = {
        ElWid.F16: 1,   # f16 takes 1 16-bit word
        ElWid.BF16: 1,  # bf16 takes 1 16-bit word
        ElWid.F32: 2,   # f32 takes 2 16-bit words
        ElWid.F64: 4,   # f64 takes 4 16-bit words
    }

    with simd_scope(part_counts, total_parts=8):  # 8 16-bit parts, 128-bit SIMD
        s = SimdShape(XLEN)
        s.signed == False
        s.width == SimdMapInt({
            ElWid.F16: 16,
            ElWid.BF16: 16,
            ElWid.F32: 32,
            ElWid.F64: 64,
        })
        s.lane_shapes == SimdMap({
            ElWid.F16: unsigned(16),
            ElWid.BF16: unsigned(16),
            ElWid.F32: unsigned(32),
            ElWid.F64: unsigned(64),
        })
        s.simd_width == 128  # what used to be SimdShape.width, since renamed
        s.part_count = 8  # 8 16-bit parts
        s.part_width = 16
        ...  # rest of `s`'es fields omitted
        sig = SimdSignal(s)
        sig.shape() == s
        sig.sig.shape() == unsigned(128)

another example:

    with simd_scope(part_counts, total_parts=4):  # 4 16-bit parts, 64-bit SIMD
        s = SimdShape(XLEN)
        s.signed == False
        s.width == SimdMapInt({
            ElWid.F16: 16,
            ElWid.BF16: 16,
            ElWid.F32: 32,
            ElWid.F64: 64,
        })
        s.lane_shapes == SimdMap({
            ElWid.F16: unsigned(16),
            ElWid.BF16: unsigned(16),
            ElWid.F32: unsigned(32),
            ElWid.F64: unsigned(64),
        })
        s.simd_width == 64  # what used to be SimdShape.width, since renamed
        s.part_count = 4  # 4 16-bit parts
        s.part_width = 16
        ...  # rest of `s`'es fields omitted
        sig = SimdSignal(s)
        sig.shape() == s
        sig.sig.shape() == unsigned(64)
(In reply to Luke Kenneth Casson Leighton from comment #60)

thinking about it again, maybe we need to have a scoped global dict of
supported partitioning modes (which elwidths to use for which lanes):

example for FP:

    simd_modes = {
        # 4 lanes each with the shape for F16
        SimdMode.F16x4: (FpElWid.F16,) * 4,
        # 4 lanes each with the shape for BF16
        SimdMode.BF16x4: (FpElWid.BF16,) * 4,
        # 2 lanes each with the shape for F32
        SimdMode.F32x2: (FpElWid.F32,) * 2,
        # 1 lane with the shape for F64
        SimdMode.F64x1: (FpElWid.F64,),
        # we can add other combinations later, such as:
        #SimdMode.F16x2_F32x1: (FpElWid.F16,) * 2 + (FpElWid.F32,),
        # or even int/fp combinations:
        #SimdMode.I32x1_F32x1: (IntElWid.I32, FpElWid.F32,),
    }

(By this point, ElWid would only be used by the SimdSignal system as a
lane-kind designator, so maybe we should have a separate LaneKind enum
type and not use ElWid here; ElWid could still be used for the elwid
Signal, of course)

For an ALU:

    simd_mode = Signal(SimdMode)
    # simd_mode replaces elwid as the Signal that determines which
    # lanes are enabled... it can be set based on elwid of course.

so, if we have the current value of simd_mode == SimdMode.F16x2_F32x1,
then:

    s = SimdSignal(XLEN)
    # s has lanes with shapes u16, u16, and u32.

    mantissa_field = SimdSignal({
        ElWid.F16: 10,
        ElWid.BF16: 7,
        ElWid.F32: 23,
        ElWid.F64: 52,
    })
    # mantissa_field has lanes of u10, u10, and u23

    unbiased_exponent = SimdSignal({
        ElWid.F16: signed(5),
        ElWid.BF16: signed(8),
        ElWid.F32: signed(8),
        ElWid.F64: signed(11),
    })
    # exponent has lanes of s5, s5, and s8
(In reply to Jacob Lifshay from comment #69)
> reading through what you wrote on the wiki:
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
>
> You appear to want to keep forcing SimdShape into being a Shape, and that
> causes problems with SimdShape's current design...
>
> SimdShape can't be an instance of Shape as you are trying to do, because
> Shape.width is the width of a nmigen scalar Signal/Value,

yes, perfect. that means that, in combination with the underlying Signal
(SimdSignal.sig, a private member), down-casting works perfectly.

if Signal.cast() or Signal.like() is called, i expect it to grab the
underlying Signal, grab the width, grab the sign, and blat the entirety of
the underlying Signal into the (new) one. done. no hassle. no
modifications to nmigen required to get that to work. this was the whole
point of UserValue (and UserCastable): that you are *required* to provide
compliance with Shape().

> when vectorizing
> scalar code, Shape.width ends up being the lane width (excluding padding).
> The problem arises because lanes have different widths for different elwid
> values, so there *isn't* a single consistent integer value for Shape.width.
> Shape.width *is not* the same as SimdShape.width,

ah. then this is where you have become confused. it is *defined* so.
that's the whole basis of PartitionedSignal / SimdSignal: that there *is*
an underlying signal (private member, SimdSignal.sig) which *is* by
definition exactly that.

now, will there be blank unused portions of that underlying Signal named
sig? yes. is it the job of SimdShape to help identify them? yes.

all of the eq and other functions have been *specifically* designed
around the underlying signal being of length and sign Shape.width and
Shape.signed. that's the whole point.

> since SimdShape.width is
> the width of the whole SIMD signal, which is, from the perspective of user
> code in an ALU, an arbitrary multiple of Shape.width.
> That multiplication factor is determined by the calling code and can
> arbitrarily vary (e.g. we could decide that we want 256-bit wide SIMD
> ALUs, or we could decide that we want to make a tiny 32-bit cpu and we
> want 32-bit wide SIMD ALUs (which have 0 64-bit lanes, 1 32-bit lane,
> 2 16-bit lanes, or 4 8-bit lanes)).

yes. Shape.width == 256. or Shape.width == 32. downcasting with
Value.cast will pick both of those up.

> A possible solution that keeps SimdShape as a subclass of Shape is to
> rename the current SimdShape.width to something like .simd_width,
> allowing us to instead have SimdShape.width be a SimdMapInt (a subclass
> of SimdMap and int) of lane widths (the .width components of
> lane_shapes), that would meet the semantic requirements of Shape.

uh-uh, that's the wrong way round. yes, it's one of the possibilities,
but it's a mistake: it doesn't allow "clean" downcasting. the route that
you are suggesting would require extensive modification of nmigen, for
nmigen to comprehend a new concept: simd_width.

the idea here is *not* to make extensive modifications to nmigen. we
currently have a ridiculously-small 420-line diff. no way are we doing
extensive nmigen modifications: it would be suicidal for the project.
(In reply to Jacob Lifshay from comment #68)
> looking through the code, you got what part_counts is mixed up:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/
> layout_experiment.py;h=1f0b761a019fae0fd0005a52649c5f5ad080bea2;
> hb=48f1e8c97dd4adc9d9161e362f389fbf70506033#l94
>
> You wrote:
> > 94 # for each element-width (elwidth 0-3) the number of partitions is given
> > 95 # elwidth=0b00 QTY 1 partitions: | ? |
> > 96 # elwidth=0b01 QTY 1 partitions: | ? |
> > 97 # elwidth=0b10 QTY 2 partitions: | ? | ? |
> > 98 # elwidth=0b11 QTY 4 partitions: | ? | ? | ? | ? |
> > 99 # actual widths of Signals *within* those partitions is given separately
> > 100 part_counts = {
> > 101     0: 1,
> > 102     1: 1,
> > 103     2: 2,
> > 104     3: 4,
> > 105 }
>
> part_counts is actually a dict from elwid values to the number of parts *in
> a single lane*, *not* the number of parts in the whole SIMD signal.

tcktcktck... *thinks*... where's the misunderstanding...

we need some terminology. SIMD Lanes seems to refer to the "whole blob"
(the entire batch of elements, irrespective of subdivision):
http://15418.courses.cs.cmu.edu/spring2015/lecture/gpuarch/slide_003
however i have seen definitions where SIMD "lane" refers to the element.
therefore let's try to avoid its use, but assume for the current context
it means "the whole blob irrespective of element-level subdivisions".

"part_counts is actually a dict from elwid values to the number of parts
*in a single lane*"

yep, i'd agree with that (caveat: "lane" being defined as "the whole
blob", e.g. the ALU width)

"*not* the number of parts in the whole SIMD signal"

which would be *after* the subdivisions have taken place, and could
potentially include blank (padding) sections, i.e. PartitionPoints.

would you agree then that one was "requested" and the other "allocated"?
(sort-of.
PartitionPoints is admittedly a different data structure, but layout with the elwidth==i stuff makes it *possible* to express part_points-including-padding)
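one way to read the "requested" vs "allocated" distinction numerically is
the following sketch (plain Python; the names are hypothetical and the
numbers are borrowed from the 5/6/6/6 example later in the thread, with
part_counts read as "partitions in the whole signal at each elwidth"):

```python
# "requested": the element widths the user asks for, per elwidth.
# "allocated": the aligned region each element actually receives,
# which can include blank (padding) bits.
total = 24                               # whole SIMD signal width
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}   # partitions at each elwidth
requested   = {0: 5, 1: 6, 2: 6, 3: 6}   # element widths requested

# each element gets an aligned region of total//count bits
allocated = {ew: total // c for ew, c in part_counts.items()}
padding   = {ew: allocated[ew] - requested[ew] for ew in part_counts}

print(allocated)  # {0: 24, 1: 24, 2: 12, 3: 6}
print(padding)    # {0: 19, 1: 18, 2: 6, 3: 0}
```

at elwidth 0b00 the single 5-bit element is "requested" but a full 24-bit
region is "allocated", leaving 19 bits of padding, matching the diagrams
in the layout discussion.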
i think we're slowly getting to the bottom of where the assumptions are
that caused you to believe that it is impossible to use nmigen with such
tiny modifications (and to believe that it is undesirable to do so).

it revolves around elwidths and the definition of Shape.width.

the past 18 months' work has been fundamentally based on the assumption
that it is, by design, possible to fit. much of that time was spent
verifying that that was possible.

the assumption / definition and all code is based on:

    Shape.width == SimdShape.width == len(SimdSignal._its_internalsignal)

we need - and other users will expect - that Casting will "just work".
SimdSignal should be castable to Signal and all bits *directly* copied
(including unused padding partitions) seamlessly at the bit level.
likewise Record, and other nmigen constructs (sigh, yes, we need to do a
SimdRecord at some point. not a priority).

this *is* how nmigen works. all Value-derivatives *are* copyable from one
to the other by making the fundamental assumption that when converted to
bit level they are all effectively "the same".

https://git.libre-soc.org/?p=nmigen.git;a=blob;f=nmigen/hdl/ast.py;h=372148205697a78280cee4c84cdf20ff2ec173e4;hb=6e7329d5cc593e05dfcb1f770a9cd878226a44ce#l1066

this is Signal.like():

    1066         kw = dict(shape=Value.cast(other).shape(), name=new_name)
    1067         if isinstance(other, Signal):
    1068             kw.update(reset=other.reset, reset_less=other.reset_less,
    1069                       attrs=other.attrs, decoder=other.decoder)
    1070         kw.update(kwargs)
    1071         return Signal(**kw, src_loc_at=1 + src_loc_at)

fundamentally, for that to work it is *required* that SimdSignal have a
shape() function that returns a Shape() or Shape-derivative. the
*entirety* of PartitionedSignal (now SimdSignal) is *fundamentally*
designed on this principle.

the element widths are an addition that is *our* problem to conceptualise
and deal with, and as they are our problem they need to take second
priority when fitting onto nmigen base-level pre-existing concepts.
this is standard textbook OO design, and with nmigen being so well
designed it's remarkable and fortunate that this is even possible.

if however we choose to define the *element* width as the priority then
we are in deep s*** basically, because it would require a catastrophic
cascade of modifications to nmigen *and* abandonment of 18 months of
work.

i am extremely good at identifying these kinds of "minimum retrofit"
opportunities - i have been doing it for 20 years - and it would be
foolish not to exploit them (even if the lead developer does not wish us
to, and reneged on an agreement to consider doing so)
(In reply to Luke Kenneth Casson Leighton from comment #67)
> def layout(elwid, part_counts, lane_shapes=None,
>            fixed_width=None):
>     if lane_shapes is None:
>         # or would this work?
>         # lane_shapes = {i: lane_width // lane_shapes[i]
>         #                for i in part_counts}

actually,

    lane_shapes = {i: lane_width // part_counts[i] for i in part_counts}

and it works well.

the problem is that you've defined lane_shapes to *be* the sum total of
the element width *times* the part_count.

* for an element width of e.g. 16
* and for a part_count[2] == 2
* you require lane_shapes[2] to be **32** not 16

this makes no sense, and is completely counterintuitive.

part_count defines the number of partitions in each elwidth.

lane_shapes naturally should define the *individual element* width, *not*
the element width times the number of partitions. otherwise that
information bleeds back into the actual usage, and it becomes impossible
to define a transparent API that allows dynamic SIMD or Scalar
compile-time switching
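for concreteness, here is that corrected comprehension evaluated
standalone (plain Python; the fixed lane_width of 32 is an arbitrary
illustrative choice):

```python
# element width at each elwidth = total lane width / number of partitions
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
lane_width = 32   # e.g. a fixed_width of 32

lane_shapes = {i: lane_width // part_counts[i] for i in part_counts}
print(lane_shapes)  # {0: 32, 1: 32, 2: 16, 3: 8}
```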
arrgh, ok now i understand what's going on. it's possible to identify
that this is fundamentally broken:

    # compute a set of partition widths
    cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
    print ("cpart_wid", cpart_wid, "part_counts", part_counts)
    cpart_wid = -min(cpart_wid)

for the example given

    part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
    widths_at_elwidth = {0: 5, 1: 6, 2: 12, 3: 24}

it creates:

    cpart_wid [-5, -6, -6, -6]

and produces a minimum of 5 for cpart_wid. whereas what is required is:

    part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
    widths_at_elwidth = {0: 5, 1: 6, 2: 6, 3: 6}  # NO part_count mul-factor

and then for the results to be:

    # elwidth=0b00 1x 5-bit  | unused                    ....5 |
    # elwidth=0b01 1x 6-bit  | unused                   .....6 |
    # elwidth=0b10 2x 12-bit | unused  .....6 | unused  .....6 |
    # elwidth=0b11 4x 24-bit | .....6 | .....6 | .....6 | .....6 |

and therefore there to be ***FOUR*** partition points ***NOT*** 14 as is
currently the case by defining the base width to be 5:

                             v        v        v       vv       v
    # elwidth=0b00 1x 5-bit  |        |        |       || ....5 |
    # elwidth=0b01 1x 6-bit  |        |        |      .....6    |
    # elwidth=0b10 2x 12-bit |        | .....6 |        .....6  |
    # elwidth=0b11 4x 24-bit | .....6 | .....6 | .....6 | .....6 |

the partition points created should therefore be at:

* 0 (which gets deleted, later)
* 5 (to fit the single 5-bit value when elwidth=0b00)
* 6 (to fit the single 6-bit value when elwidth=0b01, and all other 1st
  elements)
* 12 and 18 (to fit the other 2 elwidths, 0b10 and 0b11)
* 24 (which gets deleted, later)

layout as designed creates something absolutely mad, something in excess
of 14 partition points!

    test elwidth=0
    (0, ([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0], 64,
         {0: 5, 1: 6, 2: 12, 3: 24}, 16, 4))

creating that compact set is a lot more involved than simply selecting
the minimum width and dictating that all partitions shall fit onto an
aligned boundary that must always begin at a multiple of that minimum
width.
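the four-points claim above can be checked with a short sketch (plain
Python; the helper name is hypothetical, and it assumes each element sits
at the start of an aligned region of width total//count):

```python
def partition_points(part_counts, widths_at_elwidth, total):
    """collect the bit positions needed by every elwidth's elements"""
    points = set()
    for ew, c in part_counts.items():
        stride = total // c             # aligned region per element
        w = widths_at_elwidth[ew]       # bits actually used in that region
        for i in range(c):
            points.add(i * stride)      # start of each element region
            points.add(i * stride + w)  # end of the used bits
    return sorted(points - {0, total})  # 0 and total get deleted, later

pts = partition_points({0: 1, 1: 1, 2: 2, 3: 4},
                       {0: 5, 1: 6, 2: 6, 3: 6}, 24)
print(pts)  # [5, 6, 12, 18]
```

exactly the four points 5, 6, 12 and 18 come out, matching the diagram.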
intuition tells me it should probably be the other way round: start at the maximum element width and subdivide all others to fit into that. round up all element widths within an elwidth to that max value. assume all partition counts are power-of-two, and i think this is the basis of an algorithm that would slot everything onto suitable "shared" boundaries.
right. ok. i think i have a basis for the algorithm; it is multi-step.

1. (new) assume all partition counts are a power of 2: 1 2 4 8 ...
2. use either fixed_width (if given) or a computed total width.
3. obtain the max number of partitions: max(part_counts.values())
4. divide the total width by maxparts, rounding up.

(4) becomes the partition subdivision size. from there i think the rest
of the algorithm (phase 1) takes over. for the example in
layout_experiment.py that subdivision size should turn out to be *6* not
5.
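steps 2-4 above can be sketched in plain Python (the total-width
computation in step 2 is my assumption: widest element times its
partition count, maximised over elwidths):

```python
import math

part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
widths_at_elwidth = {0: 5, 1: 6, 2: 6, 3: 6}
fixed_width = None

# step 2: use fixed_width if given, else compute a total width
if fixed_width is not None:
    width = fixed_width
else:
    width = max(w * part_counts[ew]
                for ew, w in widths_at_elwidth.items())  # 24

# step 3: obtain the max number of partitions
maxparts = max(part_counts.values())                     # 4

# step 4: divide total width by maxparts, rounded up
subdiv = math.ceil(width / maxparts)
print(width, maxparts, subdiv)  # 24 4 6
```

for the layout_experiment.py example this yields a subdivision size of 6,
as predicted above.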
(In reply to Luke Kenneth Casson Leighton from comment #75)
> arrgh, ok now i understand what's going on, it's possible to identify
> that this is fundamentally broken:
>
>     # compute a set of partition widths
>     cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
>     print ("cpart_wid", cpart_wid, "part_counts", part_counts)
>     cpart_wid = -min(cpart_wid)
>
> for the example given
>
>     part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
>     widths_at_elwidth = {0: 5, 1: 6, 2: 12, 3: 24}
>
> it creates:
>
>     cpart_wid [-5, -6, -6, -6]
>
> and produces a minimum of 5 for cpart_wid.

no, it produces a minimum of -6, since -6 is less than -5.
(In reply to Luke Kenneth Casson Leighton from comment #74) > the problem is that you've defined lane_shapes to *be* the > sum total of the element width *times* the part_count. no, i didn't. a lane (minus padding) *is always* a single element. a lane for one elwid (including padding) is split into multiple smaller lanes for a *different* elwid. > > * for an element width of e.g. 16 > * and for a part_count[2] == 2 > * you require lane_shapes[2] to be **32** not 16 lane_shapes[2] = unsigned(16) for an element shape of unsigned(16). lane_shapes *never* includes padding. > > this makes no sense, and is completely counterintuitive. > > part_count defines the number of partitions in each elwidth > > lane_shapes naturally should define the *individual element* > width *not* the element width times the number of partitions. yup! > otherwise that information bleeds back into the actual usage > and it becomes impossible to define a transparent API that > allows dynamic SIMD or Scalar compile-time switching Yup!
(In reply to Jacob Lifshay from comment #77) > > and produces a minimum of 5 for cpart_wid. > > no, it produces a minimum of -6, since -6 is less than -5. ngghhh ok yes. min -ve then negate is max. tricks like that always need documenting. ok so what the heck is going on. the new code i added, ended up computing really weird partition widths, and the example you created with 5 6 6 6 produces 14 points.
(In reply to Jacob Lifshay from comment #78)
> (In reply to Luke Kenneth Casson Leighton from comment #74)
> > the problem is that you've defined lane_shapes to *be* the
> > sum total of the element width *times* the part_count.
>
> no, i didn't. a lane (minus padding) *is always* a single element.

ok whew, that's what i'd expect.

> a lane for one elwid (including padding) is split into multiple smaller
> lanes for a *different* elwid.
>
> > * for an element width of e.g. 16
> > * and for a part_count[2] == 2
> > * you require lane_shapes[2] to be **32** not 16
>
> lane_shapes[2] = unsigned(16) for an element shape of unsigned(16).

just 16. not unsigned(16).

> lane_shapes *never* includes padding.

yes, agreed. this makes sense.

> > this makes no sense, and is completely counterintuitive.
> >
> > part_count defines the number of partitions in each elwidth
> >
> > lane_shapes naturally should define the *individual element*
> > width *not* the element width times the number of partitions.
>
> yup!
>
> > otherwise that information bleeds back into the actual usage
> > and it becomes impossible to define a transparent API that
> > allows dynamic SIMD or Scalar compile-time switching
>
> Yup!

nggggh. so why does the example i added, where lane_shapes=None is
calculated from the same part_count and fixed_width=32, give this:

    0: 32
    1: 32
    2: 16
    3: 8

but then the partition cpart_wid gets computed as completely wrong, like
32 32 8 2 or something. can you take a look?
    cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items())

that's definitely dividing the lane_shape _total_ by the part count.
it should probably just be

    cpart_wid = -min(-lane_shapes[i] for i, c in part_counts.items())

and the input, lane_shapes, should be

    0: 5
    1: 6
    2: 6
    3: 6

you definitely gave:

    0: 5
    1: 6
    2: 12
    3: 24

check comment #20

    for i in range(4):
        l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
        pprint((i, layout(i, part_counts, l)))
(In reply to Luke Kenneth Casson Leighton from comment #73)
> i think we're slowly getting to the bottom of where the assumptions
> are that caused you to believe that it is impossible to use nmigen
> with such tiny modifications (and to believe that it is undesirable
> to do so).
>
> it revolves around elwidths and the definition of Shape.width.
>
> the past 18 months' work has been fundamentally based on the assumption
> that it is, by design, possible to fit. much of that time was spent
> verifying that that was possible.
>
> the assumption / definition and all code is based on:
>
> Shape.width == SimdShape.width == len(SimdSignal._its_internalsignal)

well, the problem is that, to correctly maintain the appearance of
operating on *scalar* values needed for all our existing OpenPower ALUs
(waay more code than PartitionedSignal), we need it to look like a single
element, *not* a list. This means SimdSignal.width *has to be* the width
of a *single* element.

> we need - and other users will expect - that Casting will "just work".
> SimdSignal should be castable to Signal and all bits *directly*
> copied (including unused padding partitions) seamlessly at the bitlevel.

we need Signal/SimdSignal interconversion, but it should be done via
having non-SIMD code access SimdSignal.sig, or by a .bitcast member
function, not by just doing Signal.eq.

In particular, what you're advocating for (for SimdSignal to
transparently inter-convert with Signal, just 1:1 copying bits) will *not
work* nicely -- it's in conflict with what happens when a scalar is
converted to a vector, which is that the scalar is *splatted* into all
lanes of the vector, not that the scalar is split into pieces and each
piece becomes a lane.
Example (addi):

    class MyAddI(...):
        def __init__(self):
            with simd_scope(self, create_elwid=True):
                self.RT = SimdSignal(XLEN)     # output: vector
                self.RA = SimdSignal(XLEN)     # input: vector
                # input: *scalar*, since it's not a register
                self.Imm = Signal(signed(16))

        def elaborate(self, platform):
            m = Module()
            with simd_scope(self, m=m):
                sliced_imm = SimdSignal(XLEN)
                # slice off bits that don't fit in XLEN
                m.d.comb += sliced_imm.eq(self.Imm)
                m.d.comb += self.RT.eq(self.RA + sliced_imm)
            return m

In order for the semantics to be correct, the conversion from self.Imm to
a SimdSignal has to put *all* of Imm's value into each lane -- *instead
of what you are advocating for* -- what you are advocating for would
split Imm into 2-bit/4-bit/8-bit/16-bit pieces and put each of those
pieces into each corresponding lane, which is exactly what a 1:1 bit copy
does, which is *totally wrong* here.

Example (with elwid == 16-bits):

    RA = 0x0123_4567_89AB_CDEF
    instruction: addi rt, ra, 0x7531

Right way (splatting):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (lanes): [0x7531, 0x7531, 0x7531, 0x7531]
    RT (lanes):         [0xCDEF + 0x7531, 0x89AB + 0x7531,
                         0x4567 + 0x7531, 0x0123 + 0x7531]
                     == [0x4320, 0xFEDC, 0xBA98, 0x7654]
    RT = 0x7654_BA98_FEDC_4320

Wrong way (1:1 bit conversion -- convert imm to 16-bit wide simd then
sign extend):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (1:1 converted into lanes): [0x1, 0x3, 0x5, 0x7]
    RT (lanes):         [0xCDEF + 0x1, 0x89AB + 0x3,
                         0x4567 + 0x5, 0x0123 + 0x7]
                     == [0xCDF0, 0x89AE, 0x456C, 0x012A]
    RT = 0x012A_456C_89AE_CDF0

Another wrong way (1:1 bit conversion -- sign extend imm to simd width
then convert):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (1:1 converted into lanes, only lower 16-bits of full
                64-bits had anything): [0x1357, 0x0, 0x0, 0x0]
    RT (lanes):         [0xCDEF + 0x1357, 0x89AB + 0x0,
                         0x4567 + 0x0, 0x0123 + 0x0]
                     == [0x7654, 0x89AB, 0x4567, 0x0123]
    RT = 0x0123_4567_89AB_7654

> this *is* how
nmigen works. all Value-derivatives *are* copyable > from one to the other by making the fundamental assumption that > when converted to bitlevel they are all effectively "the same". Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal, since nmigen expects Values to act like a single scalar value, whereas SimdSignal acts like a *list* of scalar values -- aka. a Vector.
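the "right way (splatting)" numbers from the addi example above can be
verified with a short sketch (plain Python; the helper names are
hypothetical):

```python
def to_lanes(value, lane_bits=16, nlanes=4):
    """split a packed SIMD value into a list of lane values"""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(nlanes)]

def from_lanes(lanes, lane_bits=16):
    """pack a list of lane values back into one integer"""
    out = 0
    for i, v in enumerate(lanes):
        out |= v << (i * lane_bits)
    return out

RA = 0x0123_4567_89AB_CDEF
imm = 0x7531

ra_lanes = to_lanes(RA)        # [0xCDEF, 0x89AB, 0x4567, 0x0123]
splat = [imm] * 4              # the scalar is broadcast into every lane
rt_lanes = [(a + b) & 0xFFFF for a, b in zip(ra_lanes, splat)]
RT = from_lanes(rt_lanes)
print(hex(RT))                 # 0x7654ba98fedc4320
```

note how each 16-bit lane add wraps independently (the `& 0xFFFF`),
which is exactly what distinguishes the splatted SIMD add from a plain
64-bit scalar add.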
(In reply to Luke Kenneth Casson Leighton from comment #81) > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items()) > > that's definitely dividing the lane_shape _total_ by the part count. that's because that's computing the width of *one* part. a single lane can occupy multiple parts (a lane occupies part_counts[i] parts) > you definitely gave: > > 0: 5 > 1: 6 > 2: 12 > 3: 24 yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6, and 6. That was to demonstrate how the part width is calculated to fit the minimum required bits, which can be less than a lane's width, because that lane occupies multiple parts.
(In reply to Jacob Lifshay from comment #83) > (In reply to Luke Kenneth Casson Leighton from comment #81) > > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items()) > > > > that's definitely dividing the lane_shape _total_ by the part count. it's dividing the width of one lane by the number of parts in that lane, getting the number of bits that lane needs per-part. > > that's because that's computing the width of *one* part. a single lane can > occupy multiple parts (a lane occupies part_counts[i] parts)
(In reply to Jacob Lifshay from comment #82)
> well, the problem is that, to correctly maintain the appearance of
> operating on *scalar* values needed for all our existing OpenPower ALUs
> (waay more code than PartitionedSignal), we need it to look like a single
> element, *not* a list. This means SimdSignal.width *has to be* the width
> of a *single* element.

no! jacob: i *already have* SimdSignal Assign functional and correctly
working. note, i did not say "eq", i said "Signal.like", and talked about
casting. that's completely different.

to move the topic to SimdSignal.eq:

* SimdSignal.eq *already detects* the type of object that it is being
  Assigned to.
* for a SimdSignal.eq(Signal) it performs a scalar-to-vector
  elwidth-context-aware duplication (Vector ISA "broadcast").
* for a SimdSignal.eq(SimdSignal) it performs a full range of detection
  of partition sizes (smaller, equal, greater) and deals with them.

*this is already done*. *it is functional*. you are weeks if not months
behind on the functionality of the design, which *already works*.

here is the relevant code:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/assign.py;h=2a79cbbc596b65ea893b02cbeb79127811326bc5;hb=4fca5d99b872bf14c61d2e73932c9bf340cfd22d#l61

that's all it takes. it's even documented:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/assign/

note: it even works with signed and unsigned, performing
zero/sign-extension or truncation as necessary, for both vector-to-vector
and scalar-to-vector assignment. furthermore, vector-to-scalar
(scalar.eq(vector)) will work by auto-casting down through UserValue
down-casting to bit level, which is exactly what is expected and desired.

> > we need - and other users will expect - that Casting will "just work".
> > SimdSignal should be castable to Signal and all bits *directly*
> > copied (including unused padding partitions) seamlessly at the bitlevel.
> we need Signal/SimdSignal interconversion, but it should be done via
> having non-SIMD code access SimdSignal.sig, or by a .bitcast member
> function, not by just doing Signal.eq.

i did the analysis 2 weeks ago, successfully worked it out, and
successfully implemented Signal.eq. there is no need and absolutely no
desire to complicate things that have already been successfully done and
completed.

> In particular, what you're advocating for (for
> SimdSignal to transparently inter-convert with Signal),

for *casting* - yes. when the developer *asks* for that, by using
Signal.like and casting... yes.

> just 1:1 copying
> bits, will *not work* nicely -- it's in conflict with what happens when a
> scalar is converted to a vector, which is that that scalar is *splatted*
> into all lanes of the vector, not that the scalar is split into pieces
> and each piece becomes a lane.

already handled.

> Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal,

no, it is not! it's already been implemented, and *is* compatible!

> since nmigen expects Values to act like a single scalar value,

no. whitequark *wants* to *define* nmigen to be thus. whitequark, just
before banning me from participation on #nmigen, made it bluntly clear,
as part of reneging on the agreement back in april, that this decision
was made.

whether the code *can* support full abstraction (which, with a mere
420-line diff, it can), whitequark unilaterally decided that that was
completely irrelevant.

> whereas
> SimdSignal acts like a *list* of scalar values -- aka. a Vector.

yes. and therefore the entire suite of operators of SimdSignal can be
*defined* to act in list-form, at every single level in the entirety of
ast.* operations.

it *works* - full stop. there is not one single unit test in the entirety
of test_partsig.py which is not already demonstrating and proving that
this concept - fact - works - fact. please listen to what i am telling
you: it works! fact!
have you run the unit test and / or contributed to the individual modules?
(In reply to Jacob Lifshay from comment #83)
> (In reply to Luke Kenneth Casson Leighton from comment #81)
> > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items())
> >
> > that's definitely dividing the lane_shape _total_ by the part count.

i did say: it's important to stop using the word "lane" because it has
multiple definitions.

> yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6,
> and 6. That was to demonstrate how the part width is calculated to fit the
> minimum required bits, which can be less than a lane's width, because that
> lane occupies multiple parts.

ok. so now we come back to comment #75: defining widths in terms of
"lanes" is the wrong thing to do.

what you are calling "lane" is "part_count" times "element width".

what is *actually needed* is just "element width".

this is because when converting Scalar code to SIMD, it is necessary to
allow a Signal of width e.g. 16 to be multiplied by the number of
partitions *automatically* when that Signal class is
contextually-replaced with the SimdSignal class (via a compile-time
switch).

if you force people to put what you call a "lane" width in, that Signal
width of 16 will become subdivided by 2 for elwidth=32, subdivided by 4
for elwidth=16, and subdivided by 8 for elwidth=8.

how can you fit a 16-bit computation into 2-bit data when elwidth=8?

you can't, can you?

therefore, so as not to make massive intrusive changes to the Scalar
code, polluting it with SIMD-style context that it doesn't need, the
*element* width has to be passed in to layout(), *not* element width
times partition count.

by passing that *element* width in, layout() will allocate 16-bit
elements *of that size*, at *all* elwidths. this is the desired
behaviour; element width times partition count is not.

is this clear, now?
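the point about passing the *element* width can be shown numerically
(plain Python sketch; the totals computation is my assumption about how
layout() would scale the overall signal):

```python
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
elem_width = 16   # the *element* width, identical at every elwidth

# desired: the SIMD signal grows with the partition count, while each
# element stays 16 bits wide at every elwidth
totals = {ew: elem_width * c for ew, c in part_counts.items()}
print(totals)      # {0: 16, 1: 16, 2: 32, 3: 64}

# broken interpretation: a 16-bit "lane" width gets subdivided by the
# partition count, leaving elements too small to hold the computation
subdivided = {ew: elem_width // c for ew, c in part_counts.items()}
print(subdivided)  # {0: 16, 1: 16, 2: 8, 3: 4}
```

with the broken interpretation the 16-bit computation is squeezed into
4-bit (or smaller) elements at the higher partition counts, which is
exactly the "can't fit" problem described above.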
(In reply to Luke Kenneth Casson Leighton from comment #85)
> (In reply to Jacob Lifshay from comment #82)
> > > we need - and other users will expect - that Casting will "just work".
> > > SimdSignal should be castable to Signal and all bits *directly*
> > > copied (including unused padding partitions) seamlessly at the bitlevel.

Ok, I read "casting" as "Signal.eq", and didn't think beyond that, my
bad.

> > Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal,
>
> no, it is not! it's already been implemented, and *is* compatible!
>
> > since nmigen expects Values to act like a single scalar value,
>
> no. whitequark *wants* to *define* nmigen to be thus.

And I agree with her there: nmigen has always been scalar, where each
Value is a single value, not a list of values. Basically everybody
expects nmigen's Value to be a scalar. If you insist on rewriting nmigen
to change that fundamental assumption, you'll end up with a different
language that is called nmigen only because you are confusing your
fundamental assumption-breaking changes as being compatible with nmigen,
and you can sorta get away with it because nmigen is written in Python
and only checks some of the typing.

I'm advocating for making a set of classes SimdSignal, SimdValue,
SimdShape, etc. that effectively is a Simd-ified version of nmigen
Signal, Value, Shape, etc. *and they're named differently*, because they
fundamentally are lists of scalars, not single scalars.

> whitequark,
> just before banning me from participation on #nmigen,

oh, sad and surprised to hear that... I'll wait a while and then see if I
can convince whitequark to let you back, since your behavior wasn't bad
enough to warrant a ban imho (unless something happened that I didn't
see), though apologising for making mwk feel as if you're talking down to
them wouldn't hurt (apologising for what they perceived as condescending,
which is not what you intended).
pointing out that they're being rude, while true, won't actually help
everyone get along. pointing out and asking people to not be rude is
mostly whitequark's responsibility, not yours, since she is the channel
moderator. Hopefully my pointing this out will help everyone; my
intention isn't to hurt you but to help you realize a potentially better
way to get along with everyone.

> made it bluntly
> clear, as part of reneging on the agreement back in april, that this
> decision was made.
>
> whether the code *can* support full abstraction (which with a mere
> 420-line diff, it can), whitequark unilaterally decided that that was
> completely irrelevant.

Well, I'm saying you're 90% of the way there, but the last 10% of
abstraction is the harder part... it's easier and more consistent to
just have a separate API that mirrors nmigen's API, allowing unmodified
nmigen to work with our code, as well as making our code much clearer.
This wouldn't require that much effort, since we can use most of the
existing PartitionedSignal code with minor modifications and renaming.

> > whereas
> > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
>
> yes. and therefore the entire suite of operators of SimdSignal can be
> *defined* to act in list-form, at every single level in the entirety of
> ast.* operations.

I'm saying the existing design only does a 90% job: it covers list-form
+, -, *, &, ^, |, .eq, etc. It does *not* cover Shape.width,
Signal.like, etc. I'm saying we should finish off getting to 100% of
the API looking like a transparently vectorized (aka SIMT) version of
nmigen Shape/Signal/Value/etc., just with different names, since that
will make it much easier to use/understand.

Note that Shape.width (lane width) not being a plain integer comes from
lanes being XLEN-width, not from being vectorized. Those are orthogonal,
unrelated changes. We could have a totally scalar Signal/Value/etc.
that are all XLEN-ified, that would require modifying Shape.width to be multivalued. We could also vectorize our totally 64-bit-only code, that would leave Shape.width == 64 (since that's the *element* width), even if a vector was 256-bits wide.

+----------------+---------------+-------------------------------------------+
| XLEN-ification | Vectorization |                                           |
+================+===============+===========================================+
| No             | No            | Original Scalar code;                     |
|                |               | Shape.width == 64                         |
+----------------+---------------+-------------------------------------------+
| No             | Yes           | SimdShape.width==64,                      |
|                |               | SimdShape.simd_width==128                 |
+----------------+---------------+-------------------------------------------+
| Yes            | No            | XLShape.width==XLEN==                     |
|                |               | XLMap({I8:8,I16:16,I32:32,I64:64})        |
+----------------+---------------+-------------------------------------------+
| Yes            | Yes           | SimdShape.width==XLEN==                   |
|                |               | SimdMap({I8:8,I16:16,I32:32,I64:64}),     |
|                |               | SimdShape.simd_width==128                 |
+----------------+---------------+-------------------------------------------+

> have you run the unit test and / or contributed to the individual modules?

Yes, I have -- don't forget that I'm the one who originally wrote the PartitionedAdd and PartitionedMultiply -- the inspiration for this whole thing.
(In reply to Luke Kenneth Casson Leighton from comment #86)
> (In reply to Jacob Lifshay from comment #83)
> > yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6,
> > and 6. That was to demonstrate how the part width is calculated to fit the
> > minimum required bits, which can be less than a lane's width, because that
> > lane occupies multiple parts.
>
> ok. so now we come back to comment #75, that by defining widths in terms
> of "lanes" is the wrong thing to do.
>
> what you are calling "lane" is "part_count" times "element width".

no, I'm explicitly not. I explicitly stated that the definition of "lane" I'm using is a vector element. a part is not an element, it is a fixed number of bits (doesn't change with different elwid). a vector element + its padding is made of a power-of-two number of parts; that number of parts is "part_count".

> what is *actually needed* is just, "element width".

yes, exactly (once you add in padding). I've been saying that the whole time.

> this is because when converting Scalar code to SIMD, it is necessary
> to allow a Signal of width e.g. 16 to be multiplied by the number
> of partitions *automatically* when that Signal class is contextually-replaced
> with SimdSignal class (via a compile-time switch).
>
> if you force people to put what you call a "lane" width in,
> that Signal width of 16 will become subdivided by 2 for elwidth=32,
> subdivided by 4 for elwidth=16 and subdivided by 8 for elwidth=8
>
> how can you fit a 16-bit computation into 2-bit data when elwidth=8?
>
> you can't, can you?

Of course you can't if you *insist* on the SimdSignal constructor taking the full-simd-width. It works great, by contrast, if the SimdSignal constructor takes the *element width*, as I've been advocating for a while.

> this is the desired behaviour
>
> elwidth times partition count is not.
>
> is this clear, now?

Yup totally!
After talking it over in today's virtual coffee meeting, we agreed that we should work on finishing the existing design, and leave improvements for later. I do think that having just the XLEN global idea will save us a huge amount of time, so I'm going to try to build a version with just that added, and leave the rest mostly unmodified.
(In reply to Jacob Lifshay from comment #88)
> Of course you can't if you *insist* on the SimdSignal constructor taking the
> full-simd-width. It works great, by contrast, if the SimdSignal constructor
> takes the *element width*, as I've been advocating for a while.

brief: (1am), both are needed. see comment #63.

(In reply to Jacob Lifshay from comment #89)
> I do think that having just the XLEN global idea will save us a huge amount
> of time,

yes. but only if its class is SimdShape which derives from Shape.

    self.XLEN = SimdShape(64, part_count={0:1, 1:2, 2:4, 3:8})

where XLEN//2 returns a new SimdShape with overall width=32, partition counts the same, and vector element widths halved (//2) at all elwidths.

a different mode is needed for when a SimdShape is declared WITHOUT a fixed width, but instead with fixed vector element widths. in that mode the arithmetic operations are done *on the vector element widths*.

two modes. this is essential.

> so I'm going to try to build a version with just that added, and
> leave the rest mostly unmodified.

make it SimdShape. derive from Shape. recalculate layout based on vector element width being passed in.

but first the bug in layout() needs to be fixed.
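The two modes described above can be roughly sketched as follows. This is a hypothetical illustration only, not the actual libre-soc code: the class name and `vec_el_counts` keyword follow the discussion, and the real SimdShape is intended to derive from nmigen's Shape.

```python
class SimdShape:
    """Hedged sketch of the two modes: fixed-width priority (element
    widths derived by dividing the overall width) versus element-width
    priority (overall width derived from the element widths)."""

    def __init__(self, lane_shapes=None, fixed_width=None,
                 vec_el_counts=None):
        assert lane_shapes is not None or fixed_width is not None
        self.vec_el_counts = dict(vec_el_counts)
        if lane_shapes is None:
            # mode 1: fixed width given, auto-derive the element widths
            self.width = fixed_width
            self.lane_shapes = {ew: fixed_width // c
                                for ew, c in self.vec_el_counts.items()}
        else:
            # mode 2: element widths given, derive the overall width
            self.lane_shapes = dict(lane_shapes)
            self.width = max(w * self.vec_el_counts[ew]
                             for ew, w in self.lane_shapes.items())

    def __floordiv__(self, n):
        # XLEN // 2: halve every vector element width, recompute width
        return SimdShape(lane_shapes={ew: w // n
                                      for ew, w in self.lane_shapes.items()},
                         vec_el_counts=self.vec_el_counts)


# XLEN as declared above: fixed width 64, partition counts {1,2,4,8}
XLEN = SimdShape(fixed_width=64, vec_el_counts={0: 1, 1: 2, 2: 4, 3: 8})
```

With this sketch, `XLEN // 2` yields lane_shapes {0: 32, 1: 16, 2: 8, 3: 4} and an overall width of 32, matching the behaviour described in the comment.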
(In reply to Luke Kenneth Casson Leighton from comment #90) > (In reply to Jacob Lifshay from comment #88) > > > Of course you can't if you *insist* on the SimdSignal constructor taking the > > full-simd-width. It works great, by contrast, if the SimdSignal constructor > > takes the *element width*, as I've been advocating for for a while. > > brief: (1am), both are needed. see comment #63. yup, both are needed, but full-simd-width is passed in through a scoped-global, not through SimdSignal.__init__ arguments. > > (In reply to Jacob Lifshay from comment #89) > > > I do think that having just the XLEN global idea will save us a huge amount > > of time, > > yes. but only if its class is SimdShape which derives from Shape. wait...I am planning on XLEN being a type that can be passed into SimdShape, but it isn't a Shape or a SimdShape -- it acts like an integer, not a Shape.
(In reply to Luke Kenneth Casson Leighton from comment #90) > but first the bug in layout() needs to be fixed. yeah yeah, i'll work on that first.
(In reply to Jacob Lifshay from comment #91)
> yup, both are needed, but full-simd-width is passed in through a
> scoped-global, not through SimdSignal.__init__ arguments.

no. won't work. parameter to SimdShape. basically, parameters to layout() become parameters to SimdShape.

full_simd_width named parameter defaults to None; this is the fixed_width param of layout(). gets put into Shape.width directly.

SimdShape constructor 1st arg: if int, this is the vector element width. if dict, it is the layout elwidths thing. actual width is computed and set in Shape.width

> > (In reply to Jacob Lifshay from comment #89)
> >
> > > I do think that having just the XLEN global idea will save us a huge amount
> > > of time,
> >
> > yes. but only if its class is SimdShape which derives from Shape.
>
> wait...I am planning on XLEN being a type

too much focus on "types": transferring concepts from rust is not appropriate. XLEN as a SimdShape, fixed_width=64. see url at top of bugreport.
(In reply to Luke Kenneth Casson Leighton from comment #93) > (In reply to Jacob Lifshay from comment #91) > > wait...I am planning on XLEN being a type > > too much focus on "types", transferring concepts from rust not > appropriate. my point is that the class that XLEN is an instance of, *should* be general enough to use for non-integers (since it would be handy for things like m.Case() patterns, and other places). If we split out the arithmetic from SimdShape into the separate class (SimdMap), then it can handle all the arithmetic, patterns, etc. I'm going to try it out as a separate class, and you can see what you think once it's in there...if it doesn't work, then we can merge it back into SimdShape pretty easily.
(In reply to Jacob Lifshay from comment #94) > (In reply to Luke Kenneth Casson Leighton from comment #93) > > (In reply to Jacob Lifshay from comment #91) > > > wait...I am planning on XLEN being a type > > > > too much focus on "types", transferring concepts from rust not > > appropriate. > > my point is that the class that XLEN is an instance of, *should* be general > enough to use for non-integers (since it would be handy for things like > m.Case() patterns, and other places). not in any existing ALU (there's 12) all Case()s involve operand fields. case extsb, case extsw, case extsh etc. all scalar. > If we split out the arithmetic from > SimdShape into the separate class (SimdMap), then it can handle all the > arithmetic, patterns, etc. > > I'm going to try it out as a separate class, and you can see what you think > once it's in there...if it doesn't work, then we can merge it back into > SimdShape pretty easily. ok. just remember, it needs 2 completely different behaviours: * fixed_width priority (recalculates laneshapes aka vector element widths) * vector element widths priority (recalculates overall width aka Shape.width aka SimdShape.width) basically calls layout() to recalculate. dont go overboard. add, mul, div, sub, is probably about it. operands int more than enough. layout needs fixing first otherwise recalc wont work.
(In reply to Luke Kenneth Casson Leighton from comment #95) > layout needs fixing first otherwise recalc wont work. Fixed: https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=6ddea00d11bbf3301c9266ec10bc5ac1ea693973;hb=9a318256b74054b8d592efe7be298764d0de415a I also added a bunch of docs to layout()
I added XLEN, and refactored layout into a SimdLayout class.

(In reply to Luke Kenneth Casson Leighton from comment #95)
> dont go overboard. add, mul, div, sub, is probably about it.
> operands int more than enough.

umm...well let's just say you won't have to worry about missing some methods...

maybe I should have put it in an alternate branch...
(In reply to Jacob Lifshay from comment #97)
> I added XLEN, and refactored layout into a SimdLayout class.

as a 957-line diff. remember i said that the Project Development Policy is to do small incremental commits?

    git diff 4fca5d | wc
        957    4486   36684

> (In reply to Luke Kenneth Casson Leighton from comment #95)
> > dont go overboard. add, mul, div, sub, is probably about it.
> > operands int more than enough.
>
> umm...well lets just say you won't have to worry about missing some
> methods...

this will take several days for me to review and understand. far from moving the project forward when we are under time pressure, that unfortunately delays it even more.

> maybe I should have put it in an alternate branch...

it's not the branch, it's that it's such a massive diff that it stops me from being able to understand it.

ok. what to do. what have i seen other project lead developers do in the past... i know: yes, it needs reversion, and moving to a branch, and the creation of a series of "planned" (incremental) patches:

1) add SimdMap and util.py but do *not* modify layout_experiment.py
2) fix the bugs in layout_experiment.py (and leave in the accidental example where the layout was {0:5 1:6 2:6 3:6})
3) update layout_experiment.py to a class but do *not* get it to use SimdMap (because it hasn't been reviewed / authorised)
4) then we join the two together

also, leave in the dots:

    + # elwid=F64  1x 24-bit |<--------i24-------->|
    + # elwid=F32  2x 12-bit |<---i12--->|<---i12--->|
    + # elwid=F16  4x 6-bit  |<-i6>|<-i6>|<-i6>|<-i6>|
    + # elwid=BF16 4x 5-bit  |.<i5>|.<i5>|.<i5>|.<i5>|

what you'd done in the (957-line diff) was remove the examples which i'd spent some time analysing and understanding. i _was_ going to add Asserts (which would eventually form the basis of a proper unit test) showing that layout is producing the wrong answer, but because you have made *two* changes combined into one, it is now very difficult to illustrate to you what i mean.
you changed this:

    widths_at_elwidth = {
        0: 5,
        1: 6,
        2: 12,
        3: 24
    }

to this:

    widths_at_elwidth = {
        FpElWid.F64: 24,
        FpElWid.F32: 12,
        FpElWid.F16: 6,
        FpElWid.BF16: 5,
    }

(and likewise part_count)

    part_counts = {
    -    0: 1,
    -    1: 1,
    -    2: 2,
    -    3: 4,
    +    FpElWid.F64: 4,
    +    FpElWid.F32: 2,
    +    FpElWid.F16: 1,
    +    FpElWid.BF16: 1,
    }

which is *two* changes:

1) changing of the actual vector element widths
2) changing of the specification of elwidth from int type to FpElWid type

both of these compounded changes now make it extra work to explain. in the existing layout_experiment.py (as of yesterday), the following was done:

    widths_at_elwidth = {
        0: 5,   # not divided        --> vec element = 5
        1: 6,   # not divided        --> vec element = 6
        2: 12,  # gets divided by 2  --> vec element = 6
        3: 24   # gets divided by 4  --> vec element = 6
    }

oh s*** i think i now know what you were talking about, finally.

    + FpElWid.F64: 4,
    + FpElWid.F32: 2,
    + FpElWid.F16: 1,
    + FpElWid.BF16: 1,

you've still specified F64 as comprising *four* partitions. the requirement is to *NOT* specify the number of *BYTES*. the requirement is to specify the NUMBER OF ELEMENTS. we *require* part_count to be:

    part_counts = {
        0: 1,  # at elwidth=0b00, the number of vector elements = 1
        1: 1,  # at elwidth=0b01, the number of vector elements = 1
        2: 2,  # at elwidth=0b10, the number of vector elements = 2
        3: 4,  # at elwidth=0b11, the number of vector elements = 4
    }

**NOT** "the total number of bytes at elwidth=0b00 shall be one (1)".

much as i love the documentation in layout_experiment.py, unfortunately because it is an "all-or-nothing" commit, the entire lot needs reversion (i'll handle it). util.py can stay as it is.

i really do love the documentation in layout_experiment.py btw. once reverted i'll see if i can extract it.
ok that's done. i was able to keep the documentation / glossary because it was prior to the conversion to a class.

now. i finally understood that you've been specifying part_count as the number of bytes. this will restrict the layout and create too many partition points (14 rather than 5 in the case of the 5-6-6-6 example).

i've *partly* fixed that with a 1-line commit:

    commit 6a50b2d1649fbf1f22be783eaa5ebbb9cdf72f64 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 12:51:09 2021 +0100

        redefine part_counts to be "number of vector elements in a partition"

fixes the bug: it removes the divide operation:

    - cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
    + cpart_wid = [-lane_shapes[i] for i, c in part_counts.items()]

therefore, actually, cpart_wid probably becomes just max(lane_shapes.values()):

    - cpart_wid = [-lane_shapes[i] for i, c in part_counts.items()]
    - print("cpart_wid", cpart_wid, "part_counts", part_counts)
    - cpart_wid = -min(cpart_wid)
    + print("lane_shapes", lane_shapes, "part_counts", part_counts)
    + cpart_wid = max(lane_shapes.values())

yep, that works.

now, next thing: fix the documentation, as well as rename part_counts to... to... vec_el_counts, so as to avoid this confusion. vec_el_counts is "the number of vector elements required at each elwidth" ... done

    commit 74ec642c0f4eb67a80b785a191ae11875416300a (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:06:25 2021 +0100

        rename part_counts to vec_el_counts to indicate that it is intended
        to be the count of the number of vector elements within a partition

so now there is just one bug remaining:

    5,6,6,6 elements {0: 5, 1: 6, 2: 6, 3: 6}
    lane_shapes {0: 5, 1: 6, 2: 6, 3: 6} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 24 6 4
    dpoints {5: [0], 6: [0, 1, 1, 2, 3], 11: [0], 12: [0, 1, 1, 2], 17: [0], 18: [0, 1, 1, 2], 23: [0]}

the bug is: 11, 17 and 23 should *not* have partition points.
there should only be one element added at elwidth==0, but instead *FOUR* are being added:

* when elwidth=0 the vec_el_count is 1, so one is being added from 0-5
* but there is also one being added from 6-11
* and another from 12-17
* and another from 18-23

also, i see only one when elwidth==3. so it is inverted. i will see if i can fix that.
that's it. that does the job.

         for i, c in vec_el_counts.items():
    +        # calculate part_wid based on overall width divided by number
    +        # of elements.
    +        part_wid = width // c

             def add_p(p):
                 dpoints[p].append(i)  # auto-creates list if key non-existent

    -        for start in range(0, part_count, c):
    +        # for each elwidth, create the required number of vector elements
    +        for start in range(c):

    commit a52b594062be6fd16d37b1f906904720bdda3525 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:25:09 2021 +0100

        fix layout() to put in only the number of *requested* vector elements

    5,6,6,6 elements {0: 5, 1: 6, 2: 6, 3: 6}
    lane_shapes {0: 5, 1: 6, 2: 6, 3: 6} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 24 6 4
    dpoints {5: [0], 6: [1, 2, 3, 3], 12: [2, 3, 3], 18: [2, 3, 3]}

HA! excellent. that's the right answer. only 4 breakpoints are created, not 14.

    211     # specify that the Vector Element lengths are to be *different* at
    212     # each of the elwidths.
    213     # combined with vec_el_counts we have:
    214     # elwidth=0b00 1x 5-bit  | <-- unused -->....5 |
    215     # elwidth=0b01 1x 6-bit  | <-- unused -->.....6 |
    216     # elwidth=0b10 2x 12-bit | unused .....6 | unused .....6 |
    217     # elwidth=0b11 3x 24-bit | .....6 | .....6 | .....6 | .....6 |
    218     # expected partitions   (^)       ^       ^^                (^)
    219     # to be at these points:(|)       |       ||                (|)
                                             18      12 65

the breakpoints at 5, 6, 12 and 18 are now correctly created. i'll add an assert to that effect.

(edit:

    commit 93cc0c33f0177226ee156a7623642335d0a74e71 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:36:47 2021 +0100

        add assert to check that the 5-6-6-6 example returns the expected
        partition points 5,6,12,18
)
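The fixed loop can be re-derived as a standalone sketch (hypothetical function name; this is not the actual layout() code, just a reproduction of the rule it now implements): each elwidth contributes only the *requested* number of vector elements, with a partition point at the start and at the end of the bits each element actually uses.

```python
from collections import defaultdict

def partition_points(width, lane_shapes, vec_el_counts):
    """Sketch of the dpoints calculation discussed above: for each
    elwidth, only the requested number of vector elements contributes
    partition points (hypothetical, assumptions: width divides evenly
    by every count)."""
    dpoints = defaultdict(list)
    for ew, count in vec_el_counts.items():
        part_wid = width // count    # space reserved per element
        el_wid = lane_shapes[ew]     # bits the element actually uses
        for start in range(count):
            begin = start * part_wid
            end = begin + el_wid
            if begin != 0:           # no point at bit 0
                dpoints[begin].append(ew)
            if end != width:         # no point at the top bit
                dpoints[end].append(ew)
    return {k: sorted(v) for k, v in sorted(dpoints.items())}
```

Run on the 5-6-6-6 example it reproduces exactly the fixed output quoted above: `{5: [0], 6: [1, 2, 3, 3], 12: [2, 3, 3], 18: [2, 3, 3]}`, i.e. 4 breakpoints rather than 14.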
    commit 27003c0bfe781d4da60a6eff4e26875f4b923f68 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 15:23:03 2021 +0100

        remove signed. again

i've removed the signed argument from layout(). again. please don't add it in again. it is in absolutely no way relevant to the computations going on inside layout() and it is a major distraction. if the widths were in any way altered *by* the sign, *then and only then* would it be relevant to pass it in as a parameter.

please *only* put in what is *necessary* to each function. sign is *never* going to be relevant to this function.
this example is going wrong:

    # fixed_width=32 and no lane_widths says "allocate maximum"
    # i.e. Vector Element Widths are auto-allocated
    # elwidth=0b00 1x 32-bit | .................32 |
    # elwidth=0b01 1x 32-bit | .................32 |
    # elwidth=0b10 2x 16-bit | ......16 | ......16 |
    # elwidth=0b11 4x 8-bit  | ..8| ..8 | ..8 |..8 |
    # expected partitions (^)     |     |     |   (^)
    # to be at these points: (|)  |     |     |   (|)

output:

    maximum allocation from fixed_width=32
    lane_shapes {0: 32, 1: 32, 2: 16, 3: 8} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 128 32 4

what is happening is:

    cpart_wid = max(lane_shapes.values())     ==> 32
    part_count = max(vec_el_counts.values())  ==> 4

therefore 4x32 == 128, which is ridiculous and completely wrong: the answer (as can be seen from the manual analysis) should be 32.

a first cut at a "fix" for this is to do something along the lines of:

    cpart_wid = 0
    for i, ls in lane_shapes.items():
        cpart_wid = max(cpart_wid, width // ls)

something like that. i'll try to work through it.
    cpart_wid = max(lane_shapes.values())
    width = cpart_wid * part_count

replaced with:

    + cpart_wid = 0
    + width = 0
    + for i, lwid in lane_shapes.items():
    +     required_width = lwid * vec_el_counts[i]
    +     print("    required width", cpart_wid, i, lwid, required_width)
    +     if required_width > width:
    +         cpart_wid = lwid
    +         width = required_width

    commit 316c02716c365eee298ca3da5a65c3be58b57a1e (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 15:44:19 2021 +0100

        fix issue where width was being computed based on 2 maximum values

        actually needed is to multiply the number of elements by the width
        of an element and use that to determine which is greater
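The fix can be illustrated standalone (hypothetical helper name; the real code lives inside layout()): the overall width is the largest per-elwidth product of element width times element count, not the product of the two independent maxima.

```python
def compute_width(lane_shapes, vec_el_counts):
    """Sketch of the width fix above: the width needed at each elwidth
    is element-width * element-count; the overall SIMD width is the
    largest of those requirements (hypothetical helper, not the actual
    layout() code)."""
    width, cpart_wid = 0, 0
    for ew, lwid in lane_shapes.items():
        required_width = lwid * vec_el_counts[ew]
        if required_width > width:
            cpart_wid = lwid
            width = required_width
    return width, cpart_wid
```

With the "maximum allocation" example from the previous comment (lane_shapes {0: 32, 1: 32, 2: 16, 3: 8}, vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4}) every product is 32, so this returns width 32 instead of the bogus 128; the 5-6-6-6 example correctly gives 24.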
ok, added this example:

    # example "exponent"
    # https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
    # 1xFP64: 11 bits, one exponent
    # 2xFP32:  8 bits, two exponents
    # 4xFP16:  5 bits, four exponents
    # 4xBF16:  8 bits, four exponents
    vec_el_counts = {
        0: 1,  # QTY 1x FP64
        1: 2,  # QTY 2x FP32
        2: 4,  # QTY 4x FP16
        3: 4,  # QTY 4x BF16
    }
    widths_at_elwidth = {
        0: 11,  # FP64 ew=0b00
        1: 8,   # FP32 ew=0b01
        2: 5,   # FP16 ew=0b10
        3: 8    # BF16 ew=0b11
    }

which is returning these results:

    dict_keys([5, 8, 11, 13, 16, 21, 24, 29])

which doesn't match expected:

    # expected results:
    #
    #       |31|  |  |24|   16|15  |  | 8|7     0 |
    #       |31|28|26|24|  |20|16| 12|  |10|8|5|4 0 |
    # 32bit |  x|  x|  x|  |  x| x| x|10 .... 0 |
    # 16bit |  x|  x|26 ... 16 |  x| x|10 .... 0 |
    # 8bit  |  x|28 .. 24| 20.16|  x|11 .. 8|x|4.. 0 |
    # unused x                  x

drat. i don't know why. it's calculating the correct (expected) width of 32. unsure what's going on, here.
i added another example which i *think* was what the (original) test was supposed to be, from comment #20:

    l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}

i set the following vector element widths:

    + widths_at_elwidth = {
    +     0: 24,  # QTY 1x 24
    +     1: 12,  # QTY 1x 12
    +     2: 5,   # QTY 2x 5
    +     3: 6    # QTY 4x 6
    + }

and it produced reasonable expected results.

    commit 9ea45dcbddc385f394e2c86c4dd4f2a5b98686b9 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 17:55:49 2021 +0100

        create quick test of what 24-12-5-6 layout was likely-expected to be
I read through all the new layout() code; it looks mostly fine, though afaict the computation of bmask is incorrect if the objective was to figure out which partitions are always padding.
(In reply to Jacob Lifshay from comment #106) > I read through all the new layout() code, looks mostly fine, though afaict > the computation of bmask is incorrect if the object was to figure out which > partitions are always padding. yes, i can't quite work it out. if you can fix that, these algorithmic things do my head in.
(In reply to Luke Kenneth Casson Leighton from comment #107) > (In reply to Jacob Lifshay from comment #106) > > I read through all the new layout() code, looks mostly fine, though afaict > > the computation of bmask is incorrect if the object was to figure out which > > partitions are always padding. > > yes, i can't quite work it out. if you can fix that, these algorithmic > things do my head in. yeah, I can do that tomorrow (actually today, but you get what I meant).
(In reply to Jacob Lifshay from comment #108)
> yeah, I can do that tomorrow (actually today, but you get what I meant).

:)

the blanking mask is important to allow submodules to not allocate unnecessary gates. recap:

* with the width(s) now being the vector element width, intermediary scalar computations "mb = Signal(6)" for e.g. rlwinm will now be straight-SIMDified without any other code changes other than s/Signal/context.SignalKls/:

      with ctx:
          mb = ctx.SignalKls(6)  # 6 => lane_shapes => {0:6, 1:6, 2:6, 3:6}

* by making XLEN a SimdShape it becomes possible to "notify" ctx.SignalKls (when that class is set equal to SimdSignal): "i do not want you to allocate something that is a vector element width of XLEN, i want you to allocate a *fixed width* SIMD Signal"

      class SimdShape(Shape):
          def __init__(self, lane_shapes=None, signed=False, fixed_width=None):
              # detect type (fixed, laneshapes) and store that

* by having SimdShape itself store the detected type (fixed type or laneshape type), when arithmetic is performed the layout() function may be re-called to re-compute the points, dpoints, pbit, blanking mask etc. thus ss + 5 when in laneshape mode will add five to ALL of the laneshape Vector Element lengths (!!!) and as a result the overall width will go up by (like) four TIMES five. but when added in "fixed" mode it should add five to the *whole* length then recompute the Vector Lane widths using the partition counts. obviously here if the programmer fails to take into account the partition allocation, some of the computed Vector Element widths will not fit (XLEN=64 + 3 will be 67, which is not divisible by 8), but that is their problem to sort out

* with the arithmetic being *dual purpose* and operating on laneshapes it makes no sense to have XOR or AND etc. operators (which is why i said "rudimentary"): what does it mean to OR in bits into a variable-length, completely different set of Vector lengths?
it does however make sense to support divide, multiply, shift, add and subtract (by integers) but not much else: for anything else an exception should be thrown.

* unless the arithmetic is part of SimdShape the remapping (calling of layout()) cannot occur, hence why a separate class is not really useful
* with only rudimentary arithmetic making sense it is not a general-purpose class: it is highly specialised and limited, and again does not make sense to be separate.
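The "rudimentary arithmetic only" rule can be illustrated with a small sketch (hypothetical class, not the project's SimdShape): integer add/mul apply to every vector element width simultaneously, while a bitwise operator has no sensible meaning across different simultaneous widths and raises.

```python
class LaneWidths:
    """Hypothetical stand-in for the lane-shapes side of SimdShape,
    illustrating which operators make sense on simultaneous widths."""

    def __init__(self, lane_shapes):
        self.lane_shapes = dict(lane_shapes)

    def _map(self, fn):
        return LaneWidths({ew: fn(w) for ew, w in self.lane_shapes.items()})

    def __add__(self, n):
        # widen every vector element width by n
        return self._map(lambda w: w + n)

    def __mul__(self, n):
        # scale every vector element width by n
        return self._map(lambda w: w * n)

    def __or__(self, other):
        # bitwise ops across simultaneous widths are meaningless
        raise TypeError("bitwise OR on simultaneous vector element "
                        "widths has no meaning")
```

For example, `LaneWidths({0: 8, 1: 16, 2: 32, 3: 64}) + 2` gives widths {0: 10, 1: 18, 2: 34, 3: 66}, whereas `| 0x9` raises TypeError, matching the restriction argued for above.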
(In reply to Jacob Lifshay from comment #87)
> And I agree with her there, nmigen has always been scalar,

not being funny: stating that you agree with whitequark is actually very dangerous for the project. it will be used as ammunition to sabotage what we are doing.

> where each Value
> is a single value, not a list of values. Basically everybody expects
> nmigen's Value to be a scalar. If you insist on rewriting nmigen to change
> that fundamental assumption, you'll end up with a different language,

no, false. did we design a totally new Power ISA Scalar v3.0B because of SVP64? this is the exact same thing. the entirety of the nmigen language *becomes* vectorised... WITHOUT CHANGING THAT LANGUAGE IN ANY WAY.

please, this is very frustrating: that you do not get this or accept it, and keep on advocating massive bodies of unnecessary work as a result.

> that
> is called nmigen only because you are confusing your fundamental
> assumption-breaking

no confusion. no assumptions are broken. please study the RFC and read the code until you understand properly. if you do not understand please ask questions instead of continuing to advocate alternative designs.

> oh, sad and surprised to hear that...I'll wait a while and then see if I can
> convince whitequark to let you back,

i will respond on list

> > made it bluntly
> > clear, as part of reneging on the agreement back in april, that this
> > decision was made.
> >
> > whether the code *can* support full abstraction (which with a mere
> > 420-line diff, it can), whitequark unilaterally decided that that was
> > completly irrelevant.
>
> Well, I'm saying your 90% of the way there, but the last 10% of abstraction
> is the harder part...

it's actually been pretty easy, just a hell of a lot of work (and to be honest i am annoyed that, again, you haven't helped with Cat, Repl, Part, or any of the recent additions, which i asked you to help with). there's over THIRTY operators.
i really needed your help in implementing them and you have instead spent considerable time and resources fighting every single step of the way to *not* help. i do not know why.

leaving that aside for you to think about, there are two REALLY complex ones:

* Array (and ArrayProxy)
* ast.Switch aka SimdSignal.__Switch__

i did the analysis, under bug #458, and whoo, even just the analysis takes a hell of a lot of explaining. unfortunately, because ast.Switch is the basis of m.If, m.FSM and m.Switch, i can't provide "full proof" until it's actually 100% implemented. i need 1 clear week in which to implement it.

i could have had it finished over 10 days ago if you had not kept fighting me with alternative designs that i have already documented and explained would be catastrophic failures, and had cooperated and helped with Part, Cat, Repl etc. if you would like to rectify that you can help implement the __Slice__ override.

> it's easier and more consistent to just have a separate
> API that mirrors nmigen's API, allowing unmodified nmigen to work with our
> code, as well as making our code much clearer. This wouldn't require that
> much effort, since we can use most of the existing PartitionedSignal code
> with minor modifications and renaming.

it results in a massive duplication and maintenance burden. please read the RFC: i documented clearly why what you advocate would be disastrous in multiple ways. last section:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

> > > whereas
> > > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
> >
> > yes. and therefore the entire suite of operators of SimdSignal can be
> > *defined* to act in list-form, at every single level in the entirety of ast.*
> > operations.
>
> I'm saying the existing design only does a 90% job, it covers list-form +,
> -, *, &, ^, |, .eq, etc.
>
> It does *not* cover Shape.width, Signal.like, etc.
> I'm saying we should finish off getting to 100% of the API looking like a
> transparently vectorized (aka SIMT)

SIMT, i keep telling you, is the wrong word to use. you mean SIMD.

> version of nmigen
> Shape/Signal/Value/etc.,

jacob: please listen to what i am telling you. what you envisage is EXACTLY what is being designed.

> just with different names, since that will make it
> much easier to use/understand. Note that Shape.width (lane width) not being
> a plain integer comes from lanes being XLEN-width, not from being
> vectorized. those are orthogonal unrelated changes. We could have a totally
> scalar Signal/Value/etc. that are all XLEN-ified, that would require
> modifying Shape.width to be multivalued.

yes. that is what lane_shapes is for.

> We could also vectorize our totally
> 64-bit-only code, that would leave Shape.width == 64 (since that's the
> *element* width), even if a vector was 256-bits wide.

> > have you run the unit test and / or contributed to the individual modules?
>
> Yes, I have -- don't forget that I'm the one who originally wrote the
> PartitionedAdd and PartitionedMultiply -- the inspiration for this whole
> thing.

i know, it's brilliant. i meant the more recent ones: Assign, Cat, Repl, and the other (older) ones, __eq__ etc. did you see that there are unit tests which show that scalar-vector interaction passes and functions perfectly?

Signal.like() i haven't got round to doing a unit test for yet; i expect it to work 100% because UserValue is designed specifically to allow exactly that, through the UserValue.lower() function.

i know exactly what the requirements are, having been working on this for over 18 months, and if i am honest i am getting quite annoyed that you keep telling me "what you have designed is a failure and clearly can't work" rather than taking the time to understand *how* it works, when i keep telling you it does, in fact, actually work.
(In reply to Luke Kenneth Casson Leighton from comment #109)
> (In reply to Jacob Lifshay from comment #108)
> but when added in "fixed" mode it should add five to the
> *whole* length then recompute the Vector Lane widths
> using the partition counts. obviously here if the
> programmer fails to take into account the partition
> allocation some of the computed Vector Element widths
> will not fit (XLEN=64 +3 will be 67 which is not
> divisible by 8) but that is their problem to sort out

I'd argue strongly that we should specifically not do that; what we should do instead is mirror how XLEN works in the OpenPower spec pseudo-code, where XLEN + 5 means that each per-vector-element instance ends up with types 5 bits wider, so:

    elwid=8:  XLEN + 5 -> element width of  8 + 5 == 13
    elwid=16: XLEN + 5 -> element width of 16 + 5 == 21
    elwid=32: XLEN + 5 -> element width of 32 + 5 == 37
    elwid=64: XLEN + 5 -> element width of 64 + 5 == 69

The above is the semantics I explicitly rely on in the pseudo-code for addg6s (where it creates values with a width of XLEN + 4, because it needs 4 extra bits to prevent the logic from overflowing); we should not make SimdSignal's XLEN behave differently just because you like doing arithmetic on the full SIMD width.
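The per-element XLEN arithmetic being argued for can be shown with a tiny sketch (hypothetical SimdMap-like dict and helper name; the real XLEN object is what is under discussion here):

```python
def xlen_plus(xlen_map, n):
    # XLEN + n: each elwid's element width grows by n independently,
    # mirroring how XLEN arithmetic reads in the OpenPower spec
    # pseudo-code (hypothetical helper, names are illustrative)
    return {ew: w + n for ew, w in xlen_map.items()}

# per-elwid element widths
XLEN = {8: 8, 16: 16, 32: 32, 64: 64}
```

Here `xlen_plus(XLEN, 5)` gives {8: 13, 16: 21, 32: 37, 64: 69}, and the addg6s case of XLEN + 4 gives {8: 12, 16: 20, 32: 36, 64: 68}.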
(In reply to Jacob Lifshay from comment #111) > (In reply to Luke Kenneth Casson Leighton from comment #109) > > (In reply to Jacob Lifshay from comment #108) > > but when added in "fixed" mode it should add five to the > > *whole* length then recompute the Vector Lane widths > > using the partition counts. obviously here if the > > programmer fails to take into account the partition > > allocation some of the computed Vector Element widths > > will not fit (XLEN=64 +3 will be 67 which is not > > divisible by 8) but that is their problem to sort out > > I'd argue strongly that we should specifically not do that, what we should > do instead is mirror how XLEN works in the OpenPower spec pseudo-code, where > XLEN + 5 means that each per-vector-element instance ends up with types 5 > bits wider, so: > > elwid=8: XLEN + 5 -> element width of 8 + 5 == 13 > elwid=16: XLEN + 5 -> element width of 16 + 5 == 21 > elwid=32: XLEN + 5 -> element width of 32 + 5 == 37 > elwid=64: XLEN + 5 -> element width of 64 + 5 == 69 > > The above is the semantics I explicitly rely on in the pseudo-code for > addg6s (where it creates values with a width of XLEN + 4, because it needs 4 > extra bits to prevent the logic from overflowing), we should not make > SimdSignal's XLEN behave differently just because you like doing arithmetic > on the full SIMD width. After thinking a bit, the actual issue is that element-widths are the thing arithmetic needs to happen on -- you can come up with whatever algorithm you like for deciding what full simd width to use, as long as the element widths take priority.
(In reply to Jacob Lifshay from comment #112)
> After thinking a bit, the actual issue is that element-widths are the thing
> arithmetic needs to happen on

yes, good, you understand now.

> -- you can come up with whatever algorithm you
> like for deciding what full simd width to use,
> as long as the element widths
> take priority.

(that algorithm is: layout())

there are actually two separate and distinct cases.

1) the addg6s example you gave (which is the same as the rlwinm example)

2) the times where the whole SIMD Signal is defined by the register file
width (64 bit fixed width)

(2) will come *in* from the definition of the regspecs.

yes the priority is definitely (1) - to be honest we can probably get away
with not having (2) for now and see how it goes.

edit: but you see now why only rudimentary arithmetic can be performed?
because the arithmetic needs to be done simultaneously on *all* vector
element widths at *all* elwidths. bitwise OR of 0x9 onto a simultaneous
vector width of 8 and 16 makes absolutely no sense.
(In reply to Jacob Lifshay from comment #112)
> After thinking a bit, the actual issue is that element-widths are the thing
> arithmetic needs to happen on -- you can come up with whatever algorithm you
> like for deciding what full simd width to use, as long as the element widths
> take priority.

likewise, with some thought, tending to agree with you, here.

the fixed_width scenario: after creating the dictionary of Vector Element
Widths (vec_el_widths in layout()) i *think* that from that point onwards
it's ok to do sums on those widths (basic ones!)

if people really really really want to do things based on fixed_width then,
well, they declare a new SimdShape.

this is based on the (preliminary) premise that if you set a fixed_width of
say 64, and combined that with part_count of {1,2,4,8} to auto-create
vec_el_widths of {64,32,16,8}, and you add say 2 to that (to create room
for carry-in and carry-out in the ADD ALU for example), you should
reasonably expect a new vec_el_widths of {64+2,32+2,16+2,8+2}, this *even
though* you initially created the SimdShape with a fixed_width.

i updated the url btw.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/?updated
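The premise above can be sketched with plain dicts (the helper name
`vec_el_widths_from_fixed` is hypothetical, for illustration only): a
fixed_width of 64 combined with part_counts {1,2,4,8} auto-creates the
element widths, and "+ 2" then applies to the *element* widths:

```python
# Hypothetical helper: auto-compute element widths from a fixed width
# and the set of partition counts (element width = fixed_width // count).
def vec_el_widths_from_fixed(fixed_width, part_counts):
    return {count: fixed_width // count for count in part_counts}

widths = vec_el_widths_from_fixed(64, {1, 2, 4, 8})
# widths == {1: 64, 2: 32, 4: 16, 8: 8}

# "+ 2" (e.g. room for carry-in/carry-out) applies per element width,
# even though the shape was originally created with a fixed_width
widened = {count: w + 2 for count, w in widths.items()}
# widened == {1: 66, 2: 34, 4: 18, 8: 10}
```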
(In reply to Luke Kenneth Casson Leighton from comment #110)
> (In reply to Jacob Lifshay from comment #87)
> > And I agree with her there, nmigen has always been scalar,
>
> not being funny: stating that you agree with whitequark is actually very
> dangerous for the project. it will be used as ammunition to sabotage what
> we are doing.
>
> > where each Value
> > is a single value, not a list of values. Basically everybody expects
> > nmigen's Value to be a scalar. If you insist on rewriting nmigen to change
> > that fundamental assumption, you'll end up with a different language,
>
> no, false. did we design totally new Power ISA Scalar v3.0B because of
> SVP64?

nope, but SVP64 instructions *are* encoded differently (they have a svp64
prefix), which is just like the renaming i'm advocating for. By renaming,
that could be something like:

    from simd_nmigen import Signal, Module, Value, Shape

> this is the exact same thing.
>
> the entirety of the nmigen language *becomes* vectorised...
> WITHOUT CHANGING THAT LANGUAGE IN ANY WAY

This only works if *every* property of our new types that nmigen inspects
(including Shape.width and Signal's inputs being `Shape.cast`ed) behaves
exactly like how nmigen expects, which isn't the case currently since you
picked Shape.width to be full-simd-width instead of element width, among
other discrepancies.

> > Well, I'm saying you're 90% of the way there, but the last 10% of
> > abstraction is the harder part...
>
> it's actually been pretty easy, just a hell of a lot of work (and to
> be honest i am annoyed that, again, you haven't helped with Cat, Repl,
> Part, or any of the recent additions, which i asked you to help with)

well, actually, I wrote Cat and Repl implementations (in simd_signal); i
think they just got kinda ignored, or were written shortly after you wrote
Cat and Repl, i can't recall.

> there's over THIRTY operators.
> i really needed your help in implementing
> them and you have instead spent considerable time and resources fighting
> every single step of the way to *not* help. i do not know why.
> leaving that aside for you to think about

I'm trying to help, but part of helping is trying to get the API design
correct before we spend even more work baking an API into all our code
that ends up being changed because it didn't actually work as originally
designed...

> > there are two REALLY complex ones:
> > * Array (and ArrayProxy).
> > * ast.Switch aka SimdSignal.__Switch__.
>
> i did the analysis, under bug #458, and whoo, even just the analysis
> takes a hell of a lot of explaining.
>
> unfortunately because ast.Switch is the basis of m.If, m.FSM and
> m.Switch, i can't provide "full proof" until it's actually 100%
> implemented.

I said I would implement and I will...once we decide which API it'll be
based on (hence arguing in here about the API). We're 99% of the way
through deciding API.

While SimdSignal itself will probably be faster to write, actually *using*
it, as it was designed, would be waay more complex, because it does change
the semantics of how nmigen operates, such as requiring the shape input to
Signal to be an int or dict of ints, whereas all our code expects to be
able to pass Shape-castable types to Signal, example:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/fsm.py;h=1b22ca6f3f145f58e547451f496106e07bcc188d;hb=HEAD#l102

If we add some lines to layout/SimdLayout, then it automatically handles
*all* Shape-castable inputs:

    class SimdLayout(Shape):
        def __init__(self, lane_shapes=1, ...):
            # Signal() gives shape == unsigned(1), so we match that.
            # We use lane_shapes=signed(...) rather than a `signed` argument.
            ...
            if isinstance(lane_shapes, SimdLayout):
                # next two lines should probably be factored out
                # into a lane_shapes property
                s = lane_shapes.signed
                lane_shapes = SimdMap.map(lambda v: Shape(v, s),
                                          lane_shapes.lane_widths)
            else:
                lane_shapes = SimdMap.map(Shape.cast, lane_shapes)
            lane_widths = {}
            signed = None
            for i in vec_el_counts:
                shape = lane_shapes[i]
                if signed is None:
                    signed = shape.signed
                else:
                    assert signed == shape.signed
                lane_widths[i] = shape.width
            assert signed is not None
            # rest of code uses lane_widths, not lane_shapes
            ...

> if you would like to rectify that you can help implement the
> __Slice__ override.

I can work on slicing.

> > > it's easier and more consistent to just have a separate
> > API that mirrors nmigen's API, allowing unmodified nmigen to work with our
> > code, as well as making our code much clearer. This wouldn't require that
> > much effort, since we can use most of the existing PartitionedSignal code
> > with minor modifications and renaming.
>
> it results in a massive duplication and maintenance burden. please
> read the RFC, i documented clearly why what you advocate would be
> disastrous in multiple ways.

yeah, now i can see how it could have a big maintenance burden, however
imho forking nmigen may have a bigger maintenance burden.

> last section:
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

Replying to wiki concerns:

> All of these ideas, unfortunately, are extremely costly in many different
> ways:
>
> 1. Any overrides of any kind give a catastrophically fatal impression
> that the high-level language behaviour might in some tiny way be
> different, purely because of the very *existence* of an override class.

I think that won't be much of a problem if we specifically state throughout
our docs that our API is intended to be equivalent to nmigen. This is
similar to how pypy provides extra features over cpython, yet implements
the exact same language. No one thinks that pypy's existence means that
cpython is useless...

> 2.
Proponents of such override mechanisms (and there have been many) > fundamentally misunderstand the distinction between Type 1 (AST) > low-level > and Type 2 (dsl.Module) high-level nmigen language constructs, and > miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST > (Type 2) constructs. you mean implemented in AST (Type 1)? No, I understand your point about that completely, I just think that your point isn't sufficient justification for permanently forking nmigen. > 3. Wrapper classes are a maintenance nightmare. Unless included in the > library itself, any update to the library being wrapped is a huge risk > of breaking the wrapper. well, if we design the wrappers such that we only use the public API of nmigen, nmigen's API stability prevents us from needing to rewrite our code -- it is the exact same stability that prevents *all other* nmigen users from needing to rewrite their code when nmigen is updated. Wrappers, if properly designed, don't ever go in and mess with nmigen's internals, they only provide an API that looks like nmigen. > Long-term this is completely unsustainable. > Likewise monkey-patching. Monkey-patching is messing with nmigen's internals and I never advocated for it. > 4. Wrapper classes introduce huge performance degradation. Every function > requires one additional function call. Every object instantiated > requires > both the wrapper class to be instantiated and the object being wrapped, > every single time. Is this really such a performance sink? if so, it makes me think using Python was a bad choice...Rust has no issues with wrappers, in fact, it has a special type layout specifically intended for wrappers: `#[repr(transparent)]`. Rust totally lives up to the zero-cost abstractions promise. > A single wrapper class is never enough: the entire > class hierarchy, everything that is ever instantiated by a "wrapped" > class instance, also requires wrapping. 
Well, since we're only using SIMD in ALUs, we only need to wrap nmigen's core API, which is only 5 or 6 classes (ignoring the ones we're overriding anyway) -- I did go through the whole public api of nmigen.hdl and check them all. Class list (additional classes we'd need to override): Value, Const, Statement, Assign, Module, and maybe Record > This quickly > gets completely out of hand and diverts developer time into a > nightmare time-sink that has actively harmful consequences *even > once completed*. > Memory requirements double; performance degrades. I highly doubt memory requirements are anything to worry about, i'd expect it to be on the order of 100MB or 1GB at the most. > Both are unacceptable. I think that's exaggerating... > 5. "Explicit coding" is actually more costly even than for-loop > code duplication. It is "The Right Thing (tm)" in order to achieve > optimal gate-level design but requires hyper-aware and hyper-intelligent > developers, whom, if ever lost, take knowledge of the internal workings > of ultra-complex code with them. Agreed. > 6. "Low-level classes" have in fact already been implemented: over a > dozen modules utilised and combined behind PartitionedSignal exist > and are already in use. PartitionedMux was one of the very first > SIMD-aware > "Type 1" AST operators written. The problem is that it's nowhere near > enough. Conversion of 100,000 lines of readable nmigen HDL using > Type 2 (m.If/Elif) high-level constructs to **not** use those > constructs and return to AST Mux, Cat, and so on is a completely > unreasonable expectation. Agreed, though I'll admit that converting everything to use Mux was my original plan a year ago. > 7. "Replacements for dsl.Module" aside from coming with the catastrophic > implication that they provide alternative behaviour from dsl.Module > are a heavy maintenance burden. 
> Not only that but there is the risk
> that, unless actually included *in* nmigen, the underlying AST modules
> that they use might change, or be deprecated, or have bugs fixed which
> break the replacement.

I addressed those points above.

> 8. "Compilers". It is very well known that any python program that outputs
> another python program as an ASCII output may then immediately read that
> ASCII output from within the very program that created it, `eval` it,
> and then execute it. That being the case, you might as well skip the
> output to ASCII and the eval part, and create classes that
> dynamically and directly instantiate the desired target behaviour.
> This saves months of effort but is a common mistake, frequently made.

well, compilers do have the benefit of being waay easier to debug, since
you can see their whole output... the alternative (just creating classes
at runtime) is waay more opaque.

Back to bugzilla message:

> > > > whereas
> > > > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
> > >
> > > yes. and therefore the entire suite of operators of SimdSignal can be
> > > *defined* to act in list-form, at every single level in the entirety
> > > of ast.* operations.
> >
> > I'm saying the existing design only does a 90% job, it covers list-form +,
> > -, *, &, ^, |, .eq, etc.
> >
> > It does *not* cover Shape.width, Signal.like, etc.
> >
> > I'm saying we should finish off getting to 100% of the API looking like a
> > transparently vectorized (aka SIMT)
>
> SIMT i keep telling you is the wrong word to use. you mean SIMD

I mean SIMT, not SIMD. I'm explicitly referring to what the code looks
like when compared to what SIMT/SIMD looks like from a SW programmer's
perspective. C++ SIMD code looks like this:

    void f(f32x4 a, f32x4 b) {
        f32x4 c = a + b; // element-wise add -- looks same, good!
        static_assert(sizeof(a) == 16); // size of whole vector -- different, bad
        u32x4 bits = (u32x4)a; // bitcast -- different, bad
        // fp->int conversion -- different, bad
        u32x4 values = __builtin_convertvector(a, u32x4);
        i32x4 gt = a > b; // result is not bool, bad
        // mux -- can't use `if`, bad
        f32x4 max = (f32x4)((i32x4)a & gt | (i32x4)b & ~gt);
    }

Notice how all types are separate wide SIMD types and operations often do
something other than the element-wise version of what scalar code would do.

C++ SIMT code looks like this:

    void f(float a, float b) {
        float c = a + b; // element-wise add
        static_assert(sizeof(a) == 4); // size of element
        uint32_t bits = bit_cast<uint32_t>(a); // bitcast
        uint32_t values = (uint32_t)a; // fp->int conversion
        bool gt = a > b;
        float max = b;
        if(gt)
            max = a;
    }

Notice how all types are element types, operations always operate
element-wise and look exactly like scalar code, and control-flow
constructs can be used.

> > > version of nmigen
> > > Shape/Signal/Value/etc.,
>
> jacob: please listen to what i am telling you. what you envisage is EXACTLY
> what is being designed.

good!

> > just with different names, since that will make it
> > much easier to use/understand. Note that Shape.width (lane width) not being
> > a plain integer comes from lanes being XLEN-width, not from being
> > vectorized. those are orthogonal unrelated changes. We could have a totally
> > scalar Signal/Value/etc. that are all XLEN-ified, that would require
> > modifying Shape.width to be multivalued.
>
> yes. that is what lane_shapes is for.

I'm saying it's specifically Shape.width that should be multi-valued,
because Shape.width is part of the nmigen API so it should be the SIMT
element type's width, not the full-SIMD-vector's width. lane_shapes,
because it's not part of the nmigen API, can be whatever we choose.
(In reply to Luke Kenneth Casson Leighton from comment #113)
> (In reply to Jacob Lifshay from comment #112)
>
> > After thinking a bit, the actual issue is that element-widths are the thing
> > arithmetic needs to happen on
>
> yes, good, you understand now.

I've always understood that part, I've just been trying to tell you that
cuz it appeared as though you were advocating for stuff to work on the full
simd width and then calculate element widths based on that.

> > -- you can come up with whatever algorithm you
> > like for deciding what full simd width to use,
> > as long as the element widths
> > take priority.
>
> (that algorithm is: layout())
>
> there are actually two separate and distinct cases.
>
> 1) the addg6s example you gave (which is the same as the rlwinm example)
>
> 2) the times where the whole SIMD Signal is defined by the register file
> width (64 bit fixed width)

I've been planning on register i/o signals just being set to XLEN, since
that gives a 64-bit simd signal when vec_el_counts is set up for 64-bit
vectors; it also automatically adapts if we want a 32-bit cpu, by setting
vec_el_counts to:

    vec_el_counts = {
        IntElWid.I32: 1,
        IntElWid.I16: 2,
        IntElWid.I8: 4,
    }

> edit: but you see now, why because the arithmetic needs to be done
> simultaneously on *all* vector element widths at *all* elwidths,
> only rudimentary arithmetic can be performed?

that's not actually the case...

> bitwise OR of 0x9 onto a simultaneous vector width of 8 and 16 makes
> absolutely no sense.

There aren't many uses, but it still makes sense for XLEN | 9:

    elwid=8:  8 | 9 = 9
    elwid=16: 16 | 9 = 25
    elwid=32: 32 | 9 = 41
    elwid=64: 64 | 9 = 73

Example use: `some_shape | 0x7` to round it up to 1 less than a multiple
of 8.

Also, when implementing __add__ and friends, don't forget cases like:
XLEN + XLEN % 2
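The element-wise bitwise-OR described above can be sketched on a plain dict
stand-in for a per-elwid width map (`elwid_or` is a hypothetical helper; the
real code would overload `__or__` on the SIMD shape class):

```python
# Hypothetical helper: apply a bitwise OR per element width,
# one result per elwid setting.
def elwid_or(widths, const):
    return {elwid: w | const for elwid, w in widths.items()}

XLEN = {8: 8, 16: 16, 32: 32, 64: 64}
assert elwid_or(XLEN, 9) == {8: 9, 16: 25, 32: 41, 64: 73}

# the `| 0x7` example: rounds each width up to 1 less than a multiple of 8
assert elwid_or(XLEN, 0x7) == {8: 15, 16: 23, 32: 39, 64: 71}
```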
Other cases where XLEN should probably be something separate from a
SimdShape:

XLEN - pop_count(...)

log2(XLEN)

ceil-div (temporarily has negative values; i expect layout to not work
with negative inputs):
a = -(-b // 5)
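Evaluated per elwid on a plain dict stand-in (hypothetical `elwise` helper,
for illustration only), the log2 and ceil-div cases above look like this --
note the ceil-div goes through a negative intermediate value, which is why
it would not survive a layout() that rejects negative widths:

```python
import math

# hypothetical element-wise-apply helper over per-elwid width dicts
def elwise(f, *maps):
    return {k: f(*(m[k] for m in maps)) for k in maps[0]}

XLEN = {8: 8, 16: 16, 32: 32, 64: 64}

log2_xlen = elwise(lambda w: int(math.log2(w)), XLEN)
# {8: 3, 16: 4, 32: 5, 64: 6}

# ceil-div by 5 via the -(-b // 5) idiom: -b is negative mid-expression
ceil_div5 = elwise(lambda w: -(-w // 5), XLEN)
# {8: 2, 16: 4, 32: 7, 64: 13}
```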
(In reply to Jacob Lifshay from comment #117)
> Other cases where XLEN should probably be something separate than a
> SimdShape:
> XLEN - pop_count(...)
>
> log2(XLEN)
>
> ceil-div:
> (temporarily has negative values, i expect layout to not work with negative
> inputs)
> a = -(-b // 5)

    exponent_bias = (1 << (fp_exponent_sizes[XLEN] - 1)) - 1
    unbiased_exponent = Signal(signed(fp_exponent_sizes[XLEN] + 1))
    biased_exponent = Signal(fp_exponent_sizes[XLEN])
    biased_exponent.eq(exponent_bias + unbiased_exponent)
(In reply to Jacob Lifshay from comment #108)
> (In reply to Luke Kenneth Casson Leighton from comment #107)
> > (In reply to Jacob Lifshay from comment #106)
> > > I read through all the new layout() code, looks mostly fine, though afaict
> > > the computation of bmask is incorrect if the object was to figure out which
> > > partitions are always padding.
> >
> > yes, i can't quite work it out. if you can fix that, these algorithmic
> > things do my head in.
>
> yeah, I can do that tomorrow (actually today, but you get what I meant).

Done. I changed bmask to be such that bmask & (1 << i) is set if partition
index i is always padding. Note that there is one more partition than the
number of ppoints, since ppoints separate partitions rather than being
partitions themselves.

Demo manually worked out in comments:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=6c32022ec750433ff5adca21f4345cfca51181e2;hb=HEAD#l324

# combined with vec_el_counts {0:1, 1:1, 2:2, 3:4} we have:
#     elwidth=0b00 1x 24-bit
#     elwidth=0b01 1x 12-bit
#     elwidth=0b10 2x 5-bit
#     elwidth=0b11 4x 6-bit
#
# bmask<--------1<----0<---------10<---0<-------1<0<----0<---0<----00<---0
# always unused:| | | || | | | | | | || |
# 1111111111000000 1111111111000000 1111111100000000 0000000000000000
# | | | || | | | | | | || |
# 0b00 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx........ ..............24|
# 0b01 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxx..........12|
# 0b10 xxxxxxxxxxxxxxxx xxxxxxxxxxx....5|xxxxxxxxxxxxxxxx xxxxxxxxxxx....5|
# 0b11 xxxxxxxxxx.....6|xxxxxxxxxx.....6|xxxxxxxxxx.....6|xxxxxxxxxx.....6|
#      ^ ^ ^^ ^ ^ ^ ^ ^ ^^
# ppoints: | | || | | | | | ||
#      | bit-48 /\ | bit-24-/ | | bit-12 /\-bit-5
#      bit-54 bit-38-/ \ bit-32 | bit-16 /
#      bit-37 bit-22 bit-6

So, in that example bmask = 0b101001000000
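The bmask semantics just described can be sketched in a few lines of plain
Python. The input format here (a dict mapping each elwidth to a
per-partition "in use" list) is hypothetical, chosen for clarity; the real
computation lives in layout_experiment.py:

```python
# Sketch: bit i of bmask is set when partition index i is *always*
# padding, i.e. unused at every elwidth setting.
def compute_bmask(used_by_elwidth):
    # used_by_elwidth: {elwidth: [bool per partition index], ...}
    npart = len(next(iter(used_by_elwidth.values())))
    bmask = 0
    for i in range(npart):
        if all(not used[i] for used in used_by_elwidth.values()):
            bmask |= 1 << i
    return bmask
```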
(In reply to Jacob Lifshay from comment #119) > Done. I changed bmask to be such that bmask & (1 << i) is set if partition > index i is always padding. Note that there is one more partition than the > number of ppoints, since ppoints separate partitions, not are partitions. > > Demo manually worked out in comments: > https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/ > layout_experiment.py;h=6c32022ec750433ff5adca21f4345cfca51181e2;hb=HEAD#l324 darn it closed the tab by mistake lost a reply. try again. this is really good. saved several days, i would have taken that long to work this out. thank you. the demo / test is really clear, shows what is going on. also i noticed you did incremental commits, really appreciated. each one took like only 30 seconds to review, really easy to understand. i am under enormous pressure at the moment so this was important. Slice and Part next. will find the bugreport. (edit: bug #716)
context: this is starting to get complicated so i am moving the discussion
back to the bugtracker
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-October/003972.html

i just put in some unused code (in advance of setting up the examples in
the URL https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/)
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=417f7f93bf3baf576e537cde32f118b2a50eab46

it's over 50% comments because it's... complicated (dual mathematics rules,
PRIORITY_FIXED, PRIORITY_ELWID).

basically: when both LHS and RHS are PRIORITY_FIXED (both had a lane_shapes
of None), then it should be obvious that the "Correct Thing To Do" is just
to multiply the two fixed_widths together and create a SimdShape out of
that.

why? because *both* LHS and RHS were established (created) with the
requirement that "elwidths shall be auto-computed to fill the entirety of
the partition, as determined by overall width and the vec_el_counts,
regardless of what elwid is set to".

on the other hand, when *either* LHS or RHS have had an elwidths argument
given, it is the elwidths that have to be preserved (multiplied)
individually:

    result.lane_shapes[0] = LHS.lane_shapes[0] * RHS.lane_shapes[0]
    result.lane_shapes[1] = LHS.lane_shapes[1] * RHS.lane_shapes[1]
    ...
    ...

the only case i'm not sure about is PRIORITY_BOTH, there. i don't believe
it's safe for the return result to be PRIORITY_BOTH, because you could have
LHS[0] = 1000, RHS[0] = 1, and LHS[1] = 1, RHS[1] = 999, and a fixed_width
doesn't work out, there.
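The two multiply rules described above can be sketched with plain tuples
standing in for SimdShape. The representation here -- (priority,
fixed_width, lane_shapes dict), with lane_shapes assumed available whenever
either side is PRIORITY_ELWID -- is hypothetical, a sketch of the rule
rather than the real commit:

```python
# hypothetical priority markers standing in for the real constants
PRIORITY_FIXED = "fixed"
PRIORITY_ELWID = "elwid"

def mul_shapes(lhs, rhs):
    # lhs/rhs: (priority, fixed_width, lane_shapes dict or None)
    lp, l_fixed, l_lanes = lhs
    rp, r_fixed, r_lanes = rhs
    if lp == PRIORITY_FIXED and rp == PRIORITY_FIXED:
        # both auto-computed from fixed_width: multiply the full widths
        # and let layout() recompute the element widths
        return (PRIORITY_FIXED, l_fixed * r_fixed, None)
    # otherwise the elwidths take priority: multiply lane-by-lane
    lanes = {k: l_lanes[k] * r_lanes[k] for k in l_lanes}
    return (PRIORITY_ELWID, None, lanes)
```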