in https://bugs.libre-soc.org/show_bug.cgi?id=708#c4 the idea was raised to allow the length of signals within each partition to be *less* than the full length that each partition is actually capable of, and for the length limit to be dynamically specifiable depending on the runtime partition settings. for example:

* a PartitionedSignal of size 32
* four partitions allowed (8-8-8-8)
* when the partition lengths are set to e.g. 16-16 the actual signals are restricted to 8-8, with the upper 8 bits of each 16-16 being ignored
* when the partitions are set to e.g. 24-8 the actual signals are restricted to 16-4

these alternative lengths are to be *simultaneously* supported at runtime. whilst in many cases "workarounds" can be done which mitigate the need for this as a primary feature, certain circumstances, such as IEEE754 FP exponent and mantissa being strange non-partition-aligning values depending on the partition settings, make this challenging to work around. this therefore needs a proper evaluation, including estimating the time taken to design, write unit tests and Formal Correctness Proofs, as well as evaluating whether it is possible to cleanly adapt the existing PartitionedSignal. Documented here: https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
i gave this some thought and i believe it should be reasonably straightforward to adapt PartitionedSignal and the 15 or so specialist modules behind it to cope with adaptive Signal truncation.

the more recent ones (Cat, Assign) should be easy to enhance because they loop through all combinations available to the mask and create a switch statement. truncating and zeroing the output from each lane, in each of those switch statements, is therefore pretty straightforward: it can be handled by a function which performs the truncation. the required truncation width is very straightforward to look up on a per-switch basis. concatenation of the post-truncated partial results is also simple, exactly as is done already.

the earlier ones (eq gt le shift) which have already been optimised and are based around a gate-level design principle will take a bit more thought and effort.

based on an approximate average of:

* half a day per code-change (to add to the specialised module)
* half a day for the unit test
* half a day for the Formal Proof

and there being around 20 Type I (Ast) "operators":

* shift add sub neg
* or not and xor
* eq gt lt
* Assign Cat Part Switch Repl
* bool all some any xor

the time estimated for this task is around 35 days. some of these will take less (and, or, not), some a lot longer (shift/rot) due to their already being optimised, and being particularly complex (3 weeks) to originally analyse.
in the existing (regularly-allocated, fully-allocated) design the partitions do not need special treatment. within a given partition the partial result is always computed, and from that an optimisation has been possible to deploy at the gate level which a naive "switch" on the mask could not possibly hope to replicate. a variable length per partition breaks that... *unless* a pre-analysis phase is carried out which splits out the high-level "required" partitioning from a *lower-level* partitioning.

settings:

* all 32 and 16-bit values are actually to be truncated to 11 bit
* all 8-bit values to 5-bit

from these we can write out:

            31        16 15       8 7      0
    32bit   |10 ....                     0 |
    16bit   |26 ... 16 |  |10 ....      0 |
    8bit    | 28.24 | 20.16 | 12..8 | 4.. 0 |

thus, we deduce, we *actually* need breakpoints at:

    28 26 24 16 12 8 4 0

and when 32bit is set:

* 10 is closed
* 11-31 "masked to zero" (clock gated, in future)

when 16bit:

* 10 is closed
* 16 is closed
* 26 is closed
* 10-15 is zeromasked / clock-gated
* 27-31 is zeromasked / clock-gated

when 8bit:

* 4 is closed
* 8 is closed
* 12 is closed
* 16 is closed
* 24 is closed
* 28 is closed

which is a lot more partitions, and a lot more work, but provides an implementation *in terms of existing code* with very little disruption. it's also very straightforward. reconstruction of results is also easily done by leveraging the existing optimised code: it is just that the partition sizes are non-uniform lengths. a "map" function can easily take care of converting the incoming spec (required lengths) to actual partition sizes and an "active" mask, which, interestingly, can hook directly into the planned "Assign" conditional (predicated) assignment.

some care needs to be taken to ensure that the inputs are not too heavily MUXed, creating long gate delays. at the moment all inputs are unconditional, and the selection is done on the output. Cat will need some more thought, to see if the irregular size can be dealt with.
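the pre-analysis step above can be sketched in plain python. this is a hypothetical helper (the name layout_breakpoints and the exact API are made up for illustration, not existing code), and it follows the comment's convention of naming each "used-field" breakpoint by the MSB of the truncated value (e.g. 10 for an 11-bit field at bits 0..10):

```python
def layout_breakpoints(total_width, spec):
    """For each {partition_width: used_width} entry, collect the
    breakpoints that width needs: each aligned lane boundary, plus
    the MSB of the truncated (used) field inside each lane.  The
    union across all widths is the lower-level partitioning the
    existing gate-level code must be given."""
    per_width = {}
    for pwidth, used in spec.items():
        points = set()
        for start in range(0, total_width, pwidth):
            if start:                       # aligned lane boundary
                points.add(start)
            if used < pwidth:               # MSB of the used field
                points.add(start + used - 1)
        per_width[pwidth] = points
    union = sorted(set().union(*per_width.values()))
    return per_width, union
```

running it on the settings above (32-bit signal, 32/16-bit values truncated to 11 bits, 8-bit to 5 bits) reproduces the per-width "closed" points listed in the comment; the union is simply their merge.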
in particular, construction of the "size spec" for the output may be a bit hairy. each possible length combination under each partition mask value will have to be calculated.
PartitionedSignal.shape() is a good function to use as the definition of the lengths. question: is a PartitionedShape necessary, and what are the implications if it is derived from ast.Shape()?
insight: initially thought, "why have gaps at all?" why not have the alternative-length parts back-to-back. Spec:

* 32bit -> 11 bit
* 16bit -> 2x 10 bit
* 8bit -> 4x 8 bit

Actual layout:

* 32 bit: 10..0
* 16 bit: 19..10 plus 9..0
* 8 bit: 31..24 23..16 15..8 7..0

no. it has to be:

* 32 bit: 10..0
* 16 bit: 25..16 9..0
* 8 bit: 31..24 23..16 15..8 7..0

because of alignment when "interacting" as the input/output involving other (regularly-sized) partitioned signals. without the alignment to regular boundaries you actually need special conditional partition-aware routing, which is very awkward. alignment cleans that up, but has to have "gaps" as a result.
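the difference between the two layouts can be shown with a tiny illustrative helper (hypothetical name aligned_lanes, not existing code): it returns the (lsb, msb) bit-range of each lane's used field, first with lanes aligned to regular lane boundaries (leaving padding gaps), then packed back-to-back:

```python
def aligned_lanes(total_width, lane_width, used_width):
    """Compare aligned (gapped) vs back-to-back (packed) placement
    of used_width-bit fields inside lane_width-bit lanes."""
    # aligned: each used field starts on a regular lane boundary
    aligned = [(s, s + used_width - 1)
               for s in range(0, total_width, lane_width)]
    # packed: fields are butted up against each other, no gaps
    packed = [(i * used_width, (i + 1) * used_width - 1)
              for i in range(total_width // lane_width)]
    return aligned, packed
```

for the 16-bit case above (two 10-bit values in a 32-bit signal) this gives 25..16 and 9..0 aligned, versus 19..10 and 9..0 packed, exactly the two layouts compared in the comment.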
to make sure this thought is written down: i realised a few days ago, some of the Case/Switch submodules for PartitionedSignal, particularly Cat, Part and Assign, are approaching the gate density of full crossbars. that isn't going to fly. a way to reduce the number of required entries is needed: the option to specify what combinations of mask are supported.
I already wrote code to fully determine the layout needed for partitioned signals including padding (SimdLayout), as well as separating out the actual signals determining which partitions are enabled across a whole SIMD circuit (SimdPartMode) -- making it easy to convert layouts and connect signals with different layouts but that are still partitioned in the same abstract way -- essentially creating a specific Python class whose instances represent compatible categories of layouts. It also has code for representing the different ways padding can be filled (SimdPaddingKind) -- it probably should be merged into SimdLayout.

https://salsa.debian.org/Kazan-team/simd_signal

See TestSimdPartMode and TestSimdLayout in test_simd_signal.py for some examples. All that is needed is to adapt code so SimdLayout and SimdPartMode work with PartitionedSignal.
(In reply to Luke Kenneth Casson Leighton from comment #5)
> to make sure this thought is written down: i realised a few days ago,
> some of the Case/Switch submodules for PartitionedSignal, particularly
> Cat, Part and Assign, are approaching the gate density of full crossbars.
> that isn't going to fly.
>
> a way to reduce the number of required entries is needed: the option
> to specify what combinations of mask are supported.

SimdPartMode partially solves this problem by explicitly reducing the set of supported layouts to those that you can produce by a buddy allocator. this removes the less useful lane sizes (that we realistically weren't going to ever use anyway) and retains the more useful ones. example table of supported and excluded layouts:
https://salsa.debian.org/Kazan-team/simd_signal/-/blob/b77b188c51c91c3ddfdd6f41c9e22e0bfa5d41e3/src/tests/test_simd_signal.py#L19
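the buddy-allocator restriction can be sketched as a small recursive enumeration. this is an assumed model of the idea (the name buddy_layouts and the tuple representation are made up for illustration, not the SimdPartMode implementation): a region of 2^n parts is either one whole lane, or two independently-subdivided halves, so layouts like 8-16-8 that straddle the midpoint are excluded:

```python
def buddy_layouts(parts):
    """Enumerate lane layouts (as tuples of lane sizes, in parts)
    that a buddy allocator can produce over `parts` equal parts
    (parts must be a power of 2)."""
    if parts == 1:
        return [(1,)]
    halves = buddy_layouts(parts // 2)
    results = [(parts,)]            # the whole region as one lane
    for left in halves:             # or: split, each half recursing
        for right in halves:
            results.append(left + right)
    return results
```

for 4 parts this yields 5 layouts, and a "middle-straddling" layout such as (1, 2, 1) is correctly absent, which is the point of the restriction: fewer mask combinations, hence fewer switch cases.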
(In reply to Jacob Lifshay from comment #6)
> I already wrote code to fully determine the layout needed for partitioned
> signals including padding (SimdLayout),

ok good, because this moves out of the area originally for which PS was designed.

> All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> PartitionedSignal.

excellent, that would be great.

i am having a bit of a hard time understanding SimdLayout, etc., can you cut the type information, all "casting", they are extraneous lines, and add some docstrings explaining the purpose of the main functions?

my feeling is, it should be possible to return an object from the standard override of Value.shape() and use that. in most cases shape() is used to determine the signedness. if that doesn't pan out it can be adapted.

a way to fit with PartitionPoints is needed: an adapter that creates the dict that goes straight into PartitionPoints yet can also be used to return the list from which the case statements can be constructed. will find an example... there is too much code to do intrusive rewrites btw. ok here we go:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cat/cat.py;hb=HEAD#l100

that's dirt simple. the switch is on the mask (extracted from PPoints), and it enumerates all possible values. *without* being massively complex, we need something that *does not* REPLACE PartitionPoints but feeds input *TO* PartitionPoints, then when enumerated, instead of creating a full set of all combinations of cases, only creates those actually needed. or, adds a simple function *to* PartitionPoints which returns the options.

bear in mind that some of the code cannot be adapted to *not* generate full options: it doesn't work by switch/case, it's gate-level. these submodules all take the same mask bits, hence why it is necessary to fit with straight mask bits and keep PartitionPoints, not try ripping things apart. *careful* compatibility and *bare minimum* nondisruptive changes.
but first please can you take out type info and add docstrings so i can see what is going on with SimdLane
going back to what is needed:

* elwidth as input (or an enum, or string)
* for each elwidth, an actual width (power of 2) (or the number of subdivisions)
* and a (non-power-of-2) "use this many bits" width

this suggests a dict as the spec. mantissa:

    { 0b00 : (64, 53), # FP64
      0b01 : (32, 23), # FP32x2
      0b10 : (16, 10), # FP16x4
      0b11 : (16, 5),  # BF16x4
    }

no types, no classes. turning that into PartitionPoints, a simple function creates a dict and a mask Signal. this i would expect to be about... 30 lines. a second function, assuming an Enum Signal (or an elwidth Signal of 2 bits), would return a list of comb statements that:

* accept the enum as input
* perform a switch
* set the relevant mask bits

that would again be no more than about... 25 lines. could you do a quick hack prototype of the first of those functions? take the dict as input (and a width parameter) and return the partition points dictionary?
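a rough plain-python sketch of what that first function could compute (the name spec_to_partpoints is hypothetical, and sets of elwidths stand in for the mask Signal bits so the sketch runs without nmigen; the convention here is that partition point N sits between bits N-1 and N):

```python
def spec_to_partpoints(spec, width):
    """Take a {elwidth: (lane_width, used_width)} spec plus the
    overall signal width, and return {bit_position: set_of_elwidths}:
    for each partition-point position, which elwidth settings need
    that point "closed" (an aligned lane boundary, or the end of the
    truncated used field within a lane)."""
    points = {}
    for elwidth, (lane_width, used_width) in spec.items():
        for start in range(0, width, lane_width):
            if start:                     # aligned lane boundary
                points.setdefault(start, set()).add(elwidth)
            if used_width < lane_width:   # end of the used field
                points.setdefault(start + used_width, set()).add(elwidth)
    return dict(sorted(points.items()))
```

fed the mantissa spec above with width=64, point 32 is needed by the FP32x2, FP16x4 and BF16x4 settings, while point 53 is shared by FP64 (its 53-bit mantissa) and BF16x4 (48+5).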
(In reply to Luke Kenneth Casson Leighton from comment #9)
> going back to what is needed:
>
> * elwidth as input (or, an enum, or string)
> * for each elwidth, an actual width (power of 2)
>   (or, the number of subdivision)
> * and a (nonp2) "use this many bits" width
>
> this suggests a dict as the spec. mantissa:
>
> { 0b00 : (64, 53), # FP64
>   0b01 : (32, 23), # FP32x2
>   0b10 : (16, 10), # FP16
>   0b11 : (16, 5),  # BF16
> }
>
> no types, no classes.

Oh, what I have as types that can be SimdLayout.cast (like nmigen Shape.cast) are 1 of 3 options:

1. dict-like types that map lane-size-in-abstract-parts to Shape-castable values. This (after Shape.cast-ing all dict values) is the canonical internal representation of a layout (stored in self.lane_shapes). example:

    { # keys are always powers of 2
      1: 5,              # 1-part lanes are u5
      2: unsigned(3),    # 2-part lanes are u3
      4: range(3, 25),   # 4-part lanes are u5 since that fits 24
      8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
    }

or:

    { # keys are always powers of 2
      1: signed(1),         # 1-part lanes are i1
      2: signed(3),         # 2-part lanes are i3
      4: range(-30, 25),    # 4-part lanes are i6 since that fits -30
      8: MySignedBoolEnum,  # 8-part lanes are i1
      16: signed(0),        # 16-part lanes are i0, zero-bit shapes are supported
    }

2. callables that take lane-size-in-abstract-parts and return Shape-castable values. Useful for making non-uniform layouts quickly. example:

    lambda lane_size: signed(lane_size * 8)

dict equivalent (for part_count == 8):

    { 1: signed(8),
      2: signed(16),
      4: signed(32),
      8: signed(64),
    }

3. a Shape-castable value that is used for all lane-sizes. Useful for making uniform layouts quickly.
example:

    5

dict equivalent (for part_count == 8):

    { 1: unsigned(5),
      2: unsigned(5),
      4: unsigned(5),
      8: unsigned(5),
    }

From that dict-equivalent (stored as self.lane_shapes), it computes the minimum number of bits needed for each abstract part required to pack those chosen lane shapes into their abstract parts -- the bits needed per abstract part is self.padded_bits_per_part (same value for all parts). From that, member functions can calculate the total required signal width in bits (self.width), and the unpadded/padded slices for each lane (self.lane_bits_slice(lane, include_padding=False/True)).

> turning that into PartitionPoints, a simple
> function creates a dict and a mask Signal.

PartitionPoints needs to be revamped to carry all the necessary information -- that's what SimdLayout is. The mask signal is wrapped by SimdPartMode, which provides all the necessary introspection and utility functions to make it easy to use.
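the three input kinds can be modelled in a few lines of plain python. this is a simplified stand-in, not the actual SimdLayout code: Shape here is a namedtuple imitating nmigen's Shape, and shape_cast only handles ints and ranges (nmigen's real Shape.cast additionally handles Enums):

```python
from collections import namedtuple

# minimal stand-in for nmigen's Shape, just for illustration
Shape = namedtuple("Shape", "width signed")

def _bits(v, signed):
    # bits needed for v, unsigned or two's-complement signed
    if not signed:
        return v.bit_length()
    return v.bit_length() + 1 if v >= 0 else (-v - 1).bit_length() + 1

def shape_cast(obj):
    # simplified model of nmigen Shape.cast (ints and ranges only)
    if isinstance(obj, Shape):
        return obj
    if isinstance(obj, range):
        lo, hi = min(obj), max(obj)
        signed = lo < 0
        return Shape(max(_bits(lo, signed), _bits(hi, signed)), signed)
    return Shape(obj, False)      # a plain int is an unsigned width

def simd_layout_cast(arg, part_count=8):
    """The 3 accepted input kinds described above: a dict mapping
    lane-size-in-parts to castable values, a callable from lane
    size to a castable value, or one castable used for all sizes."""
    sizes = [1 << i for i in range(part_count.bit_length())]
    if isinstance(arg, dict):
        return {k: shape_cast(v) for k, v in arg.items()}
    if callable(arg):
        return {s: shape_cast(arg(s)) for s in sizes}
    return {s: shape_cast(arg) for s in sizes}
```

with this model, range(3, 25) casts to a u5 (fits 24) and range(-30, 25) to an i6 (fits -30), matching the worked comments in the examples above.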
(In reply to Luke Kenneth Casson Leighton from comment #8)
> (In reply to Jacob Lifshay from comment #6)
> > I already wrote code to fully determine the layout needed for partitioned
> > signals including padding (SimdLayout),
>
> ok good, because this moves out of the area originally for which PS
> was designed.
>
> > All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> > PartitionedSignal.
>
> excellent, that would be great.
>
> i am having a bit of a hard time understanding SimdLayout,
> etc., can you cut the type information, all "casting", they are
> extraneous lines, and add some docstrings explaining the
> purpose of the main functions?
>
> my feeling is, it should be possible to return an object
> from the standard override of Value.shape() and use that.
> in most cases shape() is used to determine the signedness.
> if that doesn't pan out it can be adapted.

I think we should use:

    class PartitionedSignal(...):
        ...
        def __init__(self, layout=1, *, ...):
            self.layout = SimdLayout.cast(layout)
        ...
        shape = None

since shape is limited by nmigen to just Shape(width, signedness) and nothing else.

> a way to fit with PartitionPoints is needed: an adapter
> that creates the dict that goes straight into Partition
> Points yet also can be used to return the list from
> which the case statements can be constructed, will find
> an example

well, unfortunately, PartitionPoints just can't really express the required padding info. assuming a layout of:

    SimdLayout({
        1: unsigned(10),
        2: unsigned(12),
        4: unsigned(15),
    })

which ends up with a 40-bit signal with lanes like so:

    bit #:       30v       20v       10v       0
           [---u10--][---u10--][---u10--][---u10--]
           [ppppppp----u12----][ppppppp----u12----]
           [pppppppppppppppppppppppp------u15-----]

There are 2 possible approaches for emulating PartitionPoints, neither good:

1.
if we have the padding as separate partitions, then the created circuits will be unnecessarily complex due to excessive partitions, with partition points at bits 10, 12, 15, 20, 30, and 32. This gets significantly worse for 8-part layouts.

2. if we have padding included in their partitions, which gives partition points of 10, 20, and 30, then many of the existing operators will calculate the wrong results, since they fail to account for padding.

> but first please can you take out type info and add docstrings
> so i can see what is going on with SimdLane

yeah, i'll add docs. the type info *is* part of the docs, just put in a standardized spot where tools can read it. after all, someone writing code needs to be able to quickly know what types a function expects, not have to look through 10(!) different files following a chain of nested function calls just to figure out how to call a particular function. That actually happened to me several times when I was trying to write the div pipe, whereas if they had docs that specified what types they expected, i would have saved a bunch of time (10s instead of 10min).
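the geometry behind the 40-bit example can be modelled in plain python. this is an assumed sketch, not the SimdLayout implementation (layout_geometry and lane_slice are invented names; self.padded_bits_per_part, self.width and self.lane_bits_slice are the real attributes the comment describes):

```python
def layout_geometry(lane_shapes, part_count):
    """From {lane_size_in_parts: bit_width}, compute the bits needed
    per abstract part (the max over ceil(width / lane_size)), the
    total padded signal width, and the unpadded (lsb, msb) slice of
    each lane at a given lane size."""
    # ceil-divide each lane's width over its parts; take the worst case
    bits_per_part = max(-(-w // size) for size, w in lane_shapes.items())
    width = bits_per_part * part_count
    def lane_slice(size, lane):
        # lane's used field starts at its aligned part boundary
        lsb = lane * size * bits_per_part
        return (lsb, lsb + lane_shapes[size] - 1)
    return bits_per_part, width, lane_slice
```

for the layout {1: 10, 2: 12, 4: 15} over 4 parts this gives 10 bits per part and a 40-bit signal, with the upper u12 lane occupying bits 31..20 (8 bits of padding above it), matching the diagram above.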
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > (In reply to Jacob Lifshay from comment #6)
> > > I already wrote code to fully determine the layout needed for partitioned
> > > signals including padding (SimdLayout),
> >
> > ok good, because this moves out of the area originally for which PS
> > was designed.
> >
> > > All that is needed is to adapt code so SimdLayout and SimdPartMode work with
> > > PartitionedSignal.
> >
> > excellent, that would be great.
> >
> > i am having a bit of a hard time understanding SimdLayout,
> > etc., can you cut the type information, all "casting", they are
> > extraneous lines, and add some docstrings explaining the
> > purpose of the main functions?
> >
> > my feeling is, it should be possible to return an object
> > from the standard override of Value.shape() and use that.
> > in most cases shape() is used to determine the signedness.
> > if that doesn't pan out it can be adapted.
>
> I think we should use:
>
>     class PartitionedSignal(...):
>         ...
>         def __init__(self, layout=1, *, ...):
>             self.layout = SimdLayout.cast(layout)
>         ...
>         shape = None
>
> since shape is limited by nmigen to just Shape(width, signedness) and nothing
> else.
>
> > a way to fit with PartitionPoints is needed: an adapter
> > that creates the dict that goes straight into Partition
> > Points yet also can be used to return the list from
> > which the case statements can be constructed, will find
> > an example
>
> well, unfortunately, PartitionPoints just can't really express the required
> padding info. assuming a layout of:
>
>     SimdLayout({
>         1: unsigned(10),
>         2: unsigned(12),
>         4: unsigned(15),
>     })
>
> which ends up with a 40-bit signal with lanes like so:
>
>     bit #:       30v       20v       10v       0
>            [---u10--][---u10--][---u10--][---u10--]
>            [ppppppp----u12----][ppppppp----u12----]
>            [pppppppppppppppppppppppp------u15-----]
>
> There are 2 possible approaches for emulating PartitionPoints, neither good:
> 1.
> if we have the padding as separate partitions,

yes. this is the simplest option that does not involve throwing away tested code. it is an incremental, nondisruptive approach.

> then the created circuits
> will be unnecessarily complex due to excessive partitions with partition
> points at bits:
> 10, 12, 15, 20, 30, and 32. This gets significantly worse for 8-part layouts.

not if the combinations (assuming a switch) are limited, and not if an additional "assignment" mask is added, where, *incrementally and one at a time*, each of the submodules is evaluated and adapted. "inefficient but functional" is a crucial thing right now, because this needs to be done FAST: it is holding up further development (critical path). yet at the same time over-engineering is going to bite us.

> 2. if we have padding included in their partitions, which gives partition
> points of 10, 20, and 30, then many of the existing operators will calculate
> the wrong results since they failed to account for padding.

that will only happen if the starting points are not set, as a requirement, to be on an "aligned" boundary (exactly like byte/word alignment for software). i covered this in comment #2.

> > but first please can you take out type info and add docstrings
> > so i can see what is going on with SimdLane
>
> yeah, i'll add docs. the type info *is* part of the docs, just put in a
> standardized spot where tools can read it.

honestly they're a nuisance. massive over-engineering, resulting in so much cruft to "manage the existence of the type" that it at least triples the amount of code, with the end result *harder* to read, not easier.

the example there *actually* has 7 breakpoints. PartitionPoints can be used to do it, but it simply needs an "adapter" function (a pair of adapter functions, total about 60 lines). the adapter function takes just the 3 options shown as inputs (which will go into the switch) and outputs *only* 3 possible cases setting the right combination of 7 partition bits.
the "unused" bits in between, although they calculate results, those results are *ignored*, or not wired in or out; but even if not ignored they do no harm. *LATER* we zero them out, and even later we add individual clock gating on some of them if they're partly used. but we can't do clock gating yet because we don't have a clock-gating cell in the cell library.

it's fully compatible with the existing code, nondisruptive, and has the advantage that same-sized same-spec signals can be straight-wired: no messing about with partition-aware muxing just to connect a damn signal.

> after all, someone writing code
> needs to be able to quickly know what types a function expects, not have to
> look through 10(!) different files following a chain of nested function
> calls just to figure out how to call a particular function. That actually
> happened to me several times when I was trying to write the div pipe,
> whereas if they had docs that specified what types they expected, i would
> have saved a bunch of time (10s instead of 10min).

chains like that are pretty standard fare for a big project. tips: keep the files open, on-screen, in multiple terminals, persistently, as a working set. i use a 6x4 virtual desktop and move to a different view, leaving the persistent xterms open. last record was 300 days (power cut), 3,500 vim sessions in 100 xterms on 15 virtual desktops.
(In reply to Jacob Lifshay from comment #11)
> assuming a layout of:
>
>     SimdLayout({
>         1: unsigned(10),
>         2: unsigned(12),
>         4: unsigned(15),
>     })

noooo, different types of signedness down at the lane level is completely out of the question. that significantly changes the design and, worse, breaks a fundamental assumption: that the object can be treated "as if it was one thing" by nmigen Type 1 (ast.*) and Type 2 (dsl.Module) language constructs. this would become very bad and very complicated, very quickly, and i don't see any need for it. therefore the signedness can in fact be removed entirely, and the nmigen Shape signed/unsigned flag used.
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > this suggests a dict as the spec. mantissa:
> >
> > { 0b00 : (64, 53), # FP64
> >   0b01 : (32, 23), # FP32x2
> >   0b10 : (16, 10), # FP16
> >   0b11 : (16, 5),  # BF16
> > }
> >
> > no types, no classes.
>
> Oh, what I have as types that can be SimdLayout.cast (like nmigen
> Shape.cast) are 1 of 3 options:

3 options when one will do, and can be covered by a dict, is massive complication and overengineering. the *actual* needs come from a 2-bit elwidth, where at each elwidth i would have said even the power-2 is implicitly understood, if it wasn't for the fact that in FP we specify BF16 as one of the options.

int elwidths:

    0b00 64
    0b01 32
    0b10 16
    0b11 8

FP:

    0b00 FP64
    0b01 FP32
    0b10 FP16
    0b11 BF16 (16 bit, not 8)

that puts the "aligned power-2" width (or the number of power-2 partitions) in the requirements, and it can be the first of the tuple under each key. the second requirement is the *useful* width at each elwidth, non-power-2 sized. there are no other requirements, because supporting different signed/unsigned is out of the question. it's very much simpler than SimdKind.

> 1. dict-like types that map lane-size-in-abstract-parts to Shape-castable
> values:
> This (after Shape.cast-ing all dict values) is the canonical internal
> representation of a layout (stored in self.lane_shapes).
> example:
> { # keys are always powers of 2
>   1: 5,              # 1-part lanes are u5
>   2: unsigned(3),    # 2-part lanes are u3
>   4: range(3, 25),   # 4-part lanes are u5 since that fits 24
>   8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
> }

we need neither Enums nor range, and *definitely* signed/unsigned is out of the question. even if using this type of specification, how does it relate to elwidths?
> or:
> { # keys are always powers of 2
>   1: signed(1),         # 1-part lanes are i1
>   2: signed(3),         # 2-part lanes are i3
>   4: range(-30, 25),    # 4-part lanes are i6 since that fits -30
>   8: MySignedBoolEnum,  # 8-part lanes are i1
>   16: signed(0),        # 16-part lanes are i0, zero-bit shapes are supported
> }

no. range, enum and signed... arrgh, these are *subtypes*? no, absolutely not. no way. this is far too advanced, far too complicated.

> 2. callables that take lane-size-in-abstract-parts and return Shape-castable
> values. Useful for making non-uniform layouts quickly.
> example:
> lambda lane_size: signed(lane_size * 8)
>
> dict equivalent (for part_count == 8):
> { 1: signed(8),
>   2: signed(16),
>   4: signed(32),
>   8: signed(64),
> }

signed is out of the question.

> 3. a Shape-castable value that is used for all lane-sizes. Useful for making
> uniform layouts quickly.
> example:
> 5
>
> dict equivalent (for part_count == 8):
> { 1: unsigned(5),
>   2: unsigned(5),
>   4: unsigned(5),
>   8: unsigned(5),
> }

unsigned is out of the question.

> From that dict-equivalent (stored as self.lane_shapes), it computes the
> minimum number of bits needed for each abstract part required to pack those
> chosen lane shapes into their abstract parts -- the bits needed per abstract
> part is self.padded_bits_per_part (same value for all parts).
>
> From that, member functions can calculate the total required signal width in
> bits (self.width), and the unpadded/padded slices for each lane
> (self.lane_bits_slice(lane, include_padding=False/True)).
>
> > turning that into PartitionPoints, a simple
> > function creates a dict and a mask Signal.
>
> PartitionPoints needs to be re-vamped to carry all the necessary information
> -- that's what SimdLayout is.

above i have established that you are making fundamental assumptions (not discussed until now, and only just discovering what they are) which was part of the silence and rushing ahead creating code without first talking it through.
what you *believe* is necessary information is in fact unnecessary, and massive over-engineering, making for several days of wasted coding time, unfortunately. *if* it was actually practical to support massive numbers of disparate data types *then* SimdKind etc. would be necessary. but you didn't talk this through before going ahead, so you have gone down a blind alley.

from the *much* more basic pragmatic requirements we can incrementally leverage existing code. if during the actual deployment we end up finding use-cases that are not covered, *then* we can stop and reevaluate. however by keeping it SIMPLE from the very start and taking some short cuts we can get the job done FAST, and optimise later. practical pragmatic engineering.

so. with that in mind, can you please write a quick prototype pair of functions which use a dict as input, as in comment #2:

* the first returning a PartitionPoints as a result
* the second taking a 2-bit Signal, the same dict spec, a PartitionPoints instance and a module (m), where it uses the 2-bit Signal as input to a Switch, and the cases set the prerequisite mask bits embedded as values in the PP instance

each function should (he said) be self-contained, not require calling any external functions, and be an absolute maximum of 35 lines (without comments). *no types*, not even classes! throw them into the codebase near make_partition2 (above PartitionPoints) with some really quick demo tests; we can work out where to put them later.

this is an important and valuable lesson for you, in doing ultra-basic pragmatic programming that an HDL Engineer or a 20-year-old 2nd-year undergraduate student can look at and go, "oh, that's blindingly obvious".
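to illustrate what the second function's switch cases would compute, here is a hedged plain-python model (elwidth_mask_cases is an invented name; in the real function each returned integer would instead become comb assignments to the mask Signal bits inside an m.Switch on the 2-bit elwidth Signal, which is why this sketch deliberately avoids nmigen):

```python
def elwidth_mask_cases(spec, width, positions):
    """For each elwidth in a {elwidth: (lane_width, used_width)} spec,
    compute the mask value its switch case would set: one bit per
    entry of the sorted partition-point positions, 1 = that point is
    "closed" for this elwidth (a lane boundary or a used-field end)."""
    positions = sorted(positions)
    cases = {}
    for elwidth, (lane_width, used_width) in spec.items():
        needed = set()
        for start in range(0, width, lane_width):
            if start:                     # aligned lane boundary
                needed.add(start)
            if used_width < lane_width:   # end of the truncated field
                needed.add(start + used_width)
        cases[elwidth] = sum(1 << i for i, p in enumerate(positions)
                             if p in needed)
    return cases
```

the point of the exercise: the spec dict goes in once to build the PartitionPoints positions, and again here to yield only the 3 or 4 cases actually needed, rather than enumerating every combination of mask bits.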
(In reply to Luke Kenneth Casson Leighton from comment #13) > (In reply to Jacob Lifshay from comment #11) > > > assuming a layout of: > > SimdLayout({ > > 1: unsigned(10), > > 2: unsigned(12), > > 4: unsigned(15), > > }) > > noooo, different types of signedness down at the lane level is > completely out of the question. Actually, SimdLayout specifically asserts that all lanes have the same signedness, essentially it's just there to make passing in Shapes easy, and because, conceptually, it's the lanes who's shape we care about, the whole signal could be signed/unsigned/whatever it doesn't matter as long as the lanes behave as expected. > that significantly changes the design and, worse, breaks a fundamental > assumption that the object can be treated as "if it was one thing" > by nmigen Type 1 (ast.*) and Type 2 (dsl.Module) language constructs. As currently designed SimdLayout is *not* a subclass of Shape, so idk where you think it fits into nmigen, because it doesn't -- nmigen is never passed a SimdLayout instance.
(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Jacob Lifshay from comment #10)
> > (In reply to Luke Kenneth Casson Leighton from comment #9)
> >
> > > this suggests a dict as the spec. mantissa:
> > >
> > > { 0b00 : (64, 53), # FP64
> > >   0b01 : (32, 23), # FP32x2
> > >   0b10 : (16, 10), # FP16
> > >   0b11 : (16, 5),  # BF16
> > > }
> > >
> > > no types, no classes.
> >
> > Oh, what I have as types that can be SimdLayout.cast (like nmigen
> > Shape.cast) are 1 of 3 options:
>
> 3 options when one will do and can be covered by a dict
> is massive complication and overengineering.

internally, it always uses a dict mapping abstract-lane-sizes (pow2 integers) to nmigen Shape instances. the other options (only as inputs to SimdLayout.cast/.__init__) are there because it saves a bunch of code on the caller's side, making it much easier to read, and because accepting a Shape directly (option 3) instead of always requiring a dict makes it way easier to use all our existing non-simd code mostly unmodified.

> the *actual* needs come from a 2-bit elwidth, where at each
> elwidth i would have said even the power-2 is implicitly
> understood, if it wasn't for the fact that in FP we specify
> BF16 as one of the options.

elwidth ends up filling the SimdPartMode.part_starts bits (like the original PartitionPoints, except indexing abstract parts rather than bits). SimdLayout uses those Signals to select which possible lanes are enabled, based on the internal dict mapping abstract-lane-sizes to nmigen Shape instances. The current SimdPartMode/SimdLayout design still supports doing stuff like 4xu8 and 1xu32 all simultaneously, because, iirc, we are still planning on supporting packing ops into simd alus at the 32-bit level, and we could have a 4xu8 op followed by a 1xu32 op.
I initially thought about having an enum as the "mask" signal (like you're proposing here) instead of abstracted partition-points, but I rejected that idea because of the sheer number of enumerants needed to express the desired 4xu8-with-1xu32 combinations.

> int elwidths:
>
> 0b00 64
> 0b01 32
> 0b10 16
> 0b11 8
>
> FP:
>
> 0b00 FP64
> 0b01 FP32
> 0b10 FP16
> 0b11 BF16 16 bit not 8

FP16/BF16 -- didn't think of that, will require either signalling outside of SimdLayout or redesigning SimdLayout to cope.

> that puts the "aligned power 2" width (or the number of power2 partitions)
> on the requirements, and it can be the first of the tuple under each key.
>
> the second requirement is the *useful* width at each elwidth, nonpow2
> sized.
>
> there are no other requirements, because supporting different signed/unsigned
> is out of the question.

I partially disagree: uniform signed/unsigned is needed, and is supported by SimdLayout by having input lanes' types be nmigen Shape (or castable via Shape.cast). non-uniform is unnecessarily complicated and SimdLayout will raise AssertionError for that case.

> > 1. dict-like types that map lane-size-in-abstract-parts to Shape-castable
> > values:
> > This (after Shape.cast-ing all dict values) is the canonical internal
> > representation of a layout (stored in self.lane_shapes).
> > example:
> > { # keys are always powers of 2
> >   1: 5,              # 1-part lanes are u5
> >   2: unsigned(3),    # 2-part lanes are u3
> >   4: range(3, 25),   # 4-part lanes are u5 since that fits 24
> >   8: MyEnum,         # 8-part lanes are u10, assuming MyEnum fits in u10
> > }
>
> we need neither Enums nor range, and *definitely* signed/unsigned is
> out of the question.

all of those are converted to Shape instances by nmigen's Shape.cast (called by SimdLayout's constructor). once constructed, the internal fields use only nmigen Shape instances.
Signed/Unsigned is needed because we need to support signed/unsigned
multiply, signed/unsigned compare, signed/unsigned divide (as a SIMD ALU, not
as a PartitionedSignal op), signed/unsigned right shift, etc.

Oh, i just realized ALU-level signed/unsigned (separate from lane-level
signedness) is another thing that needs to go in the SimdPartMode key, along
with F16/BF16/etc. so the key would be like:

class MyIntKey(Enum):
    U8 = ...
    I8 = ...
    U16 = ...
    I16 = ...
    U32 = ...
    I32 = ...
    U64 = ...
    I64 = ...

INT_WIDTH_IN_PARTS = {
    MyIntKey.U8: 1,
    MyIntKey.I8: 1,
    MyIntKey.U16: 2,
    MyIntKey.I16: 2,
    MyIntKey.U32: 4,
    MyIntKey.I32: 4,
    MyIntKey.U64: 8,
    MyIntKey.I64: 8,
}

class MyFpKey(Enum):
    F16 = ...
    BF16 = ...
    F32 = ...
    F64 = ...

FP_WIDTH_IN_PARTS = {
    MyFpKey.F16: 1,
    MyFpKey.BF16: 1,
    MyFpKey.F32: 2,
    MyFpKey.F64: 4,
}

> even if using this type of specification, how does it relate to
> elwidths?

elwidths determine which lane-size-in-abstract-parts is used.

> > or:
> > { # keys are always powers of 2
> >     1: signed(1), # 1-part lanes are i1
> >     2: signed(3), # 2-part lanes are i3
> >     4: range(-30, 25), # 4-part lanes are i6 since that fits -30
> >     8: MySignedBoolEnum, # 8-part lanes are i1
> >     16: signed(0), # 16-part lanes are i0, zero-bit shapes are supported
> > }
>
> no. range enum and signed... arrgh, these are *subtypes*?

no, they're just types that Shape.cast supports converting to Shape.

> no, absolutely not. no way. this is far too advanced, far too complicated.

it's easy, we let Shape.cast do all the hard work, SimdLayout just passes the
inputs to nmigen Shape.cast.
(In reply to Jacob Lifshay from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > (In reply to Jacob Lifshay from comment #11)
> >
> > > assuming a layout of:
> > > SimdLayout({
> > >     1: unsigned(10),
> > >     2: unsigned(12),
> > >     4: unsigned(15),
> > > })
> >
> > noooo, different types of signedness down at the lane level is
> > completely out of the question.
>
> Actually, SimdLayout specifically asserts that all lanes have the same
> signedness, essentially it's just there to make passing in Shapes easy, and
> because, conceptually, it's the lanes whose shape we care about, the whole
> signal could be signed/unsigned/whatever it doesn't matter as long as the
> lanes behave as expected.

jacob: PartitionedSignal does not comprehend lane-based independent signs or
types. none of the submodules know about it. there is *one* internal Signal
which is given the shape at the max bitwidth required, and its sign, and
that's it. adding anything else on top of that is too much.

the requirements here are very very simple. this is an enhancement we are
discussing which cannot disrupt the existing code, and it has to be minimal
and ultra quick and simple to implement.

you could have written the functions by now: they are far shorter than the
text that you have typed *resisting* the path that i am trying to illustrate
to you and the important insight i would like to share. if i am forced to do
the code myself you will in no way have the same insight and understanding.

please: can you write the two functions, they should take you only about 20
minutes absolute max.
(In reply to Jacob Lifshay from comment #16)
> it's easy, we let Shape.cast do all the hard work, SimdLayout just passes
> the inputs to nmigen Shape.cast.

it's massively overengineered. the very fact that you wrote text 5x longer
than the simple functions i outlined should be telling you everything you
need to know, here. it also goes far ahead of what i can understand and
process, and is absorbing far more time than it should. you're rushing ahead
and solving problems that haven't yet been demonstrated to be needed.

i have *no problem at all* with *ending up* with a design that is as
complicated as SimdKind, however it needs to be demonstrated and driven by
*actual need* rather than "predicted non-demonstrated requirements".

can you please therefore write those two simple functions which will fit
into the existing code. this will then allow progress, and effort put into
fitting PartitionedSignal into the ALUs. *at that* point, if we *actually*
encounter *actual* needs for advanced features, *then and only then* do we
come back and revisit.

yes, i used to also overengineer. it didn't go well.
> Signed/Unsigned is needed because we need to support signed/unsigned
> multiply, signed/unsigned compare, signed/unsigned divide (as a SIMD ALU,
> not as a PartitionedSignal op), signed/unsigned right shift, etc.

yes. and this is already done. but NOT on a per lane basis. on a GLOBAL
basis. therefore sign should not even be part of SimdKind, because it is part
of the PartitionedSignal.

please can you simply write those two simple functions, which i have asked
for four times now, and it will become clear when they are integrated into
PartitionedSignal and PartitionPoints.
wip layout demo script:

from collections.abc import Mapping
from pprint import pprint

# stuff to let it run as stand-alone script
class Shape:
    @staticmethod
    def cast(v):
        if isinstance(v, Shape):
            return v
        assert isinstance(v, int)
        return Shape(v, False)

    def __init__(self, width=1, signed=False):
        self.width = width
        self.signed = signed

    def __repr__(self):
        if self.signed:
            return f"signed({self.width})"
        return f"unsigned({self.width})"

def signed(w):
    return Shape(w, True)

def unsigned(w):
    return Shape(w, False)

def PartitionPoints(pp):
    return pp

# main fn
def layout(elwid, part_counts, lane_shapes):
    if not isinstance(lane_shapes, Mapping):
        lane_shapes = {i: lane_shapes for i in part_counts}
    lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
    signed = lane_shapes[0].signed
    assert all(i.signed == signed for i in lane_shapes.values())
    part_wid = -min(-lane_shapes[i].width // c for i, c in
                    part_counts.items())
    part_count = max(part_counts.values())
    width = part_wid * part_count
    points = {}
    for i, c in part_counts.items():
        def add_p(p):
            points[p] = points.get(p, False) | (elwid == i)
        for start in range(0, part_count, c):
            add_p(start * part_wid)  # start of lane
            add_p(start * part_wid + lane_shapes[i].width)  # start of padding
    points.pop(0, None)
    points.pop(width, None)
    return (PartitionPoints(points), Shape(width, signed), lane_shapes,
            part_wid, part_count)

part_counts = {
    0: 1,
    1: 1,
    2: 2,
    3: 4,
}

for i in range(4):
    pprint((i, layout(i, part_counts, unsigned(3))))

for i in range(4):
    l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
    pprint((i, layout(i, part_counts, l)))
(In reply to Jacob Lifshay from comment #20)
> wip layout demo script:

fantastic, i suspected it would be that short. i'll test it tomorrow. thank
you, really, for trusting me.

the second function "setup" stuff will be a bit longer, but perhaps a
Simulation by doing m = Module() would evade needing to write an
Elaboratable class. the function needs to have as args:

* the dict spec
* a 2-bit Signal() (for now)
* the PartitionPoints that were created from the dict
* a module

it should:

* create an m.Switch on the 2bit signal which
* enumerates the dict key/values
* for each key, "enable" the mask bits that were linked to it in the
  layout() in the PartitionPoints

this function should likewise be ridiculously short.

if layout() needs to adapt/change so that there is a way to *get* the
partition setting on a per-spec-key basis, even as just a simple binary
value that is literally chucked at PP.mask.values(), feel free. so if the
dict spec was

    0b00: (64, 11)
    0b01: (...)
    ...
    ...

then the extra dict returned from layout() which sets mask values would be
something like

    0b00: 0b1111000, # 1s to cut off (empty) partitions
    0b01: 0b0010100

this rather than the actual numerical partition points, because
PP.mask.values() gives you the underlying mask: Cat() them together and
assignment from the dict is possible.

i can never remember which way round PPs specify masks, if it's set to close
or set to open a partition, apologies
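the "second function" described above can be modelled in plain Python like
this. a minimal sketch only: the dict values are invented for illustration,
and the real version would take a Module and build an m.Switch on the 2-bit
elwid Signal, with one m.Case per key assigning the constant to the Cat() of
the PartitionPoints mask bits.

```python
# hypothetical dict spec and the extra mask-value dict that layout()
# would return alongside the PartitionPoints (values invented purely
# for illustration, matching the comment above):
spec = {0b00: (64, 11), 0b01: (32, 8)}
mask_values = {0b00: 0b1111000, 0b01: 0b0010100}

def set_partition_mask(elwid, mask_values):
    """Plain-Python model of the proposed second function: select the
    integer value to drive onto the concatenated PartitionPoints mask
    bits for the current elwid setting.  In nmigen this becomes a
    Switch on the elwid Signal with one Case per dict key."""
    assert elwid in mask_values, "unknown elwidth setting"
    return mask_values[elwid]
```

usage: `set_partition_mask(0b00, mask_values)` picks out the mask constant
for elwidth 0b00, exactly the lookup the Switch/Case statement would
perform in hardware.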
(In reply to Jacob Lifshay from comment #20)

a quick readthrough:

> def layout(elwid, part_counts, lane_shapes):

using the FP example dict from comment #2, i think you might have
misunderstood: the idea is not to create a set of *separate*
PartitionPoints, one per elwidth, the idea is to create a *merged* set of
PartitionPoints (and associated Signal mask)

normally PartitionPoints has to take a mask as input, thus, we need a
function that calculates the length of that mask (see make_partition2 as an
example)

>     if not isinstance(lane_shapes, Mapping):
>         lane_shapes = {i: lane_shapes for i in part_counts}
>     lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
>     signed = lane_shapes[0].signed
>     assert all(i.signed == signed for i in lane_shapes.values())

by removing signed all of these lines can go. this reduces the function by
25%. lane_shapes can become a straight dict containing numbers, Shape can
deal with the signed argument.

>     part_wid = -min(-lane_shapes[i].width // c for i, c in
>                     part_counts.items())
>     part_count = max(part_counts.values())

unfortunately this lets the function determine the length of the underlying
Signal, which is where i think you are getting the mistaken impression that
it is impossible to use PartitionedSignal from. the overall width of the
*full* signal should be passed in as an argument.

>     width = part_wid * part_count

here. width needs to be an input parameter *to* layout(), not a computed
quantity, where part_wid is instead calculated by division. *then* Signals
will end up at uniform predetermined lengths and *then* straight assignment
(wire-only, no muxes) can occur.

if the underlying signal ends up varying by arbitrary lengths determined by
the pieces used, of *course* you can't wire them up safely, because they're
all different total sizes. fixing the length makes the power 2 alignment
sizes all the same, problem solved.
>     points = {}
>     for i, c in part_counts.items():
>         def add_p(p):
>             points[p] = points.get(p, False) | (elwid == i)

this needs to allocate the actual mask Signal, i'm not totally sure how,
with the length not yet being determined, hm.

>         for start in range(0, part_count, c):
>             add_p(start * part_wid)  # start of lane
>             add_p(start * part_wid + lane_shapes[i].width)  # start of padding
>     points.pop(0, None)
>     points.pop(width, None)

ok now the length is known, now a Signal(len(points)) can be created, then a
bit each dropped into the values in sequence order of its keys, end result
just like make_partition2

>     return (PartitionPoints(points), Shape(width, signed), lane_shapes,
>             part_wid, part_count)

Shape() is not needed because width is a given and signed is not layout()'s
responsibility. here also the extra dict needs to be returned which gives
the options, "if you want elwidth=0b00, this is the value you have to set in
the PartitionPoints mask to get that to happen"

> part_counts = {
>     0: 1,
>     1: 1,
>     2: 2,
>     3: 4,
> }

ok yeah, so you went with elwidth => num_of_partitions, cool.

> for i in range(4):
>     l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
>     pprint((i, layout(i, part_counts, l)))

and a straight dict for length of subsignals (excl padding). minus the
signed() it is real dead simple.

the only thing being, as i said, it produces different overall widths
depending on the input lane_shape, which is a showstopper. if overall width
(64 bit) is given as the input it fixes that.

does mean a safety check is needed: elwidth=0b11 means 4 partitions, and 4
partitions of 24 bit obviously won't fit in 64 bit!
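the width-as-input change suggested in the readthrough above can be
sketched as follows. a hedged, simplified model only (hypothetical variant,
not the actual codebase): lane_shapes is a straight dict of bit-lengths as
suggested, signedness is dropped, and PartitionPoints is modelled as a plain
dict mapping each bit-position to the list of elwid values that activate it,
rather than building nmigen (elwid == i) expressions.

```python
def layout_fixed_width(part_counts, lane_shapes, width=64):
    """Sketch of layout() with the overall width passed in as an input
    parameter: part_wid is derived by division instead of computed from
    the lane shapes, so every lane_shapes spec yields the same total
    width, allowing wire-only assignment between Signals."""
    part_count = max(part_counts.values())
    assert width % part_count == 0, "width must divide evenly into parts"
    part_wid = width // part_count
    points = {}
    for i, c in part_counts.items():
        # safety check: a lane spanning c parts must hold lane_shapes[i]
        # bits (e.g. 4 partitions of 24 bits can't fit in 64)
        assert lane_shapes[i] <= c * part_wid, \
            f"lane shape {lane_shapes[i]} too wide at elwid {i}"
        for start in range(0, part_count, c):
            points.setdefault(start * part_wid, []).append(i)  # lane start
            # start of padding within the lane
            points.setdefault(start * part_wid + lane_shapes[i], []).append(i)
    points.pop(0, None)      # the ends of the signal are not
    points.pop(width, None)  # partition points
    return points, part_wid, part_count
```

with the example dicts from the demo script (part_counts {0:1, 1:1, 2:2,
3:4}, lane widths {0:5, 1:6, 2:12, 3:24}) and width=64, part_wid comes out
as 16 regardless of the lane shapes, which is the point of the exercise.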
(In reply to Luke Kenneth Casson Leighton from comment #22)
> the only thing being as i said, it produces different
> overall widths depending on the input lane_shape which
> is a showstopper.
>
> if overall width (64 bit) is given as the input it fixes that.
>
> does mean a safety check is needed, elwidth=0b11 means
> 4 partitions, 4 partitions of 24 bit obviously won't fit
> in 64 bit!

why go through all that when you can just derive the signal width from the
lane shapes -- avoiding wasting extra bits or running out of bits, as well
as avoiding the need to manually specify size? Having the size be computed
doesn't really cause problems imho...
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #20)
>
> a quick readthrough:
>
> > def layout(elwid, part_counts, lane_shapes):
>
> using the FP example dict from comment #2, i think you might have
> misunderstood:
> the idea is not to create a set of *separate* PartitionPoints
> one per elwidth, the idea is to create a *merged* set of PartitionPoints
> (and associated Signal mask)

this is a misunderstanding, which i'm assuming by your later comments you
figured out... you should somehow mark/delete sections of your comment that
you know are incorrect before you click "send", so people don't have to
waste a bunch of time explaining something you already figured out. The code
creates *one* PartitionPoints based on an input elwidth signal/value.
(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #20)
> > if not isinstance(lane_shapes, Mapping):
> >     lane_shapes = {i: lane_shapes for i in part_counts}
> > lane_shapes = {i: Shape.cast(lane_shapes[i]) for i in part_counts}
> > signed = lane_shapes[0].signed
> > assert all(i.signed == signed for i in lane_shapes.values())
>
> by removing signed all of these lines can go. this reduces the function
> by 25%. lane_shapes can become a straight dict containing
> numbers, Shape can deal with the signed argument.

I think we should keep those lines, since it allows us to completely specify
all aspects of a PartitionedSignal's shape/layout/etc. by giving it one
Python value, lane_shapes. elwid and part_counts are retrieved from a
global, they're the equivalent of SimdPartMode. The idea is that code will
look kinda like this:

    self.elwid = Signal(ElWid)
    with set_global_in_scope(self.elwid, ElWid.PART_COUNTS):
        ...
        self.a = PartitionedSignal(3)  # all lanes are unsigned(3)
        self.b = PartitionedSignal(signed(5))  # all lanes are signed(5)
        # all lanes are whatever shape StateEnum is
        self.state = PartitionedSignal(StateEnum)
        # same shape as self.state, could use .like() or StateEnum directly
        self.state2 = PartitionedSignal(self.state.layout)
        self.mantissa = PartitionedSignal({
            ElWid.F16: 5,  # from memory, probably off
            ElWid.BF16: 8,
            ElWid.F32: 8,
            ElWid.F64: 11,
        })
        ...

In particular, notice that you can use nmigen Shapes without needing to
write a dict everywhere, greatly reducing the modifications needed to our
existing ALUs/FUs/etc.
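the input-normalization being argued over here can be modelled in a few
lines of plain Python. a hedged sketch only (hypothetical helper name, not
the real SimdLayout API): widths are plain ints, whereas the real code would
additionally run each value through nmigen's Shape.cast.

```python
from collections.abc import Mapping

def cast_lane_shapes(lane_shapes, part_counts):
    """Plain-Python model of the lane_shapes normalization under
    discussion: a single width, or a dict of per-lane-kind widths, is
    expanded so that every lane kind in part_counts has an entry."""
    if not isinstance(lane_shapes, Mapping):
        # one value given: every lane kind gets the same shape
        lane_shapes = {i: lane_shapes for i in part_counts}
    # keep only (and require) the lane kinds that part_counts knows about
    return {i: lane_shapes[i] for i in part_counts}
```

usage: `cast_lane_shapes(3, part_counts)` broadcasts a single width to all
lane kinds, while passing a dict through unchanged, which is the
convenience the quoted lines provide.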
(In reply to Jacob Lifshay from comment #25)
>     self.mantissa = PartitionedSignal({
>         ElWid.F16: 5,  # from memory, probably off
>         ElWid.BF16: 8,
>         ElWid.F32: 8,
>         ElWid.F64: 11,
>     })

oops, that was waay off... i wrote mantissa but meant to write exponent
(In reply to Jacob Lifshay from comment #23)
> why go through all that when you can just derive the signal width from the
> lane shapes -- avoiding wasting extra bits or running out of bits, as well
> as avoiding the need to manually specify size? Having the size be computed
> doesn't really cause problems imho...

because the idea is to do a literal global/search/replace "Signal" with
"PartitionedSignal" right throughout the entirety of the ALU codebase. and
the input argument on all current-Signals is not an elwidth, it's an
"overall width".

(actually, have a PSpec which dynamically determines at run/compile time
*which* class to use: Signal or PartitionedSignal).

when i said "there will be NO fundamental changing of the ALU code" i REALLY
meant it. this is a hard critical requirement. we do not have time to waste
on total disruptive redesigns of the existing ALU codebase, tens of
thousands of lines of work and just as many unit tests.

this type of "minimal code-changes" is something i learned the hard way 24
years ago. when transitioning to python programming it became even more
important, because i witnessed massive python projects, 350,000 to a million
lines, literally falling apart because of low-level fundamental changes
where unit test coverage was either inadequate or, in one particular
pretty-much-Corporate-suicidal case, non-existent.

that one was a bundle of fun: the CTO, it turned out, was sub-consciously
but actively sabotaging efforts to stabilise the codebase, because he
himself hadn't been able to stabilise it in 5 years, therefore he didn't see
why anyone else could either... oh, and therefore made *certain* that was
true by changing low-level base classes and committing them to the active
development branch.

bottom line is: i am responsible for ensuring that the codebase is stable,
continuously-operational, and that changes are made with the absolute bare
minimum disruption and in an incremental non-disruptive manner.
if we want a more complex design it *has* to be done in an *evolutionary*
fashion through a chained-series of *SMALL* incremental patches, one after
the other. every time you say "but this feature would be great" i have to do
a MASSIVE re-evaluation of the knock-on implications down that chain, which
i am currently maintaining in my head.

this is why it is critically important that, because you haven't been
helping with the HDL for over 18 months, and aren't familiar with it, to get
back into the flow you help do the task that is needed, which *i* define
[i am quite happy to admit that i am slightly embarrassed to have to put it
like that]. if you had helped with the HDL codebase full-time, when i made
the announcement that TestIssuer was to begin development, you would now
have the working knowledge needed.

(In reply to Jacob Lifshay from comment #24)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > (In reply to Jacob Lifshay from comment #20)
> >
> > a quick readthrough:
> >
> > > def layout(elwid, part_counts, lane_shapes):
> >
> > using the FP example dict from comment #2, i think you might have
> > misunderstood:
> > the idea is not to create a set of *separate* PartitionPoints
> > one per elwidth, the idea is to create a *merged* set of PartitionPoints
> > (and associated Signal mask)
>
> this is a misunderstanding, which i'm assuming by your later comments you
> figured out...

i hadn't.

> The code creates *one* PartitionPoints based on an input elwidth
> signal/value.

the usage for-loop confused me from the examples, when it set that elwidth
to zero width:

    for i in range(4):
        pprint((i, layout(i, part_counts, unsigned(3))))

if i am reading you correctly, that's a zero-width Signal. if the example
had been:

    for i in range(4):
        i = Signal(i)  # Signal of width between 0 and 3
        pprint((i, layout(i, part_counts, unsigned(3))))

i would have understood. maybe. and also wondered why a Signal of width zero
is being passed in.
(In reply to Jacob Lifshay from comment #25)
> I think we should keep those lines, since it allows us to completely specify
> all aspects of a PartitionedSignal's shape/layout/etc.

this defeats the object of the exercise, which is to show you how to do
incremental pragmatic development, and to keep "feet-on-the-ground".

> by giving it one
> Python value, lane_shapes. elwid and part_counts are retrieved from a
> global, they're the equivalent of SimdPartMode. The idea is that code will
> look kinda like this:
> self.elwid = Signal(ElWid)

for practical reasons this is always going to be Elwid=2. i get that it
would be "nice to have" support for other data types... .... but... they...
are... not... critical. if they need to be added, they can be added...
*aaaafterr* the first patch lands which adds support for setting Elwidth=2
as a hard-coded parameter.

that would be a second incremental patch, and that's fine. but it is NOT
fine if it's done even before it's been demonstrated to even be necessary.
understand?

> with set_global_in_scope(self.elwid, ElWid.PART_COUNTS):
>     ...
>     self.a = PartitionedSignal(3)  # all lanes are unsigned(3)
>     self.b = PartitionedSignal(signed(5))  # all lanes are signed(5)
>     # all lanes are whatever shape StateEnum is
>     self.state = PartitionedSignal(StateEnum)

this is the general idea (but self.state *also* hidden behind the
ContextManager). it needs to be more along the lines of:

    with set_global_in_scope(pspec, self.elwid, ElWid.PART_COUNTS) as ps:
        self.a = ps.PSpecContextSensitiveSignalType(3)  # all lanes unsigned(3)

where from pspec the *actual* type of Signal to be deployed is picked up.
that can be **EITHER** a PartitionedSignal (compile-time option 1) **OR** it
can be Signal (compile-time option 2)

hence why the arguments need to be exactly the same, and why the dsl.Module
needs to be exactly the same, and why ast.* need to behave exactly the same.
remember this has 18 months of planning behind it.
there *will* be *zero* code-modifications to turn the entire ALU set to
SIMD. (see below: if that turns out not to be true, we *make* it true. i
suspect there are some places where ALU changes might indeed be needed,
sigh, i'm *hoping* this turns out not to be the case).

>     self.mantissa = PartitionedSignal({
>         ElWid.F16: 5,  # from memory, probably off

:)

>         ElWid.BF16: 8,
>         ElWid.F32: 8,
>         ElWid.F64: 11,

and yes, this could be one of the exceptions, although i am also hoping that
this can be "derived" from the existing (one-line?) dictionary that sets the
*scalar* FP pipeline types. i forget exactly where it is.

> In particular, notice that you can use nmigen Shapes without needing to
> write a dict everywhere, greatly reducing the modifications needed to our
> existing ALUs/FUs/etc.

except that standard nmigen Signal doesn't *support* those advanced
features, neither does the existing codebase use them, therefore it is
completely unnecessary to add this as a feature *which we're not going to
use*: it is a waste of time.

even in the remote possibility that my 18+ month experience of having been
the person involved in the development of every single pipeline was not
correct, invalidating that assessment, there are two possibilities:

1) we re-evaluate, and we re-evaluate *at the time that it is needed*
   (which saves wasting time developing something that's not needed, which
   will only confuse other people: "why is this here when it's not needed
   or used? it's complicated and unnecessary")

2) we make the tough decision to break the rule "no changes to any
   pipeline code" and AVOID the need entirely. (making sure in the process
   that at every single step of those changes the code still operates when
   PSpec "mode==scalar, please use Signal not PartitionedSignal throughout
   the entire ALU codebase" is set)

bottom line: this is all about making the bare minimum code-changes, across
what is already an extremely complex set of inter-connected parts.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> 2) we make the tough decision to break the rule "no changes to any
>    pipeline code" and AVOID the need entirely.

Well, we'll need to break that rule anyway for FP pipelines, which isn't
much of an additional problem since they still all need modification anyway
to add F16/BF16 and to handle flags/exceptions.

Idea: have a XLEN global that is an instance of a new class SimdMap (a
generalization of SimdLayout's dict handling code to arbitrary values per
lane-kind -- in this case arbitrary integers) and override *, +, -, etc. so
it can do arithmetic like is done in the spec pseudo-code, ending up with a
dict wrapped by SimdMap that can be passed into SimdLayout, bit-slicing,
shifting, constants, etc. so code could look like this:

# XLEN global constant definition
XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

# example definition for addg6s, basically directly
# translating pseudo-code to nmigen+simd.
# intentionally not using standard ALU interface, for ease of exposition:
class AddG6s(Elaboratable):
    def __init__(self):
        with simd_scope(self, IntElWid, make_elwid_attr=True):
            self.RA = PartitionedSignal(XLEN)
            self.RB = PartitionedSignal(XLEN)
            self.RT = PartitionedSignal(XLEN)

    def elaborate(self, platform):
        m = Module()
        with simd_scope(self, IntElWid, m=m):
            wide_RA = PartitionedSignal(unsigned(4 + XLEN))
            wide_RB = PartitionedSignal(unsigned(4 + XLEN))
            sum = PartitionedSignal(unsigned(4 + XLEN))
            carries = PartitionedSignal(unsigned(4 + XLEN))
            ones = PartitionedSignal(XLEN)
            nibbles_need_sixes = PartitionedSignal(XLEN)
            z4 = Const(0, 4)
            m.d.comb += [
                wide_RA.eq(Cat(self.RA, z4)),
                wide_RB.eq(Cat(self.RB, z4)),
                sum.eq(wide_RA + wide_RB),
                carries.eq(sum ^ wide_RA ^ wide_RB),
                ones.eq(Repl(Const(1, 4), XLEN // 4)),
                nibbles_need_sixes.eq(~carries[0:XLEN-1] & ones),
                self.RT.eq(nibbles_need_sixes * 6),
            ]
        return m

> > (making sure in the process that at every single step of those
> > changes that
> > the code still operates when PSpec "mode==scalar,
> > please use Signal not PartitionedSignal throughout the entire
> > ALU codebase" is set)

well, when pspec==scalar, all that needs to happen is part_sizes is set to
{FpElWid.F64: 1} or {IntElWid.I64: 1}, then only the 64-bit entry in the
dicts is used and we produce something that should be 100% the same circuit
as scalar (except for some names and module boundaries). no need to have
everything replace itself with Signal.

> bottom line this is all about making the bare minimum code-changes,
> across what is already an extremely complex set of inter-connected
> parts.
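the SimdMap idea proposed above can be sketched minimally in plain Python.
a hedged sketch under stated assumptions: the class name matches the
proposal but the implementation is invented here, and string keys stand in
for the ElWid enum for simplicity. the point is just that arithmetic is
applied elementwise per lane kind, so pseudo-code like `XLEN // 4` or
`4 + XLEN` produces another per-lane-kind dict.

```python
import operator

class SimdMap:
    """Minimal sketch of the proposed SimdMap: wraps a dict mapping
    lane-kind keys to per-lane integer values, with elementwise
    arithmetic so spec pseudo-code translates directly."""
    def __init__(self, values):
        self.values = dict(values)

    def _binop(self, other, op):
        # support SimdMap-op-SimdMap and SimdMap-op-plain-int
        if isinstance(other, SimdMap):
            assert self.values.keys() == other.values.keys()
            return SimdMap({k: op(v, other.values[k])
                            for k, v in self.values.items()})
        return SimdMap({k: op(v, other) for k, v in self.values.items()})

    def __add__(self, other): return self._binop(other, operator.add)
    def __sub__(self, other): return self._binop(other, operator.sub)
    def __mul__(self, other): return self._binop(other, operator.mul)
    def __floordiv__(self, other): return self._binop(other, operator.floordiv)
    # addition/multiplication commute, so int-op-SimdMap works too
    # (needed for expressions like unsigned(4 + XLEN) above)
    __radd__ = __add__
    __rmul__ = __mul__

# string keys stand in for the ElWid enum in this sketch:
XLEN = SimdMap({"I8": 8, "I16": 16, "I32": 32, "I64": 64})
```

with this, `(XLEN // 4).values` gives `{"I8": 2, "I16": 4, "I32": 8,
"I64": 16}`, i.e. the per-elwidth nibble counts used by the addg6s example.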
(In reply to Jacob Lifshay from comment #28)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > 2) we make the tough decision to break the rule "no changes to any
> >    pipeline code" and AVOID the need entirely.
>
> Well, we'll need to break that rule anyway for FP pipelines, which isn't
> much of an additional problem since they still all need modification anyway
> to add F16/BF16

FP16 was already done, it was possible to do full coverage using it, and
found bugs which, through parameterisation, meant that FP32 and FP64 didn't
have them either. BF16 is a one-liner in the dictionary i mentioned (and
can't remember where it is). unit tests are

> and to handle flags/exceptions.

that's orthogonal and a separate series of patches that have no bearing on
this.

> Idea: have a XLEN global that is an instance of a new class SimdMap (a
> generalization of SimdLayout's dict handling code to arbitrary values

jacob: again, that's *another* patch - a seriously big one. it can be
considered.... *AFTER* the completion of the first (minimalist) set of
changes.

the other idea would mean that the ALUs become critically dependent on
PartitionedSignal, which may *or may not* have the full capabilities of
Signal (Array), and has *not* been designed to be a "scalar Signal
replacement". it always creates sub-modules for example. none of those
modules have been tested with zero-length (no) partitioning. if used as a
single-scalar it would result in significantly extra gates and may not
actually work with zero partitions (which is what a scalar is).

that would therefore be extra work and, to remind you for about the fiftieth
time, we are under time and budgetary pressure.

please, please: *focus*. the more time is spent discussing possible future
designs and enhancements instead of doing the work that's needed, the more i
have to cut drastically back on what can be done. we are running out of
time. please....
*focus* on the task that's needed, and listen to the time-saving risk-reducing techniques that took me over 20 years to learn, at considerable cost, in my career.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jacob Lifshay from comment #23)
> > this is a misunderstanding, which i'm assuming by your later comments you
> > figured out...
>
> i hadn't.

oh, you still misunderstand...

> > The code creates *one* PartitionPoints based on an input elwidth
> > signal/value.
>
> the usage for-loop confused me from the examples, when it set that
> elwidth to zero width:
>
> for i in range(4):
>     pprint((i, layout(i, part_counts, unsigned(3))))
>
> if i am reading you correctly, that's a zero-width Signal.

there aren't actually any signals there... if there were a signal, it'd be
like:

    for i in range(4):
        m = Module()
        elwid = Signal(2)
        pp, b, c, d, e = layout(elwid, part_counts, unsigned(3))
        def process():
            yield elwid.eq(i)
            pp = yield from pp.get_values()  # get nmigen to evaluate pp
            pprint((i, (pp, b, c, d, e)))
        sim = Simulator(m)
        sim.add_process(process)
        sim.run()

the elwid input parameter is a nmigen Value, not a Shape. This is what type
annotations would have told you... though it does also need docs... i got it
to work right before our meeting and ran out of time.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> (In reply to Jacob Lifshay from comment #23)
> > why go through all that when you can just derive the signal width from the
> > lane shapes -- avoiding wasting extra bits or running out of bits, as well
> > as avoiding the need to manually specify size? Having the size be computed
> > doesn't really cause problems imho...
>
> because the idea is to do a literal global/search/replace "Signal"
> with "PartitionedSignal" right throughout the entirety of the ALU
> codebase.
>
> and the input argument on all current-Signals is not an elwidth,
> it's an "overall width".

I'm 99% sure you're mis-identifying that... the input argument for all
current Signals *is* the bit-width of the current lane (aka. elwidth or
XLEN), except that our code currently is specialized for the specific case
of elwidth=64. This is exactly how all SIMT works (which is exactly what
we're trying to do with transparent vectorization). The types and sizes are
the type/size of a *single* lane, not all-lanes-mushed-together.

> (actually, have a PSpec which dynamically determines at run/compile
> time *which* class to use: Signal or PartitionedSignal).

That's just like how SIMT programs can be run un-vectorized.

> when i said "there will be NO fundamental changing of the ALU code"
> i REALLY meant it.

which is why we need to go all-in on SIMT, since that is the *one
vectorization paradigm* that requires minimal modifications to our ALU code.
(In reply to Jacob Lifshay from comment #30)
> there aren't actually any signals there... if there were a signal, it'd be
> like:
> for i in range(4):
>     m = Module()
>     elwid = Signal(2)
>     pp, b, c, d, e = layout(elwid, part_counts, unsigned(3))
>     def process():
>         yield elwid.eq(i)
>         pp = yield from pp.get_values()  # get nmigen to evaluate pp
>         pprint((i, (pp, b, c, d, e)))
>     sim = Simulator(m)
>     sim.add_process(process)
>     sim.run()

ah *ha*! ok. now i get it. ok, so what i envisaged would be two separate
functions (one which allocates an actual pp mask after determining the
length, followed by a *second* one that would involve a Switch statement on
the elwidths); you came up with an elegant solution that merged the two. the
key function around which that hinges is add_p(). nice.

now. what are the implications of that being a nmigen AST code-fragment
rather than a plain slice on a Signal... i'll have to think that through.

> the elwid input parameter is a nmigen Value,

yep, totally got it. which means if the elwidth is an Enum it still works,
and if it's a 3 bit or 4 bit Signal it *still* works. like it.
(In reply to Jacob Lifshay from comment #31)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > (In reply to Jacob Lifshay from comment #23)
> >
> > > why go through all that when you can just derive the signal width from
> > > the lane shapes -- avoiding wasting extra bits or running out of bits,
> > > as well as avoiding the need to manually specify size? Having the size
> > > be computed doesn't really cause problems imho...
> >
> > because the idea is to do a literal global/search/replace "Signal"
> > with "PartitionedSignal" right throughout the entirety of the ALU
> > codebase.
> >
> > and the input argument on all current-Signals is not an elwidth,
> > it's an "overall width".
>
> I'm 99% sure you're mis-identifying that...

yes i was. i've got it now.

> the input argument for all
> current Signals *is* the bit-width of the current lane (aka. elwidth or
> XLEN) except that our code currently is specialized for the specific case of
> elwidth=64.

yes. in the discussions with Paul and Toshaan i seriously considered an XLEN
parameter in the HDL which would propagate from runtime through a PSpec (see
test_issuer.py for an example) and would allow us to test a scalar 32 bit
Power ISA core, see how many fewer gates are needed. and, just for laughs,
to try an XLEN=16 core. but... time being what it is...

> This is exactly how all SIMT works (which is exactly what we're
> trying to do with transparent vectorization). The types and sizes are the
> type/size of a *single* lane, not all-lanes-mushed-together.

there is a lot of misinformation about SIMT. SIMT is standard cores (which
may or may not have Packed SIMD ALUs) where they are "normal" cores in every
respect *except* that they share one single PC, one single L1 I-cache, one
single fetch and decoder, that *broadcasts* in a synchronous fashion that
one instruction to *all* cores.
the upshot of such a design is that whilst it allows for a much higher
computational gate density (only one L1 cache per dozens of cores) it is
an absolute bitch and a half to program. not only that, but if you want
to run standard general-purpose POSIX applications, you can't: you have
to *disable* all but one of the cores on the broadcast bus, which of
course means absolute rubbish performance.

i am not seeing how such an architecture would help, here, and for the
above *general purpose* performance and compiler-hassle reasons i would
be very much against pursuing this design as a very first processor.

> which is why we need to go all-in on SIMT, since that is the *one
> vectorization paradigm* that requires minimal modifications to our ALU code.

only if it is acceptable to have the punishment that comes with SIMT, and
to then have the exact same XLEN problem unless going with Packed SIMD
ALUs, at which point we are back to the exact same issue (with the added
complication of degraded general performance and a hell of a lot of
compiler work); or, having special separate scalar XLEN=8, XLEN=16 and
XLEN=32 cores which are disabled/idle when not given suitable work, which
is a variant on the entire reason why PartitionedSignal was created in
the first place: to avoid exactly that kind of duplication.
(In reply to Luke Kenneth Casson Leighton from comment #32)
> what are the implications of that being a nmigen AST
> code-fragment rather than a plain slice on a Signal...
> i'll have to think that through.

got it. it wasn't actually whether it's a code-fragment or not at all
(which comes with inherent risks of AST Expression duplication which
yosys can *in no way* properly handle and optimise away), it was the
submodules. example:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/assign.py;h=00cebc6dab33a25f93d4e790dd7883fc859bae9e;hb=HEAD#l88

that Switch, to support elwidths, would become Switch(self.elwidth)

the next problem is: how do you now reduce the number of Case() statements
needed to four, or, more to the point, on doing so, which bits of data do
you copy from which partition, without the prerequisite information?

turns out that *if* you have the binary values (integer values) associated
with the respective elwidth value, that is exactly what is needed. which
is why i asked for it (as the 2nd function).

now, i believe it may be possible to adapt layout() very easily, by making
it a two-stage process.

stage 1: add_p becomes:

    points.setdefault(p, []).append(i)

stage 2: a post-processing phase that goes through each list:

    points[p] = map(lambda i: elwidth == i, points[p])
    points[p] = reduce(operator.or_, points[p])

(or better, use nmutil.treereduce, there)

stage 3: a second postprocess on a copy of points which does a bitwise
shift-and-or:

    points[p] = map(lambda i: 1 << i, points[p])
    points[p] = reduce(operator.or_, points[p])

which i *think* should create the binary values needed which represent
the settings of when elwidth == i

urrr... maybe not. no. drat. it's the order, the bit-position, depending
on the length of p. something more like:

    plist = list(points.keys())
    for i in range(num_elwidths):  # all possible elwidth values
        bitp[i] = 0
        for p, elwidths in points.items():
            if i in elwidths:
                bitpos = plist.index(p)
                bitp[i] |= 1 << bitpos

that's more like it.
that gives a binary value which should match exactly with the PPs that light up should an elwidth be set to any of its four values.
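as a sanity check of the reasoning in comment #34, the final stage can be
sketched in plain Python: given a dict mapping each partition-point
position to the list of elwidth values for which that point is active, it
produces the per-elwidth binary PartPoints mask. the function name and
dict shape here are illustrative, not the actual layout() API.

```python
def elwidth_masks(points):
    """points: {point_position: [elwidths for which that point is active]}
    returns {elwidth: mask}, one bit per partition point (in sorted
    point order), set when that point "lights up" for that elwidth."""
    plist = sorted(points.keys())
    all_elwidths = set()
    for elwidths in points.values():
        all_elwidths.update(elwidths)
    bitp = {}
    for ew in sorted(all_elwidths):
        mask = 0
        for p, elwidths in points.items():
            if ew in elwidths:
                mask |= 1 << plist.index(p)
        bitp[ew] = mask
    return bitp
```

for a 32-bit signal with points at 8/16/24, if elwidth 0 (8-bit lanes)
activates all three points and elwidth 1 (16-bit lanes) only the middle
one, the masks come out as 0b111 and 0b010 respectively.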
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #31)
>
> > This is exactly how all SIMT works (which is exactly what we're
> > trying to do with transparent vectorization). The types and sizes are the
> > type/size of a *single* lane, not all-lanes-mushed-together.
>
> there is a lot of misinformation about SIMT. SIMT is standard cores
> (which may or may not have Packed SIMD ALUs) where they are "normal"
> cores in every respect *except* that they share one single PC,
> one single L1 I-cache, one single fetch and decoder, that *broadcasts*
> in a synchronous fashion that one instruction to *all* cores.

I don't care how SIMT is traditionally implemented in a GPU, that's totally
irrelevant and not what I intended. What I meant is that our HDL would be
written like how SIMT is used from a game programmer's perspective -- where
if a game programmer writes:

    float a, b;
    int c, d;
    ...
    a = a > b ? c : d;

the gpu actually runs (vectorized with 64 lanes):

    f32x64 a, b;
    i32x64 c, d, muxed;
    boolx64 cond;
    ...
    cond = a > b; // lane-wise compare
    muxed = mux(cond, c, d);
    a = convert<f32x64>(muxed);

and if a programmer were to write (generalizing a bit to support dynamic
XLEN):

    xlen_int_t a, b, c, d;
    ...
    a = b + c * d;

what would actually run is:

    vec_with_xlen_lanes_t a, b, c, d;
    ...
    for(lane_t lane : currently_active_lanes()) { // C++11 for-each
        xlen_int_t p = c[lane] * d[lane];
        a[lane] = b[lane] + p;
    }

> > the input argument for all
> > current Signals *is* the bit-width of the current lane (aka. elwidth or
> > XLEN) except that our code currently is specialized for the specific case of
> > elwidth=64.
>
> yes. in the discussions with Paul and Toshaan i seriously considered
> an XLEN parameter in the HDL which would propagate from runtime through
> a PSpec (see test_issuer.py for an example) and would allow us to test
> a scalar 32 bit Power ISA core, see how many less gates are needed.
> and, just for laughs, to try an XLEN=16 core.
>
> but... time being what it is...

If we want any chance of matching the spec pseudo-code, we will need
xlen-ification of our HDL in some form, since currently it's hardwired for
only xlen=64. Why not make it look more like the pseudo-code with
arithmetic on an XLEN constant? I could write the class needed for XLEN in
a day, it's not that complicated:

    class SimdMap:
        def __init__(self, values, *, convert=True):
            if convert:
                # convert values to a dict by letting map do all the hard work
                values = SimdMap.map(lambda v: v, values).values
            self.values = values

        @staticmethod
        def map(f, *args):
            # TODO: fix bad wording, like builtin map, but for
            # SimdMap instead of iterables
            """apply a function `f` to arguments which are one of:
            * Mapping with ElWid-typed keys
            * SimdMap instance
            * or scalar
            return a SimdMap of the results
            """
            retval = {}
            for i in ElWid:
                mapped_args = []
                for arg in args:
                    if isinstance(arg, SimdMap):
                        arg = arg.values[i]
                    elif isinstance(arg, Mapping):
                        arg = arg[i]
                    mapped_args.append(arg)
                retval[i] = f(*mapped_args)
            return SimdMap(retval, convert=False)

        def __add__(self, other):
            return SimdMap.map(operator.add, self, other)

        def __radd__(self, other):
            return SimdMap.map(operator.add, other, self)

        def __sub__(self, other):
            return SimdMap.map(operator.sub, self, other)

        def __rsub__(self, other):
            return SimdMap.map(operator.sub, other, self)

        def __mul__(self, other):
            return SimdMap.map(operator.mul, self, other)

        def __rmul__(self, other):
            return SimdMap.map(operator.mul, other, self)

        def __floordiv__(self, other):
            return SimdMap.map(operator.floordiv, self, other)

        def __rfloordiv__(self, other):
            return SimdMap.map(operator.floordiv, other, self)

        ...

    XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

    # layouts really should be a subclass or wrapper over SimdMap
    # with Shapes as values, but lkcl insisted...
    def layout(elwid, part_counts, lane_shapes):
        lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values
        signed = lane_shapes[ElWid.I64].signed
        # rest unmodified...
        assert all(i.signed == signed for i in lane_shapes.values())
        part_wid = -min(-lane_shapes[i].width // c
                        for i, c in part_counts.items())
        ...

    ...

    # now the following works, because PartitionedSignal uses SimdMap on
    # inputs for shapes, slicing, etc.

    # example definition for addg6s, basically directly
    # translating pseudo-code to nmigen+simd.
    # intentionally not using standard ALU interface, for ease of exposition:
    class AddG6s(Elaboratable):
        def __init__(self):
            with simd_scope(self, IntElWid, make_elwid_attr=True):
                self.RA = PartitionedSignal(XLEN)
                self.RB = PartitionedSignal(XLEN)
                self.RT = PartitionedSignal(XLEN)

        def elaborate(self, platform):
            m = Module()
            with simd_scope(self, IntElWid, m=m):
                wide_RA = PartitionedSignal(unsigned(4 + XLEN))
                wide_RB = PartitionedSignal(unsigned(4 + XLEN))
                sum = PartitionedSignal(unsigned(4 + XLEN))
                carries = PartitionedSignal(unsigned(4 + XLEN))
                ones = PartitionedSignal(XLEN)
                nibbles_need_sixes = PartitionedSignal(XLEN)
                z4 = Const(0, 4)
                m.d.comb += [
                    wide_RA.eq(Cat(self.RA, z4)),
                    wide_RB.eq(Cat(self.RB, z4)),
                    sum.eq(wide_RA + wide_RB),
                    carries.eq(sum ^ wide_RA ^ wide_RB),
                    ones.eq(Repl(Const(1, 4), XLEN // 4)),
                    nibbles_need_sixes.eq(~carries[0:XLEN-1] & ones),
                    self.RT.eq(nibbles_need_sixes * 6),
                ]
            return m
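to make the per-elwidth arithmetic concrete, here is a stripped-down,
runnable sketch of the SimdMap idea above (only the operators needed to
demonstrate XLEN arithmetic; the full class would cover the whole operator
set and Mapping inputs, and the ElWid enum here is a stand-in, not the
real one):

```python
import operator
from enum import Enum

class ElWid(Enum):  # stand-in for the real elwidth enum
    I8 = 8
    I16 = 16
    I32 = 32
    I64 = 64

class SimdMap:
    """a per-elwidth map supporting elementwise arithmetic"""
    def __init__(self, values):
        self.values = dict(values)

    @staticmethod
    def map(f, *args):
        # apply f across every elwidth; plain scalars are broadcast
        def get(a, ew):
            return a.values[ew] if isinstance(a, SimdMap) else a
        return SimdMap({ew: f(*(get(a, ew) for a in args)) for ew in ElWid})

    def __add__(self, other):
        return SimdMap.map(operator.add, self, other)

    def __radd__(self, other):
        return SimdMap.map(operator.add, other, self)

    def __floordiv__(self, other):
        return SimdMap.map(operator.floordiv, self, other)

XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
```

with this, expressions like `4 + XLEN` and `XLEN // 4` from the AddG6s
example evaluate to a distinct value per elwidth: `(4 + XLEN)` gives 20
for ElWid.I16, and `(XLEN // 4)` gives 16 for ElWid.I64.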
(In reply to Luke Kenneth Casson Leighton from comment #34)
> (In reply to Luke Kenneth Casson Leighton from comment #32)
>
> > what are the implications of that being a nmigen AST
> > code-fragment rather than a plain slice on a Signal...
> > i'll have to think that through.
>
> got it.
>
> it wasn't actually whether it's a code-fragment or not at all
> (which comes with inherent risks of AST Expression duplication
> which yosys can *in no way* properly handle and optimise away)

only if you never run `opt`...expression deduplication is a pretty trivial
compiler optimization. In my experience, yosys deduplicates built-in
expressions just fine.

> it was the submodules.

we can deduplicate expressions ourselves pretty easily:

    def deduped(f):
        map = {}
        def wrapper(*args, **kwargs):
            key0 = []
            for arg in args:
                key0.append(id(arg))
            key1 = []
            for k, v in kwargs.items():
                key1.append((id(k), id(v)))
            key = tuple(key0), tuple(key1)
            if key in map:
                return map[key][0]
            retval = f(*args, **kwargs)
            # keep reference to args and kwargs to avoid ids
            # getting reused for something else. we can't use weakref,
            # too many things don't work with it such as `list` and `dict`
            map[key] = retval, args, kwargs
            return retval
        return wrapper

    class PartitionedSignal:
        ...
        @deduped
        def __add__(self, other):
            ...
        @deduped
        def __sub__(self, other):
            ...
        ...

> example:
>
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/
> assign.py;h=00cebc6dab33a25f93d4e790dd7883fc859bae9e;hb=HEAD#l88
>
> that Switch, to support elwidths, would become Switch(self.elwidth)
>
> the next problem is: how do you now reduce the number of Case() statements
> needed to four, or, more to the point, on doing so which bits of data do
> you copy from which partition, without the prerequisite information?
easy, each `part_wid` wide bit slice in any PartitionedSignal can be
independently muxed -- as long as the condition is first converted to the
layout:

    {
        ElWid.I8: 1,
        ElWid.I16: 1,
        ElWid.I32: 1,
        ElWid.I64: 1,
    }

with padding being a sign-extended version of each lane... Aka. the result
of PartitionedSignal.bool(). (or what *should* be the result if design were
left to me).

lanes for elwid=16:

    0         1         2         3         4         5         6         7
    [-u1-pppp][-u1-pppp][-u1-pppp][-u1-pppp]

example value for that layout for elwid==16 and lanes of False, True,
False, True:

    0 1 2 3 4 5 6 7
    0 0 1 1 0 0 1 1

m.Case simply is an If/Elif that uses the results of
PartitionedSignal.matches() instead of .bool().

    a = PartitionedSignal(XLEN - 2)
    b = PartitionedSignal(3)
    with m.If(a):
        with m.Switch(b):
            with m.Case('1-0'):
                m.d.comb += c.eq(d)
    with m.Elif(b == 1):
        m.d.sync += e.eq(f)

compiles to (full trace, with unnecessary assignments omitted):

    a = PartitionedSignal(XLEN - 2)
    b = PartitionedSignal(3)
    # If.enter
    if_cond = cond_stack[-1] & a.bool()
    cond_stack.append(if_cond)
    # If body
    # Switch.enter
    switch_value_stack.append(b)
    # Switch body
    # Case.enter
    case_cond = cond_stack[-1] & switch_value_stack[-1].matches('1-0')
    cond_stack.append(case_cond)
    # Case body
    # assign c.eq(d)
    rhs = d.cast_to(c.layout)
    for i in range(part_count):
        with m.If(cond_stack[-1].sig[i]):  # cond_stack all have part_wid == 1
            # slice computation should just be c.layout.part_slice(i)
            start = c.layout.part_wid * i
            s = slice(start, start + c.layout.part_wid)
            m.d.comb += c.sig[s].eq(rhs.sig[s])
    # Case.exit
    cond_stack.pop()
    # Switch.exit
    switch_value_stack.pop()
    # If.exit
    else_cond = ~cond_stack.pop()
    else_cond &= cond_stack[-1]
    # Elif.enter
    if_cond = else_cond & (b == 1).bool()
    cond_stack.append(if_cond)
    # Elif body
    # assign e.eq(f)
    rhs = f.cast_to(e.layout)
    for i in range(part_count):
        with m.If(cond_stack[-1].sig[i]):  # cond_stack all have part_wid == 1
            # slice computation should just be e.layout.part_slice(i)
            start = e.layout.part_wid * i
            s = slice(start, start + e.layout.part_wid)
            m.d.sync += e.sig[s].eq(rhs.sig[s])
    # Elif.exit
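the `deduped` decorator from comment #36 is self-contained enough to
exercise on its own. a sketch with a toy expression class (a trivial
`Expr` standing in for PartitionedSignal) shows the identity-based
caching returning the *same* object when the same expression is built
twice:

```python
def deduped(f):
    """cache results keyed on argument identity, so building the
    same expression twice returns the identical object"""
    cache = {}
    def wrapper(*args, **kwargs):
        key = (tuple(id(a) for a in args),
               tuple((k, id(v)) for k, v in sorted(kwargs.items())))
        if key in cache:
            return cache[key][0]
        retval = f(*args, **kwargs)
        # keep references to args/kwargs so their ids can't be recycled
        # for different objects while the cache entry is alive
        cache[key] = retval, args, kwargs
        return retval
    return wrapper

class Expr:  # toy stand-in for PartitionedSignal
    @deduped
    def __add__(self, other):
        return ("add", self, other)  # pretend this builds an AST node
```

usage: given `a = Expr(); b = Expr()`, evaluating `a + b` twice yields
the identical object (`is`-equal), while `a + Expr()` with a fresh
operand builds a new node.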
(In reply to Jacob Lifshay from comment #36) > # assign e.eq(f) > rhs = f.cast_to(e.layout) > for i in range(part_count): > with m.If(cond_stack[-1].sig[i]): # cond_stack all have part_wid == 1 > # slice computation should just be e.layout.part_slice(i) > start = e.layout.part_wid * i > s = slice(start, start + e.layout.part_wid) > m.d.sync += e.sig[s].eq(rhs.sig[s]) > # Elif.exit This works because, for each lane that is enabled, all parts are assigned due to all bits in the condition for that lane being identical due to being a sign-extended bool. padding being assigned along with their corresponding lanes isn't a problem.
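the part-wise assignment described in the two comments above can be
modelled with plain ints, one condition bit per `part_wid`-wide part
(the function name is hypothetical; ints here simulate the signals):

```python
def masked_part_assign(lhs, rhs, cond_bits, part_wid, part_count):
    """model of the per-part conditional assignment: each part of lhs is
    replaced by the matching part of rhs when its condition bit is 1.
    because a sign-extended lane condition sets *all* part-bits of an
    enabled lane, whole lanes (padding included) get assigned together."""
    result = lhs
    for i in range(part_count):
        if (cond_bits >> i) & 1:
            # mask selecting part i (bits i*part_wid .. (i+1)*part_wid-1)
            mask = ((1 << part_wid) - 1) << (i * part_wid)
            result = (result & ~mask) | (rhs & mask)
    return result
```

for elwid=16 lanes of False, True, False, True, the condition value from
comment #36 sets parts 2, 3, 6 and 7 (0b11001100), so both parts of each
enabled 16-bit lane are copied as a unit.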
(In reply to Jacob Lifshay from comment #35)
> I don't care how SIMT is traditionally implemented in a GPU, that's totally
> irrelevant and not what I intended.

yeah i understood that. i think you likely meant "Packed SIMD"

> What I meant is that our HDL would be
> written like how SIMT is used from a game programmer's perspective

ok. yes. this is more like SIMD programming.

> what would actually run is:
>     vec_with_xlen_lanes_t a, b, c, d;
>     ...
>     for(lane_t lane : currently_active_lanes()) { // C++11 for-each
>         xlen_int_t p = c[lane] * d[lane];
>         a[lane] = b[lane] + p;
>     }

yep. definitely SIMD (at the backend), which is what we're discussing:
the low-level hardware which is behind "lanes".

> If we want any chance of matching the spec pseudo-code, we will need
> xlen-ification of our HDL in some form, since currently it's hardwired for
> only xlen=64.

right. ok. so here, temporarily (i quite like the self.XLEN idea but it
is a lot of work to deploy), to *not* have to do that deployment
immediately, the trick that can be played is: set the SIMD ALU width to
*exactly* the same width as the current scalar width, i.e. 64. ta-daa,
problem goes away.

is it a cheat? yes. will it work? yes. will it save time? yes. will it
minimise code-changes? yes. will it do absolutely everything? mmmm
probably not, but that can be evaluated case by case (incremental patches).

in other words: by defining the SIMD ALU width exactly equal to the
current scalar ALU width, this makes what you would like to see (and
erroneously called SIMT) exactly what i have been planning for 18 months.

originally, if you recall (going back 2+ years), the idea was to *split*
the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic.
but that was actually more about the regfiles, a HI32 regfile and a LO32
regfile. this melted my brain and for practical time reasons it's all
64 bit.

> Why not make it look more like the pseudo-code with arithmetic on an XLEN
> constant?
> I could write the class needed for XLEN in a day, it's not that
> complicated:

yehyeh. my feeling is, this class should derive from Shape. in fact it
may have to (i looked at ast.py and there are assumptions that the first
argument of e.g. Signal is a Shape() or will be converted to one).

by deriving *from* Shape() a self.XLEN can do the division etc. the
self.XLEN//2 etc. that we saw were needed in the pseudocode will almost
certainly also be needed in the HDL as well, but they would create or
carry an appropriate elwidth set as well.

the calculation in each of the operator overloads then is based on
self.width which comes from Shape.width.

i am not entirely certain of all the details here so kinda would like to
defer it as long as needed, and go with an "on-demand" approach here.

> XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

i like it: you are however forgetting that the ALU width also has to be
one of the parameters. this is a hard requirement. i explained it in
terms of the layout() function returning different widths based on its
elwidth parameters, but you haven't yet taken on board the significance
of the (long) chain-of-logic that goes back to PartitionedSignal
internals.

if the width varies it creates a massive cost in gate terms by having to
Mux e.g. 22 bit PartitionedSignals onto 64 bit PartitionedSignals. if
the width does not vary then those are straight wires. yes there will be
blank unused partitions but i worked out how to deal with that.

by analysing the dict of elwidth->binary mask values and working out
those unused partition pieces, you can actually get all of the submodules
behind the various operators to not even bother allocating gates for
that piece. example:

* PartitionedEq works by creating a set of eqs per piece then *combining*
  those pieces together depending on the PPoints
* if we KNOW that some of those pieces will never be used then just don't
  do the sub-eqs for those pieces. ta-daa, fewer gates all round.
but for this to work the layout() function *has* to have the width
parameter as input, hence the assertions, and that's perfectly fine. we
have to prioritise gate count, here, not programmer convenience.

> # layouts really should be a subclass or wrapper over SimdMap
> # with Shapes as values, but lkcl insisted...
> def layout(elwid, part_counts, lane_shapes):
>     lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values

still doesn't have width, which is a hard requirement.

> # now the following works, because PartitionedSignal uses SimdMap on
> # inputs for shapes, slicing, etc.
>
> # example definition for addg6s, basically directly
> # translating pseudo-code to nmigen+simd.
> # intentionally not using standard ALU interface, for ease of exposition:
> class AddG6s(Elaboratable):
>     def __init__(self):
>         with simd_scope(self, IntElWid, make_elwid_attr=True):
>             self.RA = PartitionedSignal(XLEN)
>             self.RB = PartitionedSignal(XLEN)
>             self.RT = PartitionedSignal(XLEN)

it will have to be

    with simdscope(self.pspec, m, self, ....) as ss:
        self.RA = ss.Signal(pspec.XLEN)

which allows simdscope to "pick up" the dynamic runtime compile switch
between scalar and simd, but yes.

my only concern is that even flipping all the HDL over to substitute XLEN
for 64 throughout all 12 pipelines is a big frickin job, at least 10-14
days. it's not something to be taken lightly, and *nothing* else can take
place during that time.

this is really important. once committed to big regular patterned changes
like that, they *have* to be seen through to the end and VERY strictly
under no circumstances mixed with any other code changes. the individual
pipeline tests as well as test_issuer will all have to be kept up to date.

hence why it is a lot of work: we are talking *TWENTY FIVE OR MORE* unit
tests, 12 pipelines, and 5 to 8 ancillary files associated with the core
and with regfiles.
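the "don't allocate gates for pieces that can never be used" idea can be
sketched from the elwidth->mask dict: a partition point that is set in
*every* elwidth's mask never combines its neighbouring pieces (so the
combiner can be dropped), and one that is clear in every mask always
combines them (so the per-point mux can be dropped). an illustrative
helper, not the real submodule API:

```python
def constant_points(masks, num_points):
    """given {elwidth: partition-point mask}, return (always_set,
    always_clear): the point indices that are identical across all
    elwidth settings and so need no dynamic gating at all."""
    always_set = (1 << num_points) - 1
    always_clear = (1 << num_points) - 1
    for mask in masks.values():
        always_set &= mask        # stays 1 only if set in every mask
        always_clear &= ~mask     # stays 1 only if clear in every mask
    set_idx = {i for i in range(num_points) if (always_set >> i) & 1}
    clear_idx = {i for i in range(num_points) if (always_clear >> i) & 1}
    return set_idx, clear_idx
```

for example with masks {0: 0b110, 1: 0b010} over three points, point 1
is always partitioned and point 0 never is; only point 2 needs a real
runtime mux.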
jacob i am catching up, i am recovering from chronic pain yesterday and had to do the OpenPOWER Course video today.
(In reply to Jacob Lifshay from comment #36)
> only if you never run `opt`...expression deduplication is a pretty trivial
> compiler optimization. In my experience, yosys deduplicates built-in
> expressions just fine.

not on TestIssuer or other complex designs it doesn't

> > it was the submodules.
>
> we can deduplicate expressions ourselves pretty easily:

ooo niiiice, that's really valuable and exciting. i bet a similar trick
would be possible to deploy for autocreating intermediary temporary
Signals. can you drop this into nmutil?

> > the next problem is: how do you now reduce the number of Case() statements
> > needed to four, or, more to the point, on doing so which bits of data do
> > you copy from which partition, without the prerequisite information?
>
> easy,

it was a rhetorical-ish question as part of the reasoning chain, but hey :)
let's go through it.

> each `part_wid` wide bit slice in any PartitionedSignal can be
> independently muxed -- as long as the condition is first converted to the
> layout:
>     {
>         ElWid.I8: 1,
>         ElWid.I16: 1,
>         ElWid.I32: 1,
>         ElWid.I64: 1,
>     }
> with padding being a sign-extended version of each lane... Aka. the result
> of PartitionedSignal.bool(). (or what *should* be the result if design were
> left to me).

ah. ok, so with so much code already having been written, and it taking 5
months, and you weren't contributing at the time, we have to be pragmatic
and fit with the existing design (and, it's simply good engineering
practice to finish something first rather than discard it 90% of the way
to completion).

redesigning the entire suite of submodules (12 and counting) to fit
elwidth is both unrealistic and undesirable, but there *is* a way to
retrofit by having an adapter - layout() - create the values that the
PartPoints mask *should* be set to for any given elwidth. in this way we
do *not* have to discard 5 months of work.
the example you gave, i apologise i couldn't follow it, it appears to be a different design paradigm which is not the focus of the discussion. tomorrow i will drop the layout() function into the mix and i think i have enough of a handle on things to convert to elwidth. however because it would involve a massive duplication of the test suite, or a disruptive alteration of the unit tests, i'll do it as a separate class that inherits from PartitionedSignal, and think about how to go from there
(In reply to Jacob Lifshay from comment #37) > This works because, for each lane that is enabled, all parts are assigned > due to all bits in the condition for that lane being identical due to being > a sign-extended bool. padding being assigned along with their corresponding > lanes isn't a problem. as a general rule of software engineering, "elegant" tricks either need to be explicitly documented, or simply avoided altogether. the reason is real simple: they're unmaintainable by the average engineer, who doesn't dare touch it for fear of breaking something.
(In reply to Luke Kenneth Casson Leighton from comment #38) > (In reply to Jacob Lifshay from comment #35) > > > I don't care how SIMT is traditionally implemented in a GPU, that's totally > > irrelevant and not what I intended. > > yeah i understood that. i think you likely meant "Packed SIMD" no, I meant SIMT -- explicitly where the programmer writes scalar code and the code is auto-vectorized to a GPU-driver-chosen lane count behind-the-scenes, where there's one instance of the scalar code for each vector lane -- like posix threads in that you can have one instance of the scalar code per thread, and they all run in parallel. In our case, that lane count is determined by the scoped globals. "Packed SIMD" the programmer still has to choose explicit SIMD register widths, explicit masking/predication, explicit SIMD types, etc. > > What I meant is that our HDL would be > > written like how SIMT is used from a game programmer's perspective > > ok. yes. this is more like SIMD programming. not really, it's just somewhat confusing because game programmers often use explicit short vectors too -- kinda orthogonally to the SIMT vectors. > > > what would actually run is: > > vec_with_xlen_lanes_t a, b, c, d; > > ... > > for(lane_t lane : currently_active_lanes()) { // C++11 for-each > > xlen_int_t p = c[lane] * d[lane]; > > a[lane] = b[lane] + p; > > } > > yep. definitely SIMD (at the backend) which is what we're discussing: > the low-level hardware which is behind "lanes". yes, but critically, the vectorization into SIMD step can be skipped (either by setting lane counts to always be one (by setting part_counts={ElWid.I64: 1}), or by bypassing our SIMD operators completely and generating nmigen operators instead) and you end up with the original scalar code where, hey, it uses XLEN everywhere...just tell it to just use the component of XLEN where elwid=64. 
This allows us to convert everything one module at a time, since SIMT-ified code can easily be used by non-SIMT code by having the globals indicate elwid=64 and part_counts={ElWid.I64: 1}. > > > > If we want any chance of matching the spec pseudo-code, we will need > > xlen-ification of our HDL in some form, since currently it's hardwired for > > only xlen=64. > > right. ok. so here, temporarily (i quite like the self.XLEN idea but it > is a lot of work to deploy), to *not* have to do that deployment immediately, > the trick that can be played is: > > set the SIMD ALU width to *exactly* the same width as the current scalar > width i.e. 64. so, basically you want all code to magically replace 64 with XLEN, and you try to justify that by saying 64 is the SIMD width... I say, that justification is additional unnecessary complication and mental burden, if we take that approach we should still have 64-replaced-with-XLEN be the *lane* width, not the SIMD width. I am taking SIMT seriously and we should write our code such that the SIMD width should essentially not affect how any of our ALU HDL is written. If it turns out that the circuits are smaller if we choose a fixed SIMD width, then that can be handled by the layout algorithm (where that kind of decision belongs) and passed in through another scoped global, *not* by conflating lane-width=64 with SIMD-width=64. That allows us to: 1. set simd-width=128 if we decide 64-bit is too small. 2. set simd-width=32 and limit to max elwid of 32 for a small embedded cpu. 3. set simd-width to some more complex formula based on requested lane sizes so we don't outright fail where our code needs lanes with more than 64-bits, such as int/fp multiplication, or addg6s. I think the magic replace 64 with XLEN *could* work, but just cuz it can work doesn't mean it's a good idea, it makes our code much harder to understand, and it causes issues where we *really need* lane size = 64 not XLEN (such as in the bcd<->dpd instructions). 
Textually replacing 64 with XLEN when we replace Signal with PartitionedSignal isn't much harder, just one more search-replace-and-check-all-replacements -- that works just as well as the magically replace 64 could possibly work, and it leaves the code waay more understandable. > will it save time? yes. yes, but not much. > will it minimise code-changes? yes. yes, but imho code understandability is faar more important than tiny git diffs. think of it as just a refactoring, big diffs don't mean it's that hard to understand. > will it do absolutely everything? mmmm probably not but that > can be evaluated case by case (incremental patches) > > in other words: by defining the SIMD ALU width exactly equal to the > current scalar ALU width this makes what you would like to see (and > called erroneously SIMT) exactly what i have been planning for 18 months. > > originally if you recall (going back 2+ years) the idea was to *split* > the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic. > but that was actually more about the regfiles, a HI32 regfile and a LO32 > regfile. > > this melted my brain and for practical time reasons it's all 64 bit. > > > > Why not make it look more like the pseudo-code with arithmetic on an XLEN > > constant? I could write the class needed for XLEN in a day, it's not that > > complicated: > > yehyeh. my feeling is, this class should derive from Shape. in fact it > may have to (i looked at ast.py and there are assumptions that the > first argument of e.g. Signal is a Shape() or will be converted to one. I think it should *specifically not* derive from Shape, because it's more general than just Shapes -- it can have values that are: * ints -- how it's used in XLEN * patterns -- for .matches() and Case() * Values -- for when we're converting from scalars to SIMD and which scalar we use depends on elwid. * Shapes -- input to layout() * slices -- bit slicing * more > by deriving *from* Shape() a self.XLEN can do the division etc. 
that has nothing to do with deriving from Shape. > self.XLEN//2 > etc that we saw were needed in the pseudocode, they will almost certainly > also be needed in the HDL as well, but they would create or carry an > appropriate > elwidth set as well. Well, how i envisioned it is SimdMap has a separate value for each possible elwid, allowing us to calculate a different value for each possible elwid -- implemented internally using a dict, hence the Map in SimdMap. The way you're thinking of likely won't actually work since elwid is a Signal, and you can't use a Signal as a Shape. If you base it on .width, then you've locked yourself into only being able to calculate the value for one specific elwid value, the one that matches .width. > > the calculation in each of the operator overloads then is based on self.width > which comes from Shape.width. > > i am not entirely certain of all the details here so kinda would like to > defer it as long as needed, and go with an "on-demand" approach here. > > > XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64}) > > i like it: you are however forgetting that the ALU width also has to be one > of the parameters. that's passed in via the scope globals, not via a SimdMap. > this is a hard requirement. i can see how varying SIMD-width could need more gates, i think that should be handled in layout via scope globals if necessary, not via messing up SimdMap. SimdMap is purely a map from elwidths to Python values. The type that *would* have SIMD-width, is SimdLayout, which layout() should be morphed into the __init__ for. SimdLayout could be a Shape, but I think it's likely better to have a .simd_shape() or just .shape() method... not deriving from Shape avoids possible confusion between the full-SIMD-width and the lane width. There is a helper dataclass: SimdLane. it contains an ElWid enumerant and also a lane index or part index or something. 
The idea is that a SimdLane instance specifies a particular lane in a SimdSignal/SimdLayout. SimdLayout would have fields, properties, or whatever, for all of: * lane_shapes -- either a dict or a SimdMap, tbd. * padding_kind -- is the padding a sign-extended version of the corresponding lane? is it zeros? ones? Repl(lane_bits, ...)? other? Defaults to other. * bits_per_part -- each lane is made of 1 or more parts (determined by part_counts), all parts are always the same bit-width. * signed -- signedness of all lanes * simd_shape -- could always be unsigned, since signedness doesn't really matter here. * part_counts -- number of parts per lane for each elwid, determines which elwid are supported. This, along with all the other __init__ parameters (excluding lane_shapes and padding_kind) should really be tossed into one globals object (it's SimdPartMode in simd_signal) that is the scoped global's value. * other parameters that determine layout. methods: * get_lane_start(lane) -- returns the starting bit index of `lane` in the SIMD Signal. lane is a SimdLane. * get_lane_width(lane, include_padding=False) -- returns the bit width of `lane` in the SIMD Signal. lane is a SimdLane. * get_lane_slice(lane, include_padding=False) -- returns a slice() for the bits of `lane` in the SIMD Signal. lane is a SimdLane. * lanes() -- gets all `SimdLane`s for this layout. * lanes_for_elwidth(elwid) -- gets all `SimdLane`s for this layout for a particular ElWid enumerant. * elwidths() -- gets all ElWid values supported by this layout ... essentially part_counts.keys() * more... > > # example definition for addg6s, basically directly > > # translating pseudo-code to nmigen+simd. 
> > # intentionally not using standard ALU interface, for ease of exposition: > > class AddG6s(Elaboratable): > > def __init__(self): > > with simd_scope(self, IntElWid, make_elwid_attr=True): > > self.RA = PartitionedSignal(XLEN) > > self.RB = PartitionedSignal(XLEN) > > self.RT = PartitionedSignal(XLEN) > > it will have to be > > with simdscope(self.pspec, m, self, ....) as ss: > self.RA = ss.Signal(pspec.XLEN) ooh, lookie, simdscope gets self as an argument ... it can read self.pspec itself! XLEN should be a global constant imho. > which allows simdscope to "pick up" the dynamic runtime compile switch > between scalar and simd, but yes. > > my only concern is that even flipping all the HDL over to substitute XLEN > for 64 throughout all 12 pipelines is a big frickin job, at least 10-14 days. it happens at the same time as our replacing Signal with PartitionedSignal, so it won't be much more work, imho. > it's not something to be taken lightly, true. > and *nothing* else can take place > during that time. false...luckily, if the globals set part_counts={ElWid.I64: 1}, then our modules should be nearly 100% compatible with not-yet-SIMT-ified modules, meaning that we can do it one module at a time. > the individual pipeline tests as well as test_issuer will all have to be > kept uptodate. > > hence why it is a lot of work because we are talking *TWENTY FIVE OR MORE* > unit tests, 12 pipelines, and 5 to 8 ancillary files associated with the > core and with regfiles. yes, I've done similarly-sized refactors on my own code before. they took several days.
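the lane-slice arithmetic from the SimdLayout method list above reduces
to a small formula once every part is `bits_per_part` wide and
part_counts gives the parts-per-lane per elwid. a sketch of the
proposed-but-unwritten API (names follow the discussion, nothing here is
implemented yet):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimdLane:
    """identifies one lane: which elwid setting, and the lane's index"""
    elwid: int
    lane_index: int

def get_lane_slice(lane, part_counts, bits_per_part):
    """bit-slice of `lane` in the SIMD signal, padding included,
    assuming each lane at a given elwid spans
    part_counts[elwid] * bits_per_part contiguous bits"""
    width = part_counts[lane.elwid] * bits_per_part
    start = lane.lane_index * width
    return slice(start, start + width)
```

e.g. with bits_per_part=8 and part_counts={16: 2} (the 16-bit elwid uses
two parts per lane), lane 1 occupies bits 16..31; get_lane_start and
get_lane_width fall out as the slice's start and length.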
(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #36)
>
> > only if you never run `opt`...expression deduplication is a pretty trivial
> > compiler optimization. In my experience, yosys deduplicates built-in
> > expressions just fine.
>
> not on TestIssuer or other complex designs it doesn't

hmm, it probably only works within each module, not across modules.

> > > it was the submodules.
> >
> > we can deduplicate expressions ourselves pretty easily:
>
> ooo niiiice, that's really valuable and exciting.
>
> i bet a similar trick would be possible to deploy for autocreating
> intermediary temporary Signals.
>
> can you drop this into nmutil?

sure! Python's GC just had a heart-attack at the sight of it, but oh well. Good thing we don't have long-running programs that need PartitionedSignal.

> > > the next problem is: how do you now reduce the number of Case() statements
> > > needed to four, or, more to the point, on doing so which bits of data do
> > > you copy from which partition, without the prerequisite information?
> >
> > easy,
>
> it was a rhetorical-ish question as part of the reasoning chain, but hey :)
> let's go through it.
>
> > each `part_wid` wide bit slice in any PartitionedSignal can be
> > independently muxed -- as long as the condition is first converted to the
> > layout:
> > {
> >     ElWid.I8: 1,
> >     ElWid.I16: 1,
> >     ElWid.I32: 1,
> >     ElWid.I64: 1,
> > }
> > with padding being a sign-extended version of each lane... Aka. the result
> > of PartitionedSignal.bool(). (or what *should* be the result if design were
> > left to me).
>
> ah.
> ok, so with so much code already having been written, and it taking 5
> months, and you weren't contributing at the time, we have to be pragmatic
> and fit with the existing design (and, it's simply good engineering
> practice to finish something first rather than discard it 90% of the
> way to completion)

Luckily, the exact bit layout resulting from bool matches the layout I wanted; all that needs to happen is to declare all but each lane's lsb as sign-extended padding bits by replacing the result's layout. no additional gates needed.

> redesigning the entire suite of submodules (12 and counting) to fit elwidth
> is both unrealistic and undesirable, but there *is* a way to retrofit by
> having an adapter - layout() - create the values that the PartPoints
> mask *should* be set to for any given elwidth.

oh, yeah, part_points is another member needed for SimdLayout. oops, forgot that.

> in this way we do *not* have to discard 5 months of work.
>
> the example you gave, i apologise i couldn't follow it, it appears
> to be a different design paradigm which is not the focus of the
> discussion.

oh, it is how to handle m.If/Switch/Case with part-wise assignment. it uses the approach of translating all but the final assignment to just data-flow ahead-of-time, letting yosys handle the final assignment translation.

> tomorrow i will drop the layout() function into the mix and i think
> i have enough of a handle on things to convert to elwidth.

oh, i can do that...i'll convert PartitionedSignal + unit tests to use SimdLayout. I'd guess it'll take me the rest of today. I'll be busy all weekend, so hopefully that won't be too much of a problem...
> however because it would involve a massive duplication of the
> test suite, or a disruptive alteration of the unit tests, i'll do it
> as a separate class that inherits from PartitionedSignal, and
> think about how to go from there

iirc nothing actually uses PartitionedSignal yet, so an API break to using SimdLayout shouldn't be as big a problem.
(In reply to Jacob Lifshay from comment #43) > (In reply to Luke Kenneth Casson Leighton from comment #40) > > tomorrow i will drop the layout() function into the mix and i think > > i have enough of a handle on things to convert to elwidth. > > oh, i can do that...i'll convert PartitionedSignal + unit tests to use > SimdLayout. I'd guess it'll take me the rest of today. I'll be busy all > weekend, so hopefully that won't be too much of a problem... I'll push it to a separate branch to avoid breaking master if I don't finish before the weekend and to allow more review.
(In reply to Jacob Lifshay from comment #43)

> oh, i can do that...i'll convert PartitionedSignal + unit tests to use
> SimdLayout.

whoawhoa hang on, only do that if you're happy that SimdLayout should derive from ast.Shape, that layout() should take a dict *only* (not have signed/unsigned because, obviously, by deriving from ast.Shape() SimdLayout *already has* a signed member), and *only* if you can guarantee to have it 100% completed within one day.

the reason is because if not i have to completely stop all ongoing work on PartitionedSignal because it becomes too much to merge new submodules into the altered code. (i am currently investigating how to add __getattr__, Part, and Slice, all of which are interrelated)

welcome to planning on large complex projects, sigh.
(In reply to Jacob Lifshay from comment #44)

> I'll push it to a separate branch to avoid breaking master if I don't finish
> before the weekend and to allow more review.

can i suggest first please do an ElwidthPartitionedSignal that derives from PartitionedSignal, and use the layout() function as an adapter between the two. perhaps write a simple *small* proof-of-concept / demo unit test to confirm functionality

this will be a hell of a lot less disruptive to ongoing work on PartitionedSignal and will allow some scheduling shenanigans to take place
hit send too soon, doh. from comment #34 the function at the end effectively does this (actually this can be its unit test):

    ppoints, pdict = layout(64, elwidth, ...)
    for ew in range(4):
        yield elwidth.eq(ew)
        maskvals = yield Cat(ppoints.mask.values())
        assert maskvals == pdict[ew]

this is quite crucial
(In reply to Jacob Lifshay from comment #43)

> oh, it is how to handle m.If/Switch/Case with part-wise assignment. it uses
> the approach of translating all but the final assignment to just data-flow
> ahead-of-time, letting yosys handle the final assignment translation.

ok i suspected you might have jumped context, thank you for confirming. i was referring to an internal use of a Type 2 (dsl.Module) Switch for a submodule, *not* to the implementation *of* PartitionedSignal.__Switch__

the strategy there for PartitionedSignal.__Switch__ is, well, quite mindbending, and also out of scope for this bugreport. for completeness however here is the strategy, which should be discussed there, not here https://bugs.libre-soc.org/show_bug.cgi?id=458#c10

> iirc nothing actually uses PartitionedSignal yet, so a API break to using
> SimdLayout shouldn't be as big a problem.

except that there may be good reasons why other people (other customers whom i cannot reveal or discuss because it is commercially confidential) might wish to use PartitionedSignal *as it is*, i.e. who may not wish to use Elwidths, or who may wish to define their own adapter or class on *top* of PartitionedSignal, or who may wish to have *all* the mask points available, not the subset as dictated by Elwidths.

basically i am saying it is too early to go destroying the existing PartitionedSignal API and wholesale converting all of the unit tests to ElwidthPartitionedSignal. one *small* step at a time.
(In reply to Jacob Lifshay from comment #43)
> (In reply to Luke Kenneth Casson Leighton from comment #40)
> > (In reply to Jacob Lifshay from comment #36)
> > > we can deduplicate expressions ourselves pretty easily:
> >
> > ooo niiiice, that's really valuable and exciting.
> >
> > i bet a similar trick would be possible to deploy for autocreating
> > intermediary temporary Signals.
> >
> > can you drop this into nmutil?
>
> sure! Python's GC just had a heart-attack at the sight of it, but oh well.
> Good thing we don't have long-running programs that need PartitionedSignal.

Added: https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55

I also added support for having globals go into the key, since we need to include the scoped globals there:

    my_global = 42

    @deduped(global_keys=[lambda: my_global])
    def fn_with_global(a, *, b=1):
        ...

so we could use it like:

    class PartitionedSignal(...):
        ...
        @deduped(global_keys=[get_simd_scope_values])
        def __add__(self, rhs):
            ...
        ...

though to improve deduping, we may want to do:

    class PartitionedSignal(...):
        ...
        @staticmethod
        @deduped(global_keys=[get_simd_scope_values])
        def _do_add_impl(*args):
            # put __add__ and __radd__'s guts here

        @staticmethod
        def _do_add(*args):
            args = list(args)
            # consistent argument order to handle commutativity
            args.sort(key=id)
            return PartitionedSignal._do_add_impl(*args)

        def __add__(self, other):
            return PartitionedSignal._do_add(self, other)

        def __radd__(self, other):
            return PartitionedSignal._do_add(other, self)
        ...

I also added support for opportunistically using weakref.
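To make the "globals go into the key" idea concrete, here is a minimal memoisation sketch of that mechanism. This is *not* the actual nmutil `@deduped` implementation (which also handles weakrefs and more); it only demonstrates how `global_keys` callables fold scoped-global values into the cache key.

```python
# Minimal sketch of the idea behind @deduped(global_keys=...), NOT the
# real nmutil implementation: the cache key includes the current values
# returned by each global_keys callable, so results are only shared
# within one scoped-globals setting.

def deduped(global_keys=()):
    def decorator(fn):
        cache = {}
        def wrapper(*args, **kwargs):
            # args/kwargs plus a snapshot of the scoped globals form the key
            key = (args, tuple(sorted(kwargs.items())),
                   tuple(k() for k in global_keys))
            if key not in cache:
                cache[key] = fn(*args, **kwargs)
            return cache[key]
        return wrapper
    return decorator


my_global = 42
calls = []  # record actual (non-cached) invocations

@deduped(global_keys=[lambda: my_global])
def fn_with_global(a, *, b=1):
    calls.append((a, b))
    return a + b + my_global

assert fn_with_global(1) == fn_with_global(1)  # second call hits the cache
assert len(calls) == 1
```

In the real decorator a change to the scoped global (e.g. a different part_counts) produces a different key tuple, so expressions built under different SIMD scopes are never wrongly shared.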
(In reply to Jacob Lifshay from comment #49) > Added: > https://git.libre-soc.org/?p=nmutil.git;a=commitdiff; > h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55 nice. that's really well documented. can you raise a separate bugreport about it and vf here and also 721(?) so they both can get a budget?
(In reply to Luke Kenneth Casson Leighton from comment #48)
> (In reply to Jacob Lifshay from comment #43)
> > iirc nothing actually uses PartitionedSignal yet, so a API break to using
> > SimdLayout shouldn't be as big a problem.
>
> except that there may be good reasons why other people (other customers
> whom i cannot reveal or discuss because it is commercially confidential)

can you tell me privately? i was not aware of anyone using it... I'm kinda inclined to say "you decided to use unstable not even released software, you reap the consequences of said software changing", but I recognize that that's probably not helpful...

> who might wish to use PartitionedSignal *as it is* i.e. who may not
> wish to use Elwidths or who may wish to define their own
> adapter or class on *top* of PartitionedSignal, or whom may
> wish to have *all* the mask points available not the subset as
> fictated by Elwidths.
>
> basically i am saying it is too early to go destroying the
> existing PartitionedSignal API and wholesale converting all
> of the unit tests to ElwidthPartitionedSignal.

Well, if we do work in a separate branch, we can see if the design will pan out. if not, we can use pieces of the code in PartitionedSignal 3.0 (since simd_signal is kinda PartitionedSignal 2.0.0.alpha).

I think copying the existing PartitionedSignal and naming the copy ElwidthSimdSignal is better than having it be a subclass...subclassing will make a bunch of messy unnecessary interactions. Also, naming it Simd is waay shorter than Partitioned.

I did kinda run out of time today with getting @deduped to work, so will put off writing SimdLayout till next week. Since I'm the one pushing for it, I'll handle the work needed to integrate it.
(In reply to Luke Kenneth Casson Leighton from comment #50) > (In reply to Jacob Lifshay from comment #49) > > > Added: > > https://git.libre-soc.org/?p=nmutil.git;a=commitdiff; > > h=fbb284ecf93ead5d748f9c19e5b1b899a06c6b55 > > nice. that's really well documented. > > can you raise a separate bugreport about it and vf here and also 721(?) > so they both can get a budget? vf??? idk what that means... I'll let you add the appropriate budget, see alsos, etc. @deduped bug: https://bugs.libre-soc.org/show_bug.cgi?id=722
(In reply to Jacob Lifshay from comment #51)

> I think copying the existing PartitionedSignal and naming the copy
> ElwidthSimdSignal is better than having it be a subclass...subclassing will
> make a bunch of messy unnecessary interactions.

it's surprisingly small (and regular), needing only 3 functions/properties

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_repl/repl.py;h=364b7721c9ec896e9e5d0cbf133404f3631ad059;hb=HEAD#l90

1) the thing to switch on (line 90)
2) the list of values to create cases from (line 92)
3) (as an optimisation) a mask identifying padding-only lanes, stopping partial computations in padding lanes from taking place.

note 1 and 2 are *not* PartitionedSignal.__Switch__, they are actual (scalar) Module m.Switch m.Case.

lines 90-92 in repl.py would become something like:

    with m.Switch(self.something.get_selector()):
        for pbit in self.something.get_allowed_cases():

that is *literally* the extent of the modifications required to allow PartitionedSignal submodules/subfunctions to adapt.

> Also, naming it Simd is waay
> shorter than Partitioned.

yeah i'll do a (big) rename in the morning. SimdSignal?
(In reply to Jacob Lifshay from comment #52)

> vf??? idk what that means...

it means it was late, i was tired, and hit totally the wrong letters :) hilariously i think it was supposed to be the word "by"

(In reply to Jacob Lifshay from comment #51)
> (In reply to Luke Kenneth Casson Leighton from comment #48)
> > except that there may be good reasons why other people (other customers
> > whom i cannot reveal or discuss because it is commercially confidential)
>
> can you tell me privately? i was not aware of anyone using it...

done. for the archives (and if there is an Audit), we should be able to make an announcement soon.

> I'm kinda inclined to say "you decided to use unstable not even released
> software, you reap the consequences of said software changing", but I
> recognize that that's probably not helpful...

fortunately the partsig nmigen branch is a ridiculously small 400 line diff. forward-merging indefinitely is a minor task if we are forced into that situation by intransigent behaviour.

whitequark's objections here were based on lack of knowledge of confidential business discussions underway at the time, combined with input from "community members" who have no experience in Supercomputing Architecture stating "i have absolutely no need for that in *my* code". it allowed her to justify stating "there are clearly no use-cases beyond Libre-SOC. case dismissed", which is as shocking a judgement as it sounds, when summarised.

she also neglected to inform us of the results of a private discussion which reneged on an agreement to follow an RFC process, only informing us months late, right when this is critical path. what looks to her like "trying to force our code onto another project" is down to being backed into a corner and left with no other options to meet our contractual obligations to NGI POINTER in time and within budget. we have to document these things publicly unfortunately because of the transparency obligations we're under.
there was a slashdot article only a couple days ago with the results of a study on the effects of power: power does genuinely corrupt, causing measurable neuro-degenerative effects! the lack of empathy resulting from constantly making autocratic decisions apparently quite literally causes brain damage over a long enough period. on reading the article, it definitely made sense of a lot of behavioural patterns i've seen in so many community projects. quite an eye-opener.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/ i'm writing out notes based on conversations and examples here. it's got big enough, needs some clarity.
(In reply to Luke Kenneth Casson Leighton from comment #53)

> yeah i'll do a (big) rename in the morning. SimdSignal?

realised it's best to wait for you to respond before acting. i am still mulling over how and whether Par...^W SimdSignal can be adapted to support elwidths but also as a default option create full power-2 partitions and cover all permutations currently covered by PartitionedSignal. if so then yes, PartitionedSignal conversion can go ahead.

the decision revolves around whether the (small!) adapter described in comment #53 can "emulate" existing PartitionedSignal behaviour, and i believe the answer is yes by:

1) returning Cat(PartitionPoints.mask.values()) as the switcher
2) returning range(len(PartitionPoints)) as the caser
3) returning *zero* as the "blanker" mask to indicate that there are no dead (padding) partitions.

[edit: the answer's yes]
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=5a1c91609c64eace13bbb46902dba277e318aa9e
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=4bbf1d04380bd2f15f8fc4cfb7537b9144a23f81
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=5a81a6e7a0ae48119097928f15d4ab7a44192afa
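The three rules above can be sketched in pure Python (no nmigen). All the names here (`FullPartAdapter`, `get_switch`, `get_cases`, `blanking_mask`) are illustrative, not the actual adapter API, and the caser here enumerates every combination of the mask bits, which is what covering the full PartitionedSignal permutation set requires.

```python
# Pure-Python sketch (hypothetical names) of an adapter emulating the
# classic PartitionedSignal behaviour per the three rules above.

class FullPartAdapter:
    def __init__(self, mask_bits):
        # mask_bits stands in for PartitionPoints.mask.values();
        # in real HDL these would be 1-bit nmigen Signals
        self.mask_bits = list(mask_bits)

    def get_switch(self):
        # 1) switch on the concatenated partition-point mask bits
        #    (in nmigen: Cat(*mask_bits))
        return self.mask_bits

    def get_cases(self):
        # 2) one case per possible combination of the mask bits
        return range(1 << len(self.mask_bits))

    @property
    def blanking_mask(self):
        # 3) zero: in the classic mode no partition is ever padding-only
        return 0


adapter = FullPartAdapter(["p8", "p16", "p24"])  # 32-bit, 8-bit partitions
assert list(adapter.get_cases()) == list(range(8))
assert adapter.blanking_mask == 0
```

The Elwidth adapter from comment #53 differs only in what these three methods return: the elwid Signal as the switch, the per-elwidth mask values as the cases, and a nonzero blanking mask for padding-only lanes.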
(In reply to Luke Kenneth Casson Leighton from comment #54) > (In reply to Jacob Lifshay from comment #52) > > I'm kinda inclined to say "you decided to use unstable not even released > > software, you reap the consequences of said software changing", but I > > recognize that that's probably not helpful... > > fortunately the partsig nmigen branch is a ridiculously small 400 line diff. I had meant that PartitionedSignal wasn't stable nor released, I wasn't saying anything about nmigen. Next time, if possible, please at least email me, since I probably would have said PartitionedSignal probably isn't ready yet.
(In reply to Luke Kenneth Casson Leighton from comment #56)
> (In reply to Luke Kenneth Casson Leighton from comment #53)
>
> > yeah i'll do a (big) rename in the morning. SimdSignal?

go ahead and rename. I'll figure stuff out on monday.

> realised it's best to wait for you to respond before acting.

:) thx.

> i am still mulling over how and whether Par...^W SimdSignal
> can be adapted to support elwidths but also as a default
> option create full power-2 partitions and cover all
> permutations currently covered by PartitionedSignal.

oh, well all permutations currently covered by PartitionedSignal (at least add/sub/logic-ops) are waay more than just the aligned, power-of-2 lanes.

I have an idea where there's a signal like SimdPartMode (which is: Simd split into abstract "parts", which are combined to create lanes -- 1-bit signal per part determines if a lane starts there), except each part_starts signal is an enum:

    class FpPartStartKind(Enum):
        NotAStart = 4  # only one with bit set, hopefully simplifying logic
        F16 = 1  # adjust to match elwid's definition
        BF16 = 2
        F32 = 3
        F64 = 0

this allows defining lanes like so (128-bit Simd so I can demo more combinations):

    part_starts = [
        FpPartStartKind.F16,        # 1-part F16 lane in part 0
        FpPartStartKind.BF16,       # 1-part BF16 lane in part 1
        FpPartStartKind.F32,        # 2-part F32 lane in parts 2 and 3
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.F64,        # 4-part F64 lane in parts 4, 5, 6 and 7
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.NotAStart,  # same lane as above
        FpPartStartKind.NotAStart,  # same lane as above
    ]

This may take a little more work to decode, and be more wires for part_starts, but it covers everything that both the simd_signal code I wrote and the layout() in comment #20 can do.
it covers everything PartitionPoints can do, too, though we'll probably still want to restrict lanes to be aligned, with power-of-two part counts (since the part-width can be non-power-of-2, it can still save space that way -- e.g. a lane can be 4 parts wide, since 4 is a power-of-2, and parts can be 3-bits wide, allowing the lane to be 12-bits wide since 12=4*3). This layout can be converted to PartitionPoints, by just setting the partition bit i * part_width for each part_starts[i] != NotAStart. simd_signal.SimdPartMode.is_valid can be adapted to create a circuit used to feed into an Assume to get only valid part_starts combinations (checks that lanes are sized correctly, and are aligned correctly).
(In reply to Jacob Lifshay from comment #58) > This layout can be converted to PartitionPoints, by just setting the > partition bit i * part_width for each part_starts[i] != NotAStart. more logic will be required to add partition points to split padding out, the above algorithm produces partitions that include their corresponding padding.
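The part_starts-to-PartitionPoints conversion described above can be sketched as a plain-Python bit-position calculation. As the correction above notes, this keeps padding inside its lane's partition; splitting padding out needs further logic. The enum values are repeated from the earlier comment for self-containment, and the helper name is illustrative.

```python
# Sketch of the conversion: set a partition point at bit i * part_width
# wherever part_starts[i] marks a lane start (position 0 excluded, since
# the signal's lower boundary is not a partition point). Padding stays
# inside its lane's partition -- splitting it out needs more logic.

from enum import Enum

class FpPartStartKind(Enum):  # values as in the comment above
    NotAStart = 4
    F16 = 1
    BF16 = 2
    F32 = 3
    F64 = 0

def part_starts_to_points(part_starts, part_width):
    points = set()
    for i, kind in enumerate(part_starts):
        if i != 0 and kind != FpPartStartKind.NotAStart:
            points.add(i * part_width)  # a new lane begins at this bit
    return sorted(points)

# the 128-bit example from the comment above: F16 | BF16 | F32 | F64
K = FpPartStartKind
starts = [K.F16, K.BF16, K.F32, K.NotAStart,
          K.F64, K.NotAStart, K.NotAStart, K.NotAStart]
assert part_starts_to_points(starts, 16) == [16, 32, 64]
```

In the real design each point would be a 1-bit Signal driven from a comparison of the part_starts enum wires, not a precomputed integer set.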
(In reply to Jacob Lifshay from comment #58)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > i am still mulling over how and whether Par...^W SimdSignal
> > can be adapted to support elwidths but also as a default
> > option create full power-2 partitions and cover all
> > permutations currently covered by PartitionedSignal.
>
> oh, well all permutations currently covered by PartitionedSignal (at least
> add/sub/logic-ops) are waay more than just the aligned, power-of-2 lanes.

yes, i know. that's covered in https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/ and it is covered by the enhancements to the layout() adapter function i describe in comment #53. the example in the (new) wiki page is a cleanup of the example in comment #2

there are actually NINE (nine!) underlying partition points created by that innocuous-looking example. so it looks like we need a completely insane 2^9 (512) permutations, right?

wrong. there are only three in that example.

* elwidth=32
* elwidth=16
* elwidth=8

and that's where the adapter comes into play. it defines what the switch is, it defines the cases.

you DO NOT need a switch taking a 9-bit input
you DO NOT need a massive batch of 512 case statements

i wrote an adapter which covers the existing PartitionedSignal behaviour: please take a look at the patches from comment #56 to understand it. the "Elwidth" adapter is conceptually finished as well, based on the code in comment #53.

> This may take a little more work to decode, and be more wires for
> part_starts, but it covers everything that both the simd_signal code I wrote
> and the layout() in comment #20 can do. it covers everything PartitionPoints
> can do, too,

jacooob: we're *not* throwing away code, we're *adapting* and slowly morphing *existing* code, one incremental step at a time. this is the project's base-line operating protocol

this means that we go from one known-tested to another known-tested without scaring ourselves by taking on too much.
it does mean however that you need to have a full working knowledge of the *existing* design.

> though we'll probably still want to restrict lanes to be
> aligned,

the Elwidth adapter from comment #53 is predicated on the fundamental assumption that all elements start on power-two boundaries, yes.

> simd_signal.SimdPartMode.is_valid can be adapted to create a circuit used to
> feed into an Assume to get only valid part_starts combinations (checks that
> lanes are sized correctly, and are aligned correctly).

you're confusing me with a new design when my entire mindset is focussed on the existing design. if you describe things to me in terms of how the EXISTING design is code-morphed *INTO* the proposed new design - not replaced by - CHANGED INTO - i will be able to understand. feet on the ground, jacob.

rename is done, btw.

    commit df93b8f37bea53917d67e87ce2190981bc2ef667 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Sun Oct 10 11:35:20 2021 +0100

        big rename PartitionedSignal to SimdSignal (shorter)
(In reply to Jacob Lifshay from comment #59) > (In reply to Jacob Lifshay from comment #58) > > This layout can be converted to PartitionPoints, by just setting the > > partition bit i * part_width for each part_starts[i] != NotAStart. > > more logic will be required to add partition points to split padding out, > the above algorithm produces partitions that include their corresponding > padding. that's what the code in comment #53 is there to identify, and the "blanking" mask is the OR of all of the unused (padding) partitions under each case (elwidth=0b00/0b01/0b10/0b11). (actually probably the AND, but hey). the adapter is actually real simple.
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=d90d9ca5a88c3f6adf8fb8d10d369fc89d803382 currently going through picking up bits of code from the layout() experiment and dropping them in here https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;hb=HEAD it's evolving but i'm trying to keep the order the same as the conversation
this adds the phase 3 code from comment #34 https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=f91e3bcfecaa7d980327ff531e559e09513b524a phase 1 and 2 was here https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=95a5db6ed94749ca2079fa572030040b6a0235ee and checking it, from comment #47: https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=01f99ad3388d10f41c784bbc0a2bd4bf1eec777b added *option* to fix the layout width, that's from comment #22: https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=9d2cacebdf7ccaef9307903e9cba9780d766cf5f going well - one incremental change at a time, morphing the code one commit at a time, to incorporate each idea one at a time.
(In reply to Jacob Lifshay from comment #42) > > and *nothing* else can take place > > during that time. > > false...luckily, if the globals sorry, re-reading, you misunderstood: i was talking about a workflow / procedure, not code. during massive regularised code-changes no *other work* can take place (everything grinds to a halt) because to try to mix in other work in the middle of a massive regular systematic batch of code-changes results in code-merges that are too insane to manage. big regular systematic renames are an example.
okaay, so *now* we have something that can act as an adapter (and more like a derivative of Shape())

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=4f185874f40228471434679d1c2ffa1b23f9f3ca;hb=7e61aab66f2e444e83fd564e85dc957072ca1be4

it's also got code-comments explaining what's going on. jacob: i *deduced* that you added the ability to specify the same width for all partitions, after the fact:

    32     if not isinstance(lane_shapes, Mapping):
    33         lane_shapes = {i: lane_shapes for i in part_counts}

i was not able when initially reading the code from comment #20 to work that out. i therefore added three comment lines:

    29     # identify if the lane_shapes is a mapping (dict, etc.)
    30     # if not, then assume that it is an integer (width) that
    31     # needs to be requested across all partitions

you can see where it's going, how the adapter fits:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/partsig.py;h=bbad089360a070304dc2c5f1d38dfb29e6c56a6c;hb=7e61aab66f2e444e83fd564e85dc957072ca1be4#l84

in the ElwidthPartType adapter, the switch is on the elwidth and the cases is the bitp dictionary. still TODO in layout() is to merge all of the bitp dictionary values to determine which parts of the partition are padding at *all* times.
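The two-line normalisation quoted above (a bare int broadcast to every elwid, a mapping used as-is) can be shown standalone; `normalise_lane_shapes` is an illustrative wrapper name, not a function in layout_experiment.py.

```python
# Standalone demo of lines 32-33 quoted above: if lane_shapes is not a
# mapping, treat it as a single width requested across all partitions.

from collections.abc import Mapping

def normalise_lane_shapes(lane_shapes, part_counts):
    # identify if the lane_shapes is a mapping (dict, etc.)
    # if not, assume it is an integer (width) to be used for all partitions
    if not isinstance(lane_shapes, Mapping):
        lane_shapes = {i: lane_shapes for i in part_counts}
    return lane_shapes

part_counts = {0: 1, 1: 1, 2: 2, 3: 4}  # elwid -> partition count
assert normalise_lane_shapes(24, part_counts) == {0: 24, 1: 24, 2: 24, 3: 24}
assert normalise_lane_shapes({0: 24, 1: 12}, part_counts) == {0: 24, 1: 12}
```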
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=48f1e8c97dd4adc9d9161e362f389fbf70506033

    + # fourth stage: determine which partitions are 100% unused.
    + # these can then be "blanked out"
    + bmask = (1<<len(plist))-1
    + for p in bitp.values():
    +     bmask &= ~p

done, but ironically (because the elwidth=0b11 partition is 4x24-bit) there *are* no unused portions in the example. next stage will be to add an example which does have blank unused padding parts through all elwidths
ah! i realised a couple of things:

1) setting a fixed width needs to be *optional* not mandatory. we need *both* the option to specify that the width is to be per *element* (lane) as a fixed quantity *and* the option to set the overall width

2) and another option to say that, based on the fixed width and based on the partition counts, the element widths should be automatically calculated

    def layout(elwid, part_counts, lane_shapes=None, fixed_width=None):
        if lane_shapes is None:
            lane_shapes = {}
            for p, c in part_counts.items():
                # calculate width of each partition
                lane_shapes[p] = fixed_width // c
            # or would this work?
            # lane_shapes = {i: fixed_width // part_counts[i]
            #                for i in part_counts}
        if not isinstance(lane_shapes, Mapping):
            lane_shapes = {i: lane_shapes for i in part_counts}

of course an assert if both lane_shapes and fixed_width are None.

basically, the actual registers are 64 bit so we *have* to have a way to declare explicit subdivisions of exactly 64 bit. but, for intermediary signals you want to say (such as for shift) with absolute certainty that what you are using to shift is a fixed width for all elwidths.
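Option 2 above (auto-calculating element widths from the fixed overall width and the partition counts) reduces to one dict comprehension; `auto_lane_shapes` is an illustrative name for just that step, extracted from the layout() sketch.

```python
# Sketch of option 2 above: derive each elwidth's element width from
# fixed_width and the partition counts alone (no explicit lane_shapes).

def auto_lane_shapes(part_counts, fixed_width):
    # divide the fixed overall width by the partition count for each elwid
    return {ew: fixed_width // c for ew, c in part_counts.items()}

# e.g. a 64-bit register subdivided per elwid
part_counts = {0: 1, 1: 2, 2: 4, 3: 8}
assert auto_lane_shapes(part_counts, 64) == {0: 64, 1: 32, 2: 16, 3: 8}
```

This makes "exactly 64-bit overall" the invariant, with per-elwidth element widths falling out of it, which is exactly what the register-file case above requires.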
looking through the code, you got what part_counts is mixed up:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=1f0b761a019fae0fd0005a52649c5f5ad080bea2;hb=48f1e8c97dd4adc9d9161e362f389fbf70506033#l94

You wrote:

> 94  # for each element-width (elwidth 0-3) the number of partitions is given
> 95  # elwidth=0b00 QTY 1 partitions: | ? |
> 96  # elwidth=0b01 QTY 1 partitions: | ? |
> 97  # elwidth=0b10 QTY 2 partitions: | ? | ? |
> 98  # elwidth=0b11 QTY 4 partitions: | ? | ? | ? | ? |
> 99  # actual widths of Signals *within* those partitions is given separately
> 100 part_counts = {
> 101     0: 1,
> 102     1: 1,
> 103     2: 2,
> 104     3: 4,
> 105 }

part_counts is actually a dict from elwid values to the number of parts *in a single lane*, *not* the number of parts in the whole SIMD signal.

So, for the ALU's inputs/outputs (for the integer ALUs), part_counts' values are the number of bytes (since a byte is the smallest integer elwid) in each elwid type:

    part_counts = {
        IntElWid.I8: 1,   # i8 takes 1 byte
        IntElWid.I16: 2,  # i16 takes 2 bytes
        IntElWid.I32: 4,  # i32 takes 4 bytes
        IntElWid.I64: 8,  # i64 takes 8 bytes
    }

For the FP ALUs' inputs/outputs, part_counts' values are the number of 16-bit words, since the smallest FP elwid is 16-bits:

    part_counts = {
        FpElWid.F16: 1,   # f16 takes 1 16-bit word
        FpElWid.BF16: 1,  # bf16 takes 1 16-bit word
        FpElWid.F32: 2,   # f32 takes 2 16-bit words
        FpElWid.F64: 4,   # f64 takes 4 16-bit words
    }
reading through what you wrote on the wiki:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/

You appear to want to keep forcing SimdShape into being a Shape, and that causes problems with SimdShape's current design...

SimdShape can't be an instance of Shape as you are trying to do, because Shape.width is the width of a nmigen scalar Signal/Value; when vectorizing scalar code, Shape.width ends up being the lane width (excluding padding). The problem arises because lanes have different widths for different elwid values, so there *isn't* a single consistent integer value for Shape.width.

Shape.width *is not* the same as SimdShape.width, since SimdShape.width is the width of the whole SIMD signal, which is, from the perspective of user code in an ALU, an arbitrary multiple of Shape.width. That multiplication factor is determined by the calling code and can arbitrarily vary (e.g. we could decide that we want 256-bit wide SIMD ALUs, or we could decide that we want to make a tiny 32-bit cpu and we want 32-bit wide SIMD ALUs (which have 0 64-bit lanes, 1 32-bit lane, 2 16-bit lanes, or 4 8-bit lanes)).

A possible solution that keeps SimdShape as a subclass of Shape is to rename the current SimdShape.width to something like .simd_width, allowing us to instead have SimdShape.width be a SimdMapInt (a subclass of SimdMap and int) of lane widths (the .width components of lane_shapes), that would meet the semantic requirements of Shape. e.g.:

    class SimdShape(Shape):  # aka. SimdLayout
        def __init__(self, lane_shapes):
            if isinstance(lane_shapes, SimdShape):
                self.lane_shapes = lane_shapes.lane_shapes
            else:
                # convert to SimdMap with Shapes for values, handles
                # input dicts, SimdMaps, Shapes, ints, Enums, etc.
                self.lane_shapes = SimdMap.map(Shape.cast, lane_shapes)
            signed = None
            for elwid in global_scope().lane_counts.keys():
                lane_shape = self.lane_shapes.values[elwid]
                if signed is None:
                    signed = lane_shape.signed
                else:
                    assert signed == lane_shape.signed
                assert not isinstance(lane_shape, SimdShape), \
                    "can't have a SimdShape for a lane's shape"
            # Shape.__init__ doesn't like non-int widths,
            # so pass in fake width for now
            super().__init__(1, signed)
            # width is a SimdMap with each lane's bit width for values
            width = SimdMapInt(SimdMap.map(lambda s: s.width,
                                           self.lane_shapes))
            # finish calculating layout...
            ...

    # needed since Shape requires width to be an int
    class SimdMapInt(SimdMap, int):
        def __init__(self, values):
            # convert every value to an int
            values = SimdMap.map(int, values)
            super().__init__(values)

so, a SimdShape would end up with fields like so:

    part_counts = {
        ElWid.F16: 1,   # f16 takes 1 16-bit word
        ElWid.BF16: 1,  # bf16 takes 1 16-bit word
        ElWid.F32: 2,   # f32 takes 2 16-bit words
        ElWid.F64: 4,   # f64 takes 4 16-bit words
    }

    with simd_scope(part_counts, total_parts=8):  # 8 16-bit parts, 128-bit SIMD
        s = SimdShape(XLEN)
        s.signed == False
        s.width == SimdMapInt({
            ElWid.F16: 16,
            ElWid.BF16: 16,
            ElWid.F32: 32,
            ElWid.F64: 64,
        })
        s.lane_shapes == SimdMap({
            ElWid.F16: unsigned(16),
            ElWid.BF16: unsigned(16),
            ElWid.F32: unsigned(32),
            ElWid.F64: unsigned(64),
        })
        s.simd_width == 128  # what used to be SimdShape.width, since renamed
        s.part_count = 8  # 8 16-bit parts
        s.part_width = 16
        ...  # rest of `s`'es fields omitted
        sig = SimdSignal(s)
        sig.shape() == s
        sig.sig.shape() == unsigned(128)

another example:

    with simd_scope(part_counts, total_parts=4):  # 4 16-bit parts, 64-bit SIMD
        s = SimdShape(XLEN)
        s.signed == False
        s.width == SimdMapInt({
            ElWid.F16: 16,
            ElWid.BF16: 16,
            ElWid.F32: 32,
            ElWid.F64: 64,
        })
        s.lane_shapes == SimdMap({
            ElWid.F16: unsigned(16),
            ElWid.BF16: unsigned(16),
            ElWid.F32: unsigned(32),
            ElWid.F64: unsigned(64),
        })
        s.simd_width == 64  # what used to be SimdShape.width, since renamed
        s.part_count = 4  # 4 16-bit parts
        s.part_width = 16
        ...  # rest of `s`'es fields omitted
        sig = SimdSignal(s)
        sig.shape() == s
        sig.sig.shape() == unsigned(64)
(In reply to Luke Kenneth Casson Leighton from comment #60)

thinking about it again, maybe we need to have a scoped global dict of
supported partitioning modes (which elwidths to use for which lanes):

example for FP:

    simd_modes = {
        # 4 lanes each with the shape for F16
        SimdMode.F16x4: (FpElWid.F16,) * 4,
        # 4 lanes each with the shape for BF16
        SimdMode.BF16x4: (FpElWid.BF16,) * 4,
        # 2 lanes each with the shape for F32
        SimdMode.F32x2: (FpElWid.F32,) * 2,
        # 1 lane with the shape for F64
        SimdMode.F64x1: (FpElWid.F64,),
        # we can add other combinations later, such as:
        #SimdMode.F16x2_F32x1: (FpElWid.F16,) * 2 + (FpElWid.F32,),
        # or even int/fp combinations:
        #SimdMode.I32x1_F32x1: (IntElWid.I32, FpElWid.F32,),
    }

(By this point, ElWid would only be used by the SimdSignal system as a
lane-kind designator, so maybe we should have a separate LaneKind enum
type and not use ElWid here; ElWid could still be used for the elwid
Signal, of course)

For an ALU:

    simd_mode = Signal(SimdMode)
    # simd_mode replaces elwid as the Signal that determines which
    # lanes are enabled... it can be set based on elwid of course.

so, if we have the current value of simd_mode == SimdMode.F16x2_F32x1,
then:

    s = SimdSignal(XLEN)
    # s has lanes with shapes u16, u16, and u32.

    mantissa_field = SimdSignal({
        ElWid.F16: 10,
        ElWid.BF16: 7,
        ElWid.F32: 23,
        ElWid.F64: 52,
    })
    # mantissa_field has lanes of u10, u10, and u23

    unbiased_exponent = SimdSignal({
        ElWid.F16: signed(5),
        ElWid.BF16: signed(8),
        ElWid.F32: signed(8),
        ElWid.F64: signed(11),
    })
    # exponent has lanes of s5, s5, and s8
(In reply to Jacob Lifshay from comment #69)
> reading through what you wrote on the wiki:
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
>
> You appear to want to keep forcing SimdShape into being a Shape, and that
> causes problems with SimdShape's current design...
>
> SimdShape can't be an instance of Shape as you are trying to do, because
> Shape.width is the width of a nmigen scalar Signal/Value,

yes, perfect. that means that, in combination with the underlying Signal
(SimdSignal.sig, a private member), down-casting works perfectly.

if Signal.cast() or Signal.like() is called, i expect it to grab the
underlying Signal, grab the width, grab the sign, and blat the entirety of
the underlying Signal into the (new) one. done. no hassle. no
modifications to nmigen required to get that to work. this was the whole
point of UserValue (and UserCastable): that you are *required* to provide
compliance with Shape().

> when vectorizing
> scalar code, Shape.width ends up being the lane width (excluding padding).
> The problem arises because lanes have different widths for different elwid
> values, so there *isn't* a single consistent integer value for Shape.width.
> Shape.width *is not* the same as SimdShape.width,

ah. then this is where you have become confused. it is *defined* so.
that's the whole basis of PartitionedSignal / SimdSignal: that there *is*
an underlying signal (private member, SimdSignal.sig) which *is* by
definition exactly that.

now, will there be blank unused portions of that underlying Signal named
sig? yes. is it the job of SimdShape to help identify them? yes.

all of the eq and other functions have been *specifically* designed
around the underlying signal being of length and sign Shape.width and
Shape.signed. that's the whole point.

> since SimdShape.width is
> the width of the whole SIMD signal, which is, from the perspective of user
> code in an ALU, an arbitrary multiple of Shape.width.
> That multiplication factor is determined by the calling code and can
> arbitrarily vary (e.g. we could decide that we want 256-bit wide SIMD
> ALUs, or we could decide that we want to make a tiny 32-bit cpu and we
> want 32-bit wide SIMD ALUs (which have 0 64-bit lanes, 1 32-bit lane,
> 2 16-bit lanes, or 4 8-bit lanes)).

yes. Shape.width == 256. or Shape.width == 32. downcasting with
Value.cast will pick both of those up.

> A possible solution that keeps SimdShape as a subclass of Shape is to
> rename the current SimdShape.width to something like .simd_width,
> allowing us to instead have SimdShape.width be a SimdMapInt (a subclass
> of SimdMap and int) of lane widths (the .width components of
> lane_shapes), that would meet the semantic requirements of Shape.

uh-uh, that's the wrong way round. yes, it's one of the possibilities,
but it's a mistake: it doesn't allow "clean" downcasting. the route that
you are suggesting would require extensive modification of nmigen, for
nmigen to comprehend a new concept: simd_width.

the idea here is *not* to make extensive modifications to nmigen. we
currently have a ridiculously-small 420-line diff. no way are we doing
extensive nmigen modifications: it would be suicidal for the project.
(In reply to Jacob Lifshay from comment #68)
> looking through the code, you got what part_counts is mixed up:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/
> layout_experiment.py;h=1f0b761a019fae0fd0005a52649c5f5ad080bea2;
> hb=48f1e8c97dd4adc9d9161e362f389fbf70506033#l94
>
> You wrote:
> > 94 # for each element-width (elwidth 0-3) the number of partitions is given
> > 95 # elwidth=0b00 QTY 1 partitions: | ? |
> > 96 # elwidth=0b01 QTY 1 partitions: | ? |
> > 97 # elwidth=0b10 QTY 2 partitions: | ? | ? |
> > 98 # elwidth=0b11 QTY 4 partitions: | ? | ? | ? | ? |
> > 99 # actual widths of Signals *within* those partitions is given separately
> > 100 part_counts = {
> > 101     0: 1,
> > 102     1: 1,
> > 103     2: 2,
> > 104     3: 4,
> > 105 }
>
> part_counts is actually a dict from elwid values to the number of parts *in
> a single lane*, *not* the number of parts in the whole SIMD signal.

tcktcktck... *thinks*... where's the misunderstanding...

we need some terminology. SIMD Lanes seems to refer to the "whole blob"
(the entire batch of elements, irrespective of subdivision):
http://15418.courses.cs.cmu.edu/spring2015/lecture/gpuarch/slide_003
however i have seen definitions where SIMD "lane" refers to the element.
therefore let's try to avoid its use, but assume for the current context
it means "the whole blob irrespective of element-level subdivisions".

"part_counts is actually a dict from elwid values to the number of parts
*in a single lane*"

yep, i'd agree with that (caveat: "lane" being defined as "the whole
blob", e.g. the ALU width)

"*not* the number of parts in the whole SIMD signal"

which would be *after* the subdivisions have taken place, and could
potentially include blank (padding) sections, i.e. PartitionPoints.

would you agree then that one was "requested" and the other "allocated"?
(sort-of.
PartitionPoints is admittedly a different data structure, but layout with the elwidth==i stuff makes it *possible* to express part_points-including-padding)
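one way to read the "requested" vs "allocated" distinction numerically is
the following sketch (plain Python; the names are hypothetical and the
numbers are borrowed from the 5/6/6/6 example later in the thread, with
part_counts read as "partitions in the whole signal at each elwidth"):

```python
# "requested": the element widths the user asks for, per elwidth.
# "allocated": the aligned region each element actually receives,
# which can include blank (padding) bits.
total = 24                               # whole SIMD signal width
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}   # partitions at each elwidth
requested   = {0: 5, 1: 6, 2: 6, 3: 6}   # element widths requested

# each element gets an aligned region of total//count bits
allocated = {ew: total // c for ew, c in part_counts.items()}
padding   = {ew: allocated[ew] - requested[ew] for ew in part_counts}

print(allocated)  # {0: 24, 1: 24, 2: 12, 3: 6}
print(padding)    # {0: 19, 1: 18, 2: 6, 3: 0}
```

at elwidth 0b00 the single 5-bit element is "requested" but a full 24-bit
region is "allocated", leaving 19 bits of padding, matching the diagrams
in the layout discussion.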
i think we're slowly getting to the bottom of where the assumptions are
that caused you to believe that it is impossible to use nmigen with such
tiny modifications (and to believe that it is undesirable to do so).

it revolves around elwidths and the definition of Shape.width.

the past 18 months' work has been fundamentally based on the assumption
that it is, by design, possible to fit. much of that time was spent
verifying that that was possible.

the assumption / definition and all code is based on:

    Shape.width == SimdShape.width == len(SimdSignal._its_internalsignal)

we need - and other users will expect - that Casting will "just work".
SimdSignal should be castable to Signal and all bits *directly* copied
(including unused padding partitions) seamlessly at the bit level.
likewise Record, and other nmigen constructs (sigh, yes, we need to do a
SimdRecord at some point. not a priority).

this *is* how nmigen works. all Value-derivatives *are* copyable from one
to the other by making the fundamental assumption that when converted to
bit level they are all effectively "the same".

https://git.libre-soc.org/?p=nmigen.git;a=blob;f=nmigen/hdl/ast.py;h=372148205697a78280cee4c84cdf20ff2ec173e4;hb=6e7329d5cc593e05dfcb1f770a9cd878226a44ce#l1066

this is Signal.like():

    1066         kw = dict(shape=Value.cast(other).shape(), name=new_name)
    1067         if isinstance(other, Signal):
    1068             kw.update(reset=other.reset, reset_less=other.reset_less,
    1069                       attrs=other.attrs, decoder=other.decoder)
    1070         kw.update(kwargs)
    1071         return Signal(**kw, src_loc_at=1 + src_loc_at)

fundamentally, for that to work it is *required* that SimdSignal have a
shape() function that returns a Shape() or Shape-derivative. the
*entirety* of PartitionedSignal (now SimdSignal) is *fundamentally*
designed on this principle.

the element widths are an addition that is *our* problem to conceptualise
and deal with, and as they are our problem they need to take second
priority when fitting onto nmigen base-level pre-existing concepts.
this is standard textbook OO design, and with nmigen being so well
designed it's remarkable and fortunate that this is even possible.

if however we choose to define the *element* width as the priority then
we are in deep s*** basically, because it would require a catastrophic
cascade of modifications to nmigen *and* abandonment of 18 months of
work.

i am extremely good at identifying these kinds of "minimum retrofit"
opportunities - i have been doing it for 20 years - and it would be
foolish not to exploit them (even if the lead developer does not wish us
to, and reneged on an agreement to consider doing so)
(In reply to Luke Kenneth Casson Leighton from comment #67)
> def layout(elwid, part_counts, lane_shapes=None,
>            fixed_width=None):
>     if lane_shapes is None:
>         # or would this work?
>         # lane_shapes = {i: lane_width // lane_shapes[i]
>         #                for i in part_counts}

actually,

    lane_shapes = {i: lane_width // part_counts[i] for i in part_counts}

and it works well.

the problem is that you've defined lane_shapes to *be* the sum total of
the element width *times* the part_count.

* for an element width of e.g. 16
* and for a part_count[2] == 2
* you require lane_shapes[2] to be **32** not 16

this makes no sense, and is completely counterintuitive.

part_count defines the number of partitions in each elwidth.

lane_shapes naturally should define the *individual element* width, *not*
the element width times the number of partitions. otherwise that
information bleeds back into the actual usage, and it becomes impossible
to define a transparent API that allows dynamic SIMD or Scalar
compile-time switching
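for concreteness, here is that corrected comprehension evaluated
standalone (plain Python; the fixed lane_width of 32 is an arbitrary
illustrative choice):

```python
# element width at each elwidth = total lane width / number of partitions
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
lane_width = 32   # e.g. a fixed_width of 32

lane_shapes = {i: lane_width // part_counts[i] for i in part_counts}
print(lane_shapes)  # {0: 32, 1: 32, 2: 16, 3: 8}
```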
arrgh, ok now i understand what's going on. it's possible to identify
that this is fundamentally broken:

    # compute a set of partition widths
    cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
    print ("cpart_wid", cpart_wid, "part_counts", part_counts)
    cpart_wid = -min(cpart_wid)

for the example given

    part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
    widths_at_elwidth = {0: 5, 1: 6, 2: 12, 3: 24}

it creates:

    cpart_wid [-5, -6, -6, -6]

and produces a minimum of 5 for cpart_wid. whereas what is required is:

    part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
    widths_at_elwidth = {0: 5, 1: 6, 2: 6, 3: 6}  # NO part_count mul-factor

and then for the results to be:

    # elwidth=0b00 1x 5-bit  | unused                    ....5 |
    # elwidth=0b01 1x 6-bit  | unused                   .....6 |
    # elwidth=0b10 2x 12-bit | unused  .....6 | unused  .....6 |
    # elwidth=0b11 4x 24-bit | .....6 | .....6 | .....6 | .....6 |

and therefore there to be ***FOUR*** partition points ***NOT*** 14 as is
currently the case by defining the base width to be 5:

                             v        v        v       vv       v
    # elwidth=0b00 1x 5-bit  |        |        |       || ....5 |
    # elwidth=0b01 1x 6-bit  |        |        |      .....6    |
    # elwidth=0b10 2x 12-bit |        | .....6 |        .....6  |
    # elwidth=0b11 4x 24-bit | .....6 | .....6 | .....6 | .....6 |

the partition points created should therefore be at:

* 0 (which gets deleted, later)
* 5 (to fit the single 5-bit value when elwidth=0b00)
* 6 (to fit the single 6-bit value when elwidth=0b01, and all other 1st
  elements)
* 12 and 18 (to fit the other 2 elwidths, 0b10 and 0b11)
* 24 (which gets deleted, later)

layout as designed creates something absolutely mad, something in excess
of 14 partition points!

    test elwidth=0
    (0, ([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0], 64,
         {0: 5, 1: 6, 2: 12, 3: 24}, 16, 4))

creating that compact set is a lot more involved than simply selecting
the minimum width and dictating that all partitions shall fit onto an
aligned boundary that must always begin at a multiple of that minimum
width.
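the four-points claim above can be checked with a short sketch (plain
Python; the helper name is hypothetical, and it assumes each element sits
at the start of an aligned region of width total//count):

```python
def partition_points(part_counts, widths_at_elwidth, total):
    """collect the bit positions needed by every elwidth's elements"""
    points = set()
    for ew, c in part_counts.items():
        stride = total // c             # aligned region per element
        w = widths_at_elwidth[ew]       # bits actually used in that region
        for i in range(c):
            points.add(i * stride)      # start of each element region
            points.add(i * stride + w)  # end of the used bits
    return sorted(points - {0, total})  # 0 and total get deleted, later

pts = partition_points({0: 1, 1: 1, 2: 2, 3: 4},
                       {0: 5, 1: 6, 2: 6, 3: 6}, 24)
print(pts)  # [5, 6, 12, 18]
```

exactly the four points 5, 6, 12 and 18 come out, matching the diagram.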
intuition tells me it should probably be the other way round: start at the maximum element width and subdivide all others to fit into that. round up all element widths within an elwidth to that max value. assume all partition counts are power-of-two, and i think this is the basis of an algorithm that would slot everything onto suitable "shared" boundaries.
right. ok. i think i have a basis for the algorithm; it is multi-step.

1. (new) assume all partition counts are a power of 2: 1 2 4 8 ...
2. use either fixed_width (if given) or a computed total width.
3. obtain the max number of partitions: max(part_counts.values())
4. divide the total width by maxparts, rounding up.

(4) becomes the partition subdivision size. from there i think the rest
of the algorithm (phase 1) takes over. for the example in
layout_experiment.py that subdivision size should turn out to be *6* not
5.
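steps 2-4 above can be sketched in plain Python (the total-width
computation in step 2 is my assumption: widest element times its
partition count, maximised over elwidths):

```python
import math

part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
widths_at_elwidth = {0: 5, 1: 6, 2: 6, 3: 6}
fixed_width = None

# step 2: use fixed_width if given, else compute a total width
if fixed_width is not None:
    width = fixed_width
else:
    width = max(w * part_counts[ew]
                for ew, w in widths_at_elwidth.items())  # 24

# step 3: obtain the max number of partitions
maxparts = max(part_counts.values())                     # 4

# step 4: divide total width by maxparts, rounded up
subdiv = math.ceil(width / maxparts)
print(width, maxparts, subdiv)  # 24 4 6
```

for the layout_experiment.py example this yields a subdivision size of 6,
as predicted above.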
(In reply to Luke Kenneth Casson Leighton from comment #75)
> arrgh, ok now i understand what's going on, it's possible to identify
> that this is fundamentally broken:
>
>     # compute a set of partition widths
>     cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
>     print ("cpart_wid", cpart_wid, "part_counts", part_counts)
>     cpart_wid = -min(cpart_wid)
>
> for the example given
>
>     part_counts       = {0: 1, 1: 1, 2: 2, 3: 4}
>     widths_at_elwidth = {0: 5, 1: 6, 2: 12, 3: 24}
>
> it creates:
>
>     cpart_wid [-5, -6, -6, -6]
>
> and produces a minimum of 5 for cpart_wid.

no, it produces a minimum of -6, since -6 is less than -5.
(In reply to Luke Kenneth Casson Leighton from comment #74) > the problem is that you've defined lane_shapes to *be* the > sum total of the element width *times* the part_count. no, i didn't. a lane (minus padding) *is always* a single element. a lane for one elwid (including padding) is split into multiple smaller lanes for a *different* elwid. > > * for an element width of e.g. 16 > * and for a part_count[2] == 2 > * you require lane_shapes[2] to be **32** not 16 lane_shapes[2] = unsigned(16) for an element shape of unsigned(16). lane_shapes *never* includes padding. > > this makes no sense, and is completely counterintuitive. > > part_count defines the number of partitions in each elwidth > > lane_shapes naturally should define the *individual element* > width *not* the element width times the number of partitions. yup! > otherwise that information bleeds back into the actual usage > and it becomes impossible to define a transparent API that > allows dynamic SIMD or Scalar compile-time switching Yup!
(In reply to Jacob Lifshay from comment #77) > > and produces a minimum of 5 for cpart_wid. > > no, it produces a minimum of -6, since -6 is less than -5. ngghhh ok yes. min -ve then negate is max. tricks like that always need documenting. ok so what the heck is going on. the new code i added, ended up computing really weird partition widths, and the example you created with 5 6 6 6 produces 14 points.
(In reply to Jacob Lifshay from comment #78)
> (In reply to Luke Kenneth Casson Leighton from comment #74)
> > the problem is that you've defined lane_shapes to *be* the
> > sum total of the element width *times* the part_count.
>
> no, i didn't. a lane (minus padding) *is always* a single element.

ok whew, that's what i'd expect.

> a lane for one elwid (including padding) is split into multiple smaller
> lanes for a *different* elwid.
>
> > * for an element width of e.g. 16
> > * and for a part_count[2] == 2
> > * you require lane_shapes[2] to be **32** not 16
>
> lane_shapes[2] = unsigned(16) for an element shape of unsigned(16).

just 16. not unsigned(16).

> lane_shapes *never* includes padding.

yes, agreed. this makes sense.

> > this makes no sense, and is completely counterintuitive.
> >
> > part_count defines the number of partitions in each elwidth
> >
> > lane_shapes naturally should define the *individual element*
> > width *not* the element width times the number of partitions.
>
> yup!
>
> > otherwise that information bleeds back into the actual usage
> > and it becomes impossible to define a transparent API that
> > allows dynamic SIMD or Scalar compile-time switching
>
> Yup!

nggggh. so why does the example i added, where lane_shapes=None is
calculated from the same part_count and fixed_width=32, give this:

    0: 32
    1: 32
    2: 16
    3: 8

but then the partition cpart_wid gets computed as completely wrong, like
32 32 8 2 or something. can you take a look?
    cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items())

that's definitely dividing the lane_shape _total_ by the part count.
it should probably just be

    cpart_wid = -min(-lane_shapes[i] for i, c in part_counts.items())

and the input, lane_shapes, should be

    0: 5
    1: 6
    2: 6
    3: 6

you definitely gave:

    0: 5
    1: 6
    2: 12
    3: 24

check comment #20

    for i in range(4):
        l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}
        pprint((i, layout(i, part_counts, l)))
(In reply to Luke Kenneth Casson Leighton from comment #73)
> i think we're slowly getting to the bottom of where the assumptions
> are that caused you to believe that it is impossible to use nmigen
> with such tiny modifications (and to believe that it is undesirable
> to do so).
>
> it revolves around elwidths and the definition of Shape.width.
>
> the past 18 months' work has been fundamentally based on the assumption
> that it is, by design, possible to fit. much of that time was spent
> verifying that that was possible.
>
> the assumption / definition and all code is based on:
>
> Shape.width == SimdShape.width == len(SimdSignal._its_internalsignal)

well, the problem is that, to correctly maintain the appearance of
operating on *scalar* values needed for all our existing OpenPower ALUs
(waay more code than PartitionedSignal), we need it to look like a single
element, *not* a list. This means SimdSignal.width *has to be* the width
of a *single* element.

> we need - and other users will expect - that Casting will "just work".
> SimdSignal should be castable to Signal and all bits *directly*
> copied (including unused padding partitions) seamlessly at the bitlevel.

we need Signal/SimdSignal interconversion, but it should be done via
having non-SIMD code access SimdSignal.sig, or by a .bitcast member
function, not by just doing Signal.eq.

In particular, what you're advocating for (for SimdSignal to
transparently inter-convert with Signal, just 1:1 copying bits) will *not
work* nicely -- it's in conflict with what happens when a scalar is
converted to a vector, which is that the scalar is *splatted* into all
lanes of the vector, not that the scalar is split into pieces and each
piece becomes a lane.
Example (addi):

    class MyAddI(...):
        def __init__(self):
            with simd_scope(self, create_elwid=True):
                self.RT = SimdSignal(XLEN)     # output: vector
                self.RA = SimdSignal(XLEN)     # input: vector
                # input: *scalar*, since it's not a register
                self.Imm = Signal(signed(16))

        def elaborate(self, platform):
            m = Module()
            with simd_scope(self, m=m):
                sliced_imm = SimdSignal(XLEN)
                # slice off bits that don't fit in XLEN
                m.d.comb += sliced_imm.eq(self.Imm)
                m.d.comb += self.RT.eq(self.RA + sliced_imm)
            return m

In order for the semantics to be correct, the conversion from self.Imm to
a SimdSignal has to put *all* of Imm's value into each lane -- *instead
of what you are advocating for* -- what you are advocating for would
split Imm into 2-bit/4-bit/8-bit/16-bit pieces and put each of those
pieces into each corresponding lane, which is exactly what a 1:1 bit copy
does, which is *totally wrong* here.

Example (with elwid == 16-bits):

    RA = 0x0123_4567_89AB_CDEF
    instruction: addi rt, ra, 0x7531

Right way (splatting):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (lanes): [0x7531, 0x7531, 0x7531, 0x7531]
    RT (lanes):         [0xCDEF + 0x7531, 0x89AB + 0x7531,
                         0x4567 + 0x7531, 0x0123 + 0x7531]
                     == [0x4320, 0xFEDC, 0xBA98, 0x7654]
    RT = 0x7654_BA98_FEDC_4320

Wrong way (1:1 bit conversion -- convert imm to 16-bit wide simd then
sign extend):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (1:1 converted into lanes): [0x1, 0x3, 0x5, 0x7]
    RT (lanes):         [0xCDEF + 0x1, 0x89AB + 0x3,
                         0x4567 + 0x5, 0x0123 + 0x7]
                     == [0xCDF0, 0x89AE, 0x456C, 0x012A]
    RT = 0x012A_456C_89AE_CDF0

Another wrong way (1:1 bit conversion -- sign extend imm to simd width
then convert):

    RA (lanes):         [0xCDEF, 0x89AB, 0x4567, 0x0123]
    Imm:                0x7531
    sliced_imm (1:1 converted into lanes, only lower 16-bits of full
                64-bits had anything): [0x1357, 0x0, 0x0, 0x0]
    RT (lanes):         [0xCDEF + 0x1357, 0x89AB + 0x0,
                         0x4567 + 0x0, 0x0123 + 0x0]
                     == [0x7654, 0x89AB, 0x4567, 0x0123]
    RT = 0x0123_4567_89AB_7654

> this *is* how
nmigen works. all Value-derivatives *are* copyable > from one to the other by making the fundamental assumption that > when converted to bitlevel they are all effectively "the same". Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal, since nmigen expects Values to act like a single scalar value, whereas SimdSignal acts like a *list* of scalar values -- aka. a Vector.
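the "right way (splatting)" numbers from the addi example above can be
verified with a short sketch (plain Python; the helper names are
hypothetical):

```python
def to_lanes(value, lane_bits=16, nlanes=4):
    """split a packed SIMD value into a list of lane values"""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(nlanes)]

def from_lanes(lanes, lane_bits=16):
    """pack a list of lane values back into one integer"""
    out = 0
    for i, v in enumerate(lanes):
        out |= v << (i * lane_bits)
    return out

RA = 0x0123_4567_89AB_CDEF
imm = 0x7531

ra_lanes = to_lanes(RA)        # [0xCDEF, 0x89AB, 0x4567, 0x0123]
splat = [imm] * 4              # the scalar is broadcast into every lane
rt_lanes = [(a + b) & 0xFFFF for a, b in zip(ra_lanes, splat)]
RT = from_lanes(rt_lanes)
print(hex(RT))                 # 0x7654ba98fedc4320
```

note how each 16-bit lane add wraps independently (the `& 0xFFFF`),
which is exactly what distinguishes the splatted SIMD add from a plain
64-bit scalar add.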
(In reply to Luke Kenneth Casson Leighton from comment #81) > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items()) > > that's definitely dividing the lane_shape _total_ by the part count. that's because that's computing the width of *one* part. a single lane can occupy multiple parts (a lane occupies part_counts[i] parts) > you definitely gave: > > 0: 5 > 1: 6 > 2: 12 > 3: 24 yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6, and 6. That was to demonstrate how the part width is calculated to fit the minimum required bits, which can be less than a lane's width, because that lane occupies multiple parts.
(In reply to Jacob Lifshay from comment #83) > (In reply to Luke Kenneth Casson Leighton from comment #81) > > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items()) > > > > that's definitely dividing the lane_shape _total_ by the part count. it's dividing the width of one lane by the number of parts in that lane, getting the number of bits that lane needs per-part. > > that's because that's computing the width of *one* part. a single lane can > occupy multiple parts (a lane occupies part_counts[i] parts)
(In reply to Jacob Lifshay from comment #82)
> well, the problem is that, to correctly maintain the appearance of
> operating on *scalar* values needed for all our existing OpenPower ALUs
> (waay more code than PartitionedSignal), we need it to look like a single
> element, *not* a list. This means SimdSignal.width *has to be* the width
> of a *single* element.

no! jacob: i *already have* SimdSignal Assign functional and correctly
working. note, i did not say "eq", i said "Signal.like", and talked about
casting. that's completely different.

to move the topic to SimdSignal.eq:

* SimdSignal.eq *already detects* the type of object that it is being
  Assigned to.
* for a SimdSignal.eq(Signal) it performs a scalar-to-vector
  elwidth-context-aware duplication (Vector ISA "broadcast").
* for a SimdSignal.eq(SimdSignal) it performs a full range of detection
  of partition sizes (smaller, equal, greater) and deals with them.

*this is already done*. *it is functional*. you are weeks if not months
behind on the functionality of the design, which *already works*.

here is the relevant code:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_ass/assign.py;h=2a79cbbc596b65ea893b02cbeb79127811326bc5;hb=4fca5d99b872bf14c61d2e73932c9bf340cfd22d#l61

that's all it takes. it's even documented:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/assign/

note: it even works with signed and unsigned, performing
zero/sign-extension or truncation as necessary, for both vector-to-vector
and scalar-to-vector assignment. furthermore, vector-to-scalar
(scalar.eq(vector)) will work by auto-casting down through UserValue
down-casting to bit level, which is exactly what is expected and desired.

> > we need - and other users will expect - that Casting will "just work".
> > SimdSignal should be castable to Signal and all bits *directly*
> > copied (including unused padding partitions) seamlessly at the bitlevel.
> we need Signal/SimdSignal interconversion, but it should be done via
> having non-SIMD code access SimdSignal.sig, or by a .bitcast member
> function, not by just doing Signal.eq.

i did the analysis 2 weeks ago, successfully worked it out, and
successfully implemented Signal.eq. there is no need and absolutely no
desire to complicate things that have already been successfully done and
completed.

> In particular, what you're advocating for (for
> SimdSignal to transparently inter-convert with Signal),

for *casting* - yes. when the developer *asks* for that, by using
Signal.like and casting... yes.

> just 1:1 copying
> bits, will *not work* nicely -- it's in conflict with what happens when a
> scalar is converted to a vector, which is that that scalar is *splatted*
> into all lanes of the vector, not that the scalar is split into pieces
> and each piece becomes a lane.

already handled.

> Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal,

no, it is not! it's already been implemented, and *is* compatible!

> since nmigen expects Values to act like a single scalar value,

no. whitequark *wants* to *define* nmigen to be thus. whitequark, just
before banning me from participation on #nmigen, made it bluntly clear,
as part of reneging on the agreement back in april, that this decision
was made.

whether the code *can* support full abstraction (which, with a mere
420-line diff, it can), whitequark unilaterally decided that that was
completely irrelevant.

> whereas
> SimdSignal acts like a *list* of scalar values -- aka. a Vector.

yes. and therefore the entire suite of operators of SimdSignal can be
*defined* to act in list-form, at every single level in the entirety of
ast.* operations.

it *works* - full stop. there is not one single unit test in the entirety
of test_partsig.py which is not already demonstrating and proving that
this concept - fact - works - fact. please listen to what i am telling
you: it works! fact!
have you run the unit test and / or contributed to the individual modules?
(In reply to Jacob Lifshay from comment #83)
> (In reply to Luke Kenneth Casson Leighton from comment #81)
> > cpart_wid = -min(-lane_shapes[i] // c for i, c in part_counts.items())
> >
> > that's definitely dividing the lane_shape _total_ by the part count.

i did say: it's important to stop using the word "lane" because it has
multiple definitions.

> yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6,
> and 6. That was to demonstrate how the part width is calculated to fit the
> minimum required bits, which can be less than a lane's width, because that
> lane occupies multiple parts.

ok. so now we come back to comment #75: defining widths in terms of
"lanes" is the wrong thing to do.

what you are calling "lane" is "part_count" times "element width".

what is *actually needed* is just "element width".

this is because when converting Scalar code to SIMD, it is necessary to
allow a Signal of width e.g. 16 to be multiplied by the number of
partitions *automatically* when that Signal class is
contextually-replaced with the SimdSignal class (via a compile-time
switch).

if you force people to put what you call a "lane" width in, that Signal
width of 16 will become subdivided by 2 for elwidth=32, subdivided by 4
for elwidth=16, and subdivided by 8 for elwidth=8.

how can you fit a 16-bit computation into 2-bit data when elwidth=8?

you can't, can you?

therefore, so as not to make massive intrusive changes to the Scalar
code, polluting it with SIMD-style context that it doesn't need, the
*element* width has to be passed in to layout(), *not* element width
times partition count.

by passing that *element* width in, layout() will allocate 16-bit
elements *of that size*, at *all* elwidths. this is the desired
behaviour; element width times partition count is not.

is this clear, now?
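the point about passing the *element* width can be shown numerically
(plain Python sketch; the totals computation is my assumption about how
layout() would scale the overall signal):

```python
part_counts = {0: 1, 1: 1, 2: 2, 3: 4}
elem_width = 16   # the *element* width, identical at every elwidth

# desired: the SIMD signal grows with the partition count, while each
# element stays 16 bits wide at every elwidth
totals = {ew: elem_width * c for ew, c in part_counts.items()}
print(totals)      # {0: 16, 1: 16, 2: 32, 3: 64}

# broken interpretation: a 16-bit "lane" width gets subdivided by the
# partition count, leaving elements too small to hold the computation
subdivided = {ew: elem_width // c for ew, c in part_counts.items()}
print(subdivided)  # {0: 16, 1: 16, 2: 8, 3: 4}
```

with the broken interpretation the 16-bit computation is squeezed into
4-bit (or smaller) elements at the higher partition counts, which is
exactly the "can't fit" problem described above.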
(In reply to Luke Kenneth Casson Leighton from comment #85)
> (In reply to Jacob Lifshay from comment #82)
> > > we need - and other users will expect - that Casting will "just work".
> > > SimdSignal should be castable to Signal and all bits *directly*
> > > copied (including unused padding partitions) seamlessly at the bitlevel.

Ok, I read "casting" as "Signal.eq", and didn't think beyond that, my
bad.

> > Well, SimdSignal is *fundamentally* incompatible with nmigen Value/Signal,
>
> no, it is not! it's already been implemented, and *is* compatible!
>
> > since nmigen expects Values to act like a single scalar value,
>
> no. whitequark *wants* to *define* nmigen to be thus.

And I agree with her there: nmigen has always been scalar, where each
Value is a single value, not a list of values. Basically everybody
expects nmigen's Value to be a scalar. If you insist on rewriting nmigen
to change that fundamental assumption, you'll end up with a different
language that is called nmigen only because you are confusing your
fundamental assumption-breaking changes as being compatible with nmigen,
and you can sorta get away with it because nmigen is written in Python
and only checks some of the typing.

I'm advocating for making a set of classes SimdSignal, SimdValue,
SimdShape, etc. that effectively is a Simd-ified version of nmigen
Signal, Value, Shape, etc. *and they're named differently*, because they
fundamentally are lists of scalars, not single scalars.

> whitequark,
> just before banning me from participation on #nmigen,

oh, sad and surprised to hear that... I'll wait a while and then see if I
can convince whitequark to let you back, since your behavior wasn't bad
enough to warrant a ban imho (unless something happened that I didn't
see), though apologising for making mwk feel as if you're talking down to
them wouldn't hurt (apologising for what they perceived as condescending,
which is not what you intended).
pointing out that they're being rude, while true, won't actually help
everyone get along. pointing out and asking people to not be rude is
mostly whitequark's responsibility, not yours, since she is the channel
moderator. Hopefully my pointing this out will help everyone; my
intention isn't to hurt you but to help you realize a potentially better
way to get along with everyone.

> made it bluntly
> clear, as part of reneging on the agreement back in april, that this
> decision was made.
>
> whether the code *can* support full abstraction (which with a mere
> 420-line diff, it can), whitequark unilaterally decided that that was
> completely irrelevant.

Well, I'm saying you're 90% of the way there, but the last 10% of
abstraction is the harder part... it's easier and more consistent to
just have a separate API that mirrors nmigen's API, allowing unmodified
nmigen to work with our code, as well as making our code much clearer.
This wouldn't require that much effort, since we can use most of the
existing PartitionedSignal code with minor modifications and renaming.

> > whereas
> > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
>
> yes. and therefore the entire suite of operators of SimdSignal can be
> *defined* to act in list-form, at every single level in the entirety of
> ast.* operations.

I'm saying the existing design only does a 90% job: it covers list-form
+, -, *, &, ^, |, .eq, etc. It does *not* cover Shape.width,
Signal.like, etc. I'm saying we should finish off getting to 100% of
the API looking like a transparently vectorized (aka SIMT) version of
nmigen Shape/Signal/Value/etc., just with different names, since that
will make it much easier to use/understand.

Note that Shape.width (lane width) not being a plain integer comes from
lanes being XLEN-width, not from being vectorized. Those are orthogonal,
unrelated changes. We could have a totally scalar Signal/Value/etc.
that are all XLEN-ified, that would require modifying Shape.width to be multivalued. We could also vectorize our totally 64-bit-only code, that would leave Shape.width == 64 (since that's the *element* width), even if a vector was 256-bits wide.

+----------------+---------------+-------------------------------------------+
| XLEN-ification | Vectorization |                                           |
+================+===============+===========================================+
| No             | No            | Original Scalar code;                     |
|                |               | Shape.width == 64                         |
+----------------+---------------+-------------------------------------------+
| No             | Yes           | SimdShape.width==64,                      |
|                |               | SimdShape.simd_width==128                 |
+----------------+---------------+-------------------------------------------+
| Yes            | No            | XLShape.width==XLEN==                     |
|                |               | XLMap({I8:8,I16:16,I32:32,I64:64})        |
+----------------+---------------+-------------------------------------------+
| Yes            | Yes           | SimdShape.width==XLEN==                   |
|                |               | SimdMap({I8:8,I16:16,I32:32,I64:64}),     |
|                |               | SimdShape.simd_width==128                 |
+----------------+---------------+-------------------------------------------+

> have you run the unit test and / or contributed to the individual modules?

Yes, I have -- don't forget that I'm the one who originally wrote the PartitionedAdd and PartitionedMultiply -- the inspiration for this whole thing.
(In reply to Luke Kenneth Casson Leighton from comment #86)
> (In reply to Jacob Lifshay from comment #83)
> > yup, because I *wanted* lanes to be 5, 6, 12, and 24 bits wide, not 5, 6, 6,
> > and 6. That was to demonstrate how the part width is calculated to fit the
> > minimum required bits, which can be less than a lane's width, because that
> > lane occupies multiple parts.
>
> ok. so now we come back to comment #75, that by defining widths in terms
> of "lanes" is the wrong thing to do.
>
> what you are calling "lane" is "part_count" times "element width".

no, I'm explicitly not. I explicitly stated that the definition of "lane" I'm using is a vector element. a part is not an element, it is a fixed number of bits (doesn't change with different elwid). a vector element + its padding is made of a power-of-two number of parts; that number of parts is "part_count".

> what is *actually needed* is just, "element width".

yes, exactly (once you add in padding). I've been saying that the whole time.

> this is because when converting Scalar code to SIMD, it is necessary
> to allow a Signal of width e.g. 16 to be multiplied by the number
> of partitions *automatically* when that Signal class is contextually-replaced
> with SimdSignal class (via a compile-time switch).
>
> if you force people to put what you call a "lane" width in,
> that Signal width of 16 will become subdivided by 2 for elwidth=32,
> subdivided by 4 for elwidth=16 and subdivided by 8 for elwidth=8
>
> how can you fit a 16-bit computation into 2-bit data when elwidth=8?
>
> you can't, can you?

Of course you can't if you *insist* on the SimdSignal constructor taking the full-simd-width. It works great, by contrast, if the SimdSignal constructor takes the *element width*, as I've been advocating for a while.

> this is the desired behaviour
>
> elwidth times partition count is not.
>
> is this clear, now?

Yup totally!
After talking it over in today's virtual coffee meeting, we agreed that we should work on finishing the existing design, and leave improvements for later. I do think that having just the XLEN global idea will save us a huge amount of time, so I'm going to try to build a version with just that added, and leave the rest mostly unmodified.
(In reply to Jacob Lifshay from comment #88)
> Of course you can't if you *insist* on the SimdSignal constructor taking the
> full-simd-width. It works great, by contrast, if the SimdSignal constructor
> takes the *element width*, as I've been advocating for a while.

brief: (1am), both are needed. see comment #63.

(In reply to Jacob Lifshay from comment #89)
> I do think that having just the XLEN global idea will save us a huge amount
> of time,

yes. but only if its class is SimdShape which derives from Shape.

    self.XLEN = SimdShape(64, part_count={0:1, 1:2, 2:4, 3:8})

where XLEN//2 returns a new SimdShape with overall width=32, partition counts the same, and vector element widths halved (//2) at all elwidths.

a different mode is needed for when a SimdShape is declared WITHOUT a fixed width, but instead with fixed vector element widths. in that mode the arithmetic operations are done *on the vector element widths*.

two modes. this is essential.

> so I'm going to try to build a version with just that added, and
> leave the rest mostly unmodified.

make it SimdShape. derive from Shape. recalculate layout based on vector element width being passed in.

but first the bug in layout() needs to be fixed.
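The two modes described above can be roughly sketched as follows. This is a hypothetical illustration only, not the actual libre-soc code: the class name and `vec_el_counts` keyword follow the discussion, and the real SimdShape is intended to derive from nmigen's Shape.

```python
class SimdShape:
    """Hedged sketch of the two modes: fixed-width priority (element
    widths derived by dividing the overall width) versus element-width
    priority (overall width derived from the element widths)."""

    def __init__(self, lane_shapes=None, fixed_width=None,
                 vec_el_counts=None):
        assert lane_shapes is not None or fixed_width is not None
        self.vec_el_counts = dict(vec_el_counts)
        if lane_shapes is None:
            # mode 1: fixed width given, auto-derive the element widths
            self.width = fixed_width
            self.lane_shapes = {ew: fixed_width // c
                                for ew, c in self.vec_el_counts.items()}
        else:
            # mode 2: element widths given, derive the overall width
            self.lane_shapes = dict(lane_shapes)
            self.width = max(w * self.vec_el_counts[ew]
                             for ew, w in self.lane_shapes.items())

    def __floordiv__(self, n):
        # XLEN // 2: halve every vector element width, recompute width
        return SimdShape(lane_shapes={ew: w // n
                                      for ew, w in self.lane_shapes.items()},
                         vec_el_counts=self.vec_el_counts)


# XLEN as declared above: fixed width 64, partition counts {1,2,4,8}
XLEN = SimdShape(fixed_width=64, vec_el_counts={0: 1, 1: 2, 2: 4, 3: 8})
```

With this sketch, `XLEN // 2` yields lane_shapes {0: 32, 1: 16, 2: 8, 3: 4} and an overall width of 32, matching the behaviour described in the comment.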
(In reply to Luke Kenneth Casson Leighton from comment #90) > (In reply to Jacob Lifshay from comment #88) > > > Of course you can't if you *insist* on the SimdSignal constructor taking the > > full-simd-width. It works great, by contrast, if the SimdSignal constructor > > takes the *element width*, as I've been advocating for for a while. > > brief: (1am), both are needed. see comment #63. yup, both are needed, but full-simd-width is passed in through a scoped-global, not through SimdSignal.__init__ arguments. > > (In reply to Jacob Lifshay from comment #89) > > > I do think that having just the XLEN global idea will save us a huge amount > > of time, > > yes. but only if its class is SimdShape which derives from Shape. wait...I am planning on XLEN being a type that can be passed into SimdShape, but it isn't a Shape or a SimdShape -- it acts like an integer, not a Shape.
(In reply to Luke Kenneth Casson Leighton from comment #90) > but first the bug in layout() needs to be fixed. yeah yeah, i'll work on that first.
(In reply to Jacob Lifshay from comment #91)
> yup, both are needed, but full-simd-width is passed in through a
> scoped-global, not through SimdSignal.__init__ arguments.

no. won't work. parameter to SimdShape. basically, parameters to layout() become parameters to SimdShape.

full_simd_width named parameter defaults to None; this is the fixed_width param of layout(). gets put into Shape.width directly.

SimdShape constructor 1st arg: if int, this is the vector element width. if dict, it is the layout elwidths thing. actual width is computed and set in Shape.width

> > (In reply to Jacob Lifshay from comment #89)
> >
> > > I do think that having just the XLEN global idea will save us a huge amount
> > > of time,
> >
> > yes. but only if its class is SimdShape which derives from Shape.
>
> wait...I am planning on XLEN being a type

too much focus on "types": transferring concepts from rust is not appropriate. XLEN as a SimdShape, fixed_width=64. see url at top of bugreport.
(In reply to Luke Kenneth Casson Leighton from comment #93) > (In reply to Jacob Lifshay from comment #91) > > wait...I am planning on XLEN being a type > > too much focus on "types", transferring concepts from rust not > appropriate. my point is that the class that XLEN is an instance of, *should* be general enough to use for non-integers (since it would be handy for things like m.Case() patterns, and other places). If we split out the arithmetic from SimdShape into the separate class (SimdMap), then it can handle all the arithmetic, patterns, etc. I'm going to try it out as a separate class, and you can see what you think once it's in there...if it doesn't work, then we can merge it back into SimdShape pretty easily.
(In reply to Jacob Lifshay from comment #94) > (In reply to Luke Kenneth Casson Leighton from comment #93) > > (In reply to Jacob Lifshay from comment #91) > > > wait...I am planning on XLEN being a type > > > > too much focus on "types", transferring concepts from rust not > > appropriate. > > my point is that the class that XLEN is an instance of, *should* be general > enough to use for non-integers (since it would be handy for things like > m.Case() patterns, and other places). not in any existing ALU (there's 12) all Case()s involve operand fields. case extsb, case extsw, case extsh etc. all scalar. > If we split out the arithmetic from > SimdShape into the separate class (SimdMap), then it can handle all the > arithmetic, patterns, etc. > > I'm going to try it out as a separate class, and you can see what you think > once it's in there...if it doesn't work, then we can merge it back into > SimdShape pretty easily. ok. just remember, it needs 2 completely different behaviours: * fixed_width priority (recalculates laneshapes aka vector element widths) * vector element widths priority (recalculates overall width aka Shape.width aka SimdShape.width) basically calls layout() to recalculate. dont go overboard. add, mul, div, sub, is probably about it. operands int more than enough. layout needs fixing first otherwise recalc wont work.
(In reply to Luke Kenneth Casson Leighton from comment #95) > layout needs fixing first otherwise recalc wont work. Fixed: https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=6ddea00d11bbf3301c9266ec10bc5ac1ea693973;hb=9a318256b74054b8d592efe7be298764d0de415a I also added a bunch of docs to layout()
I added XLEN, and refactored layout into a SimdLayout class.

(In reply to Luke Kenneth Casson Leighton from comment #95)
> dont go overboard. add, mul, div, sub, is probably about it.
> operands int more than enough.

umm...well let's just say you won't have to worry about missing some methods...

maybe I should have put it in an alternate branch...
(In reply to Jacob Lifshay from comment #97)
> I added XLEN, and refactored layout into a SimdLayout class.

as a 957-line diff. remember i said that the Project Development Policy is to do small incremental commits?

    git diff 4fca5d | wc
        957    4486   36684

> (In reply to Luke Kenneth Casson Leighton from comment #95)
> > dont go overboard. add, mul, div, sub, is probably about it.
> > operands int more than enough.
>
> umm...well lets just say you won't have to worry about missing some
> methods...

this will take several days for me to review and understand. far from moving the project forward when we are under time pressure, that unfortunately delays it even more.

> maybe I should have put it in an alternate branch...

it's not the branch, it's that it's such a massive diff that it stops me from being able to understand it.

ok. what to do. what have i seen other project lead developers do in the past... i know: yes, it needs reversion, and moving to a branch, and the creation of a series of "planned" (incremental) patches:

1) add SimdMap and util.py but do *not* modify layout_experiment.py
2) fix the bugs in layout_experiment.py (and leave in the accidental example where the layout was {0:5 1:6 2:6 3:6})
3) update layout_experiment.py to a class but do *not* get it to use SimdMap (because it hasn't been reviewed / authorised)
4) then we join the two together

also, leave in the dots:

    + # elwid=F64  1x 24-bit |<--------i24-------->|
    + # elwid=F32  2x 12-bit |<---i12--->|<---i12--->|
    + # elwid=F16  4x 6-bit  |<-i6>|<-i6>|<-i6>|<-i6>|
    + # elwid=BF16 4x 5-bit  |.<i5>|.<i5>|.<i5>|.<i5>|

what you'd done in the (957-line diff) was remove the examples which i'd spent some time analysing and understanding. i _was_ going to add Asserts (which would eventually form the basis of a proper unit test) showing that layout is producing the wrong answer, but because you have made *two* changes combined into one, it is now very difficult to illustrate to you what i mean.
you changed this:

    widths_at_elwidth = {
        0: 5,
        1: 6,
        2: 12,
        3: 24
    }

to this:

    widths_at_elwidth = {
        FpElWid.F64: 24,
        FpElWid.F32: 12,
        FpElWid.F16: 6,
        FpElWid.BF16: 5,
    }

(and likewise part_count)

    part_counts = {
    -    0: 1,
    -    1: 1,
    -    2: 2,
    -    3: 4,
    +    FpElWid.F64: 4,
    +    FpElWid.F32: 2,
    +    FpElWid.F16: 1,
    +    FpElWid.BF16: 1,
    }

which is *two* changes:

1) changing of the actual vector element widths
2) changing of the specification of elwidth from int type to FpElWid type

both of these compounded changes now make it extra work to explain. in the existing layout_experiment.py (as of yesterday), the following was done:

    widths_at_elwidth = {
        0: 5,   # not divided        --> vec element = 5
        1: 6,   # not divided        --> vec element = 6
        2: 12,  # gets divided by 2  --> vec element = 6
        3: 24   # gets divided by 4  --> vec element = 6
    }

oh s*** i think i now know what you were talking about, finally.

    + FpElWid.F64: 4,
    + FpElWid.F32: 2,
    + FpElWid.F16: 1,
    + FpElWid.BF16: 1,

you've still specified F64 as comprising *four* partitions. the requirement is to *NOT* specify the number of *BYTES*. the requirement is to specify the NUMBER OF ELEMENTS. we *require* part_count to be:

    part_counts = {
        0: 1,  # at elwidth=0b00, the number of vector elements = 1
        1: 1,  # at elwidth=0b01, the number of vector elements = 1
        2: 2,  # at elwidth=0b10, the number of vector elements = 2
        3: 4,  # at elwidth=0b11, the number of vector elements = 4
    }

**NOT** "the total number of bytes at elwidth=0b00 shall be one (1)".

much as i love the documentation in layout_experiment.py, unfortunately because it is an "all-or-nothing" commit, the entire lot needs reversion (i'll handle it). util.py can stay as it is.

i really do love the documentation in layout_experiment.py btw. once reverted i'll see if i can extract it.
ok that's done. i was able to keep the documentation / glossary because it was prior to the conversion to a class.

now. i finally understood that you've been specifying part_count as the number of bytes. this will restrict the layout and create too many partition points (14 rather than 5 in the case of the 5-6-6-6 example).

i've *partly* fixed that with a 1-line commit:

    commit 6a50b2d1649fbf1f22be783eaa5ebbb9cdf72f64 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 12:51:09 2021 +0100

        redefine part_counts to be "number of vector elements in a partition"

fixes the bug: it removes the divide operation:

    - cpart_wid = [-lane_shapes[i] // c for i, c in part_counts.items()]
    + cpart_wid = [-lane_shapes[i] for i, c in part_counts.items()]

therefore, actually, cpart_wid probably becomes just max(lane_shapes.values()):

    - cpart_wid = [-lane_shapes[i] for i, c in part_counts.items()]
    - print("cpart_wid", cpart_wid, "part_counts", part_counts)
    - cpart_wid = -min(cpart_wid)
    + print("lane_shapes", lane_shapes, "part_counts", part_counts)
    + cpart_wid = max(lane_shapes.values())

yep, that works.

now, next thing: fix the documentation, as well as rename part_counts to... to... vec_el_counts, so as to avoid this confusion. vec_el_counts is "the number of vector elements required at each elwidth" ... done

    commit 74ec642c0f4eb67a80b785a191ae11875416300a (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:06:25 2021 +0100

        rename part_counts to vec_el_counts to indicate that it is intended
        to be the count of the number of vector elements within a partition

so now there is just one bug remaining:

    5,6,6,6 elements {0: 5, 1: 6, 2: 6, 3: 6}
    lane_shapes {0: 5, 1: 6, 2: 6, 3: 6} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 24 6 4
    dpoints {5: [0], 6: [0, 1, 1, 2, 3], 11: [0], 12: [0, 1, 1, 2], 17: [0], 18: [0, 1, 1, 2], 23: [0]}

the bug is: 11, 17 and 23 should *not* have partition points.
there should only be one element added at elwidth==0, but instead *FOUR* are being added:

* when elwidth=0 the vec_el_count is 1, so one is being added from 0-5
* but there is also one being added from 6-11
* and another from 12-17
* and another from 18-23

also, i see only one when elwidth==3. so it is inverted. i will see if i can fix that.
that's it. that does the job.

         for i, c in vec_el_counts.items():
    +        # calculate part_wid based on overall width divided by number
    +        # of elements.
    +        part_wid = width // c

             def add_p(p):
                 dpoints[p].append(i)  # auto-creates list if key non-existent

    -        for start in range(0, part_count, c):
    +        # for each elwidth, create the required number of vector elements
    +        for start in range(c):

    commit a52b594062be6fd16d37b1f906904720bdda3525 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:25:09 2021 +0100

        fix layout() to put in only the number of *requested* vector elements

    5,6,6,6 elements {0: 5, 1: 6, 2: 6, 3: 6}
    lane_shapes {0: 5, 1: 6, 2: 6, 3: 6} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 24 6 4
    dpoints {5: [0], 6: [1, 2, 3, 3], 12: [2, 3, 3], 18: [2, 3, 3]}

HA! excellent. that's the right answer. only 4 breakpoints are created, not 14.

    211     # specify that the Vector Element lengths are to be *different* at
    212     # each of the elwidths.
    213     # combined with vec_el_counts we have:
    214     # elwidth=0b00 1x 5-bit  | <-- unused -->....5 |
    215     # elwidth=0b01 1x 6-bit  | <-- unused -->.....6 |
    216     # elwidth=0b10 2x 12-bit | unused .....6 | unused .....6 |
    217     # elwidth=0b11 3x 24-bit | .....6 | .....6 | .....6 | .....6 |
    218     # expected partitions   (^)       ^       ^^                (^)
    219     # to be at these points:(|)       |       ||                (|)
                                             18      12 65

the breakpoints at 5, 6, 12 and 18 are now correctly created. i'll add an assert to that effect.

(edit:

    commit 93cc0c33f0177226ee156a7623642335d0a74e71 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 13:36:47 2021 +0100

        add assert to check that the 5-6-6-6 example returns the expected
        partition points 5,6,12,18
)
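The fixed loop can be re-derived as a standalone sketch (hypothetical function name; this is not the actual layout() code, just a reproduction of the rule it now implements): each elwidth contributes only the *requested* number of vector elements, with a partition point at the start and at the end of the bits each element actually uses.

```python
from collections import defaultdict

def partition_points(width, lane_shapes, vec_el_counts):
    """Sketch of the dpoints calculation discussed above: for each
    elwidth, only the requested number of vector elements contributes
    partition points (hypothetical, assumptions: width divides evenly
    by every count)."""
    dpoints = defaultdict(list)
    for ew, count in vec_el_counts.items():
        part_wid = width // count    # space reserved per element
        el_wid = lane_shapes[ew]     # bits the element actually uses
        for start in range(count):
            begin = start * part_wid
            end = begin + el_wid
            if begin != 0:           # no point at bit 0
                dpoints[begin].append(ew)
            if end != width:         # no point at the top bit
                dpoints[end].append(ew)
    return {k: sorted(v) for k, v in sorted(dpoints.items())}
```

Run on the 5-6-6-6 example it reproduces exactly the fixed output quoted above: `{5: [0], 6: [1, 2, 3, 3], 12: [2, 3, 3], 18: [2, 3, 3]}`, i.e. 4 breakpoints rather than 14.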
    commit 27003c0bfe781d4da60a6eff4e26875f4b923f68 (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 15:23:03 2021 +0100

        remove signed. again

i've removed the signed argument from layout(). again. please don't add it in again. it is in absolutely no way relevant to the computations going on inside layout() and it is a major distraction. if the widths were in any way altered *by* the sign, *then and only then* would it be relevant to pass it in as a parameter.

please *only* put in what is *necessary* to each function. sign is *never* going to be relevant to this function.
this example is going wrong:

    # fixed_width=32 and no lane_widths says "allocate maximum"
    # i.e. Vector Element Widths are auto-allocated
    # elwidth=0b00 1x 32-bit | .................32 |
    # elwidth=0b01 1x 32-bit | .................32 |
    # elwidth=0b10 2x 16-bit | ......16 | ......16 |
    # elwidth=0b11 4x 8-bit  | ..8| ..8 | ..8 |..8 |
    # expected partitions (^)     |     |     |   (^)
    # to be at these points: (|)  |     |     |   (|)

output:

    maximum allocation from fixed_width=32
    lane_shapes {0: 32, 1: 32, 2: 16, 3: 8} vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4} width 128 32 4

what is happening is:

    cpart_wid = max(lane_shapes.values())     ==> 32
    part_count = max(vec_el_counts.values())  ==> 4

therefore 4x32 == 128, which is ridiculous and completely wrong: the answer (as can be seen from the manual analysis) should be 32.

a first cut at a "fix" for this is to do something along the lines of:

    cpart_wid = 0
    for i, ls in lane_shapes.items():
        cpart_wid = max(cpart_wid, width // ls)

something like that. i'll try to work through it.
    cpart_wid = max(lane_shapes.values())
    width = cpart_wid * part_count

replaced with:

    + cpart_wid = 0
    + width = 0
    + for i, lwid in lane_shapes.items():
    +     required_width = lwid * vec_el_counts[i]
    +     print("    required width", cpart_wid, i, lwid, required_width)
    +     if required_width > width:
    +         cpart_wid = lwid
    +         width = required_width

    commit 316c02716c365eee298ca3da5a65c3be58b57a1e (HEAD -> master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 15:44:19 2021 +0100

        fix issue where width was being computed based on 2 maximum values

        actually needed is to multiply the number of elements by the width
        of an element and use that to determine which is greater
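The fix can be illustrated standalone (hypothetical helper name; the real code lives inside layout()): the overall width is the largest per-elwidth product of element width times element count, not the product of the two independent maxima.

```python
def compute_width(lane_shapes, vec_el_counts):
    """Sketch of the width fix above: the width needed at each elwidth
    is element-width * element-count; the overall SIMD width is the
    largest of those requirements (hypothetical helper, not the actual
    layout() code)."""
    width, cpart_wid = 0, 0
    for ew, lwid in lane_shapes.items():
        required_width = lwid * vec_el_counts[ew]
        if required_width > width:
            cpart_wid = lwid
            width = required_width
    return width, cpart_wid
```

With the "maximum allocation" example from the previous comment (lane_shapes {0: 32, 1: 32, 2: 16, 3: 8}, vec_el_counts {0: 1, 1: 1, 2: 2, 3: 4}) every product is 32, so this returns width 32 instead of the bogus 128; the 5-6-6-6 example correctly gives 24.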
ok, added this example:

    # example "exponent"
    # https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/
    # 1xFP64: 11 bits, one exponent
    # 2xFP32:  8 bits, two exponents
    # 4xFP16:  5 bits, four exponents
    # 4xBF16:  8 bits, four exponents
    vec_el_counts = {
        0: 1,  # QTY 1x FP64
        1: 2,  # QTY 2x FP32
        2: 4,  # QTY 4x FP16
        3: 4,  # QTY 4x BF16
    }
    widths_at_elwidth = {
        0: 11,  # FP64 ew=0b00
        1: 8,   # FP32 ew=0b01
        2: 5,   # FP16 ew=0b10
        3: 8    # BF16 ew=0b11
    }

which is returning these results:

    dict_keys([5, 8, 11, 13, 16, 21, 24, 29])

which doesn't match expected:

    # expected results:
    #
    #       |31|  |  |24|   16|15  |  | 8|7     0 |
    #       |31|28|26|24|  |20|16| 12|  |10|8|5|4 0 |
    # 32bit |  x|  x|  x|  |  x| x| x|10 .... 0 |
    # 16bit |  x|  x|26 ... 16 |  x| x|10 .... 0 |
    # 8bit  |  x|28 .. 24| 20.16|  x|11 .. 8|x|4.. 0 |
    # unused x                  x

drat. i don't know why. it's calculating the correct (expected) width of 32. unsure what's going on, here.
i added another example which i *think* was what the (original) test was supposed to be, from comment #20:

    l = {0: signed(5), 1: signed(6), 2: signed(12), 3: signed(24)}

i set the following vector element widths:

    + widths_at_elwidth = {
    +     0: 24,  # QTY 1x 24
    +     1: 12,  # QTY 1x 12
    +     2: 5,   # QTY 2x 5
    +     3: 6    # QTY 4x 6
    + }

and it produced reasonable expected results.

    commit 9ea45dcbddc385f394e2c86c4dd4f2a5b98686b9 (HEAD -> master, origin/master)
    Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
    Date:   Wed Oct 13 17:55:49 2021 +0100

        create quick test of what 24-12-5-6 layout was likely-expected to be
I read through all the new layout() code; it looks mostly fine, though afaict the computation of bmask is incorrect if the objective was to figure out which partitions are always padding.
(In reply to Jacob Lifshay from comment #106) > I read through all the new layout() code, looks mostly fine, though afaict > the computation of bmask is incorrect if the object was to figure out which > partitions are always padding. yes, i can't quite work it out. if you can fix that, these algorithmic things do my head in.
(In reply to Luke Kenneth Casson Leighton from comment #107) > (In reply to Jacob Lifshay from comment #106) > > I read through all the new layout() code, looks mostly fine, though afaict > > the computation of bmask is incorrect if the object was to figure out which > > partitions are always padding. > > yes, i can't quite work it out. if you can fix that, these algorithmic > things do my head in. yeah, I can do that tomorrow (actually today, but you get what I meant).
(In reply to Jacob Lifshay from comment #108)
> yeah, I can do that tomorrow (actually today, but you get what I meant).

:)

the blanking mask is important to allow submodules to not allocate unnecessary gates. recap:

* with the width(s) now being the vector element width, intermediary scalar computations "mb = Signal(6)" for e.g. rlwinm will now be straight-SIMDified without any other code changes other than s/Signal/context.SignalKls/:

      with ctx:
          mb = ctx.SignalKls(6)  # 6 => lane_shapes => {0:6, 1:6, 2:6, 3:6}

* by making XLEN a SimdShape it becomes possible to "notify" ctx.SignalKls (when that class is set equal to SimdSignal): "i do not want you to allocate something that is a vector element width of XLEN, i want you to allocate a *fixed width* SIMD Signal"

      class SimdShape(Shape):
          def __init__(self, lane_shapes=None, signed=False, fixed_width=None):
              # detect type (fixed, laneshapes) and store that

* by having SimdShape itself store the detected type (fixed type or laneshape type), when arithmetic is performed the layout() function may be re-called to re-compute the points, dpoints, pbit, blanking mask etc. thus ss + 5 when in laneshape mode will add five to ALL of the laneshape Vector Element lengths (!!!) and as a result the overall width will go up by (like) four TIMES five. but when added in "fixed" mode it should add five to the *whole* length then recompute the Vector Lane widths using the partition counts. obviously here if the programmer fails to take into account the partition allocation, some of the computed Vector Element widths will not fit (XLEN=64 + 3 will be 67, which is not divisible by 8), but that is their problem to sort out

* with the arithmetic being *dual purpose* and operating on laneshapes it makes no sense to have XOR or AND etc. operators (which is why i said "rudimentary"): what does it mean to OR in bits into a variable-length, completely different set of Vector lengths?
it does however make sense to support divide, multiply, shift, add and subtract (by integers) but not much else: for anything else an exception should be thrown.

* unless the arithmetic is part of SimdShape the remapping (calling of layout()) cannot occur, hence why a separate class is not really useful
* with only rudimentary arithmetic making sense it is not a general-purpose class: it is highly specialised and limited, and again does not make sense to be separate.
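The "rudimentary arithmetic only" rule can be illustrated with a small sketch (hypothetical class, not the project's SimdShape): integer add/mul apply to every vector element width simultaneously, while a bitwise operator has no sensible meaning across different simultaneous widths and raises.

```python
class LaneWidths:
    """Hypothetical stand-in for the lane-shapes side of SimdShape,
    illustrating which operators make sense on simultaneous widths."""

    def __init__(self, lane_shapes):
        self.lane_shapes = dict(lane_shapes)

    def _map(self, fn):
        return LaneWidths({ew: fn(w) for ew, w in self.lane_shapes.items()})

    def __add__(self, n):
        # widen every vector element width by n
        return self._map(lambda w: w + n)

    def __mul__(self, n):
        # scale every vector element width by n
        return self._map(lambda w: w * n)

    def __or__(self, other):
        # bitwise ops across simultaneous widths are meaningless
        raise TypeError("bitwise OR on simultaneous vector element "
                        "widths has no meaning")
```

For example, `LaneWidths({0: 8, 1: 16, 2: 32, 3: 64}) + 2` gives widths {0: 10, 1: 18, 2: 34, 3: 66}, whereas `| 0x9` raises TypeError, matching the restriction argued for above.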
(In reply to Jacob Lifshay from comment #87)
> And I agree with her there, nmigen has always been scalar,

not being funny: stating that you agree with whitequark is actually very dangerous for the project. it will be used as ammunition to sabotage what we are doing.

> where each Value
> is a single value, not a list of values. Basically everybody expects
> nmigen's Value to be a scalar. If you insist on rewriting nmigen to change
> that fundamental assumption, you'll end up with a different language,

no, false. did we design a totally new Power ISA Scalar v3.0B because of SVP64? this is the exact same thing. the entirety of the nmigen language *becomes* vectorised... WITHOUT CHANGING THAT LANGUAGE IN ANY WAY.

please, this is very frustrating: that you do not get this or accept it, and keep on advocating massive bodies of unnecessary work as a result.

> that
> is called nmigen only because you are confusing your fundamental
> assumption-breaking

no confusion. no assumptions are broken. please study the RFC and read the code until you understand properly. if you do not understand please ask questions instead of continuing to advocate alternative designs.

> oh, sad and surprised to hear that...I'll wait a while and then see if I can
> convince whitequark to let you back,

i will respond on list

> > made it bluntly
> > clear, as part of reneging on the agreement back in april, that this
> > decision was made.
> >
> > whether the code *can* support full abstraction (which with a mere
> > 420-line diff, it can), whitequark unilaterally decided that that was
> > completly irrelevant.
>
> Well, I'm saying your 90% of the way there, but the last 10% of abstraction
> is the harder part...

it's actually been pretty easy, just a hell of a lot of work (and to be honest i am annoyed that, again, you haven't helped with Cat, Repl, Part, or any of the recent additions, which i asked you to help with). there's over THIRTY operators.
i really needed your help in implementing them and you have instead spent considerable time and resources fighting every single step of the way to *not* help. i do not know why.

leaving that aside for you to think about, there are two REALLY complex ones:

* Array (and ArrayProxy)
* ast.Switch aka SimdSignal.__Switch__

i did the analysis, under bug #458, and whoo, even just the analysis takes a hell of a lot of explaining. unfortunately, because ast.Switch is the basis of m.If, m.FSM and m.Switch, i can't provide "full proof" until it's actually 100% implemented. i need 1 clear week in which to implement it.

i could have had it finished over 10 days ago if you had not kept fighting me with alternative designs that i have already documented and explained would be catastrophic failures, and had cooperated and helped with Part, Cat, Repl etc. if you would like to rectify that you can help implement the __Slice__ override.

> it's easier and more consistent to just have a separate
> API that mirrors nmigen's API, allowing unmodified nmigen to work with our
> code, as well as making our code much clearer. This wouldn't require that
> much effort, since we can use most of the existing PartitionedSignal code
> with minor modifications and renaming.

it results in a massive duplication and maintenance burden. please read the RFC: i documented clearly why what you advocate would be disastrous in multiple ways. last section:

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

> > > whereas
> > > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
> >
> > yes. and therefore the entire suite of operators of SimdSignal can be
> > *defined* to act in list-form, at every single level in the entirety of ast.*
> > operations.
>
> I'm saying the existing design only does a 90% job, it covers list-form +,
> -, *, &, ^, |, .eq, etc.
>
> It does *not* cover Shape.width, Signal.like, etc.
> I'm saying we should finish off getting to 100% of the API looking like a
> transparently vectorized (aka SIMT)

SIMT, i keep telling you, is the wrong word to use. you mean SIMD.

> version of nmigen
> Shape/Signal/Value/etc.,

jacob: please listen to what i am telling you. what you envisage is EXACTLY what is being designed.

> just with different names, since that will make it
> much easier to use/understand. Note that Shape.width (lane width) not being
> a plain integer comes from lanes being XLEN-width, not from being
> vectorized. those are orthogonal unrelated changes. We could have a totally
> scalar Signal/Value/etc. that are all XLEN-ified, that would require
> modifying Shape.width to be multivalued.

yes. that is what lane_shapes is for.

> We could also vectorize our totally
> 64-bit-only code, that would leave Shape.width == 64 (since that's the
> *element* width), even if a vector was 256-bits wide.

> > have you run the unit test and / or contributed to the individual modules?
>
> Yes, I have -- don't forget that I'm the one who originally wrote the
> PartitionedAdd and PartitionedMultiply -- the inspiration for this whole
> thing.

i know, it's brilliant. i meant the more recent ones: Assign, Cat, Repl, and the other (older) ones, __eq__ etc. did you see that there are unit tests which show that scalar-vector interaction passes and functions perfectly?

Signal.like() i haven't got round to doing a unit test for yet; i expect it to work 100% because UserValue is designed specifically to allow exactly that, through the UserValue.lower() function.

i know exactly what the requirements are, having been working on this for over 18 months, and if i am honest i am getting quite annoyed that you keep telling me "what you have designed is a failure and clearly can't work" rather than taking the time to understand *how* it works, when i keep telling you it does, in fact, actually work.
(In reply to Luke Kenneth Casson Leighton from comment #109)
> (In reply to Jacob Lifshay from comment #108)
> but when added in "fixed" mode it should add five to the
> *whole* length then recompute the Vector Lane widths
> using the partition counts. obviously here if the
> programmer fails to take into account the partition
> allocation some of the computed Vector Element widths
> will not fit (XLEN=64 +3 will be 67 which is not
> divisible by 8) but that is their problem to sort out

I'd argue strongly that we should specifically not do that; what we should do instead is mirror how XLEN works in the OpenPower spec pseudo-code, where XLEN + 5 means that each per-vector-element instance ends up with types 5 bits wider, so:

    elwid=8:  XLEN + 5 -> element width of  8 + 5 == 13
    elwid=16: XLEN + 5 -> element width of 16 + 5 == 21
    elwid=32: XLEN + 5 -> element width of 32 + 5 == 37
    elwid=64: XLEN + 5 -> element width of 64 + 5 == 69

The above is the semantics I explicitly rely on in the pseudo-code for addg6s (where it creates values with a width of XLEN + 4, because it needs 4 extra bits to prevent the logic from overflowing); we should not make SimdSignal's XLEN behave differently just because you like doing arithmetic on the full SIMD width.
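The per-element XLEN arithmetic being argued for can be shown with a tiny sketch (hypothetical SimdMap-like dict and helper name; the real XLEN object is what is under discussion here):

```python
def xlen_plus(xlen_map, n):
    # XLEN + n: each elwid's element width grows by n independently,
    # mirroring how XLEN arithmetic reads in the OpenPower spec
    # pseudo-code (hypothetical helper, names are illustrative)
    return {ew: w + n for ew, w in xlen_map.items()}

# per-elwid element widths
XLEN = {8: 8, 16: 16, 32: 32, 64: 64}
```

Here `xlen_plus(XLEN, 5)` gives {8: 13, 16: 21, 32: 37, 64: 69}, and the addg6s case of XLEN + 4 gives {8: 12, 16: 20, 32: 36, 64: 68}.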
(In reply to Jacob Lifshay from comment #111) > (In reply to Luke Kenneth Casson Leighton from comment #109) > > (In reply to Jacob Lifshay from comment #108) > > but when added in "fixed" mode it should add five to the > > *whole* length then recompute the Vector Lane widths > > using the partition counts. obviously here if the > > programmer fails to take into account the partition > > allocation some of the computed Vector Element widths > > will not fit (XLEN=64 +3 will be 67 which is not > > divisible by 8) but that is their problem to sort out > > I'd argue strongly that we should specifically not do that, what we should > do instead is mirror how XLEN works in the OpenPower spec pseudo-code, where > XLEN + 5 means that each per-vector-element instance ends up with types 5 > bits wider, so: > > elwid=8: XLEN + 5 -> element width of 8 + 5 == 13 > elwid=16: XLEN + 5 -> element width of 16 + 5 == 21 > elwid=32: XLEN + 5 -> element width of 32 + 5 == 37 > elwid=64: XLEN + 5 -> element width of 64 + 5 == 69 > > The above is the semantics I explicitly rely on in the pseudo-code for > addg6s (where it creates values with a width of XLEN + 4, because it needs 4 > extra bits to prevent the logic from overflowing), we should not make > SimdSignal's XLEN behave differently just because you like doing arithmetic > on the full SIMD width. After thinking a bit, the actual issue is that element-widths are the thing arithmetic needs to happen on -- you can come up with whatever algorithm you like for deciding what full simd width to use, as long as the element widths take priority.
(In reply to Jacob Lifshay from comment #112)
> After thinking a bit, the actual issue is that element-widths are the thing
> arithmetic needs to happen on

yes, good, you understand now.

> -- you can come up with whatever algorithm you
> like for deciding what full simd width to use,
> as long as the element widths
> take priority.

(that algorithm is: layout())

there are actually two separate and distinct cases.

1) the addg6s example you gave (which is the same as the rlwinm example)

2) the times where the whole SIMD Signal is defined by the register file
width (64 bit fixed width)

(2) will come *in* from the definition of the regspecs.

yes the priority is definitely (1) - to be honest we can probably get away
with not having (2) for now and see how it goes.

edit: but you see now why only rudimentary arithmetic can be performed?
because the arithmetic needs to be done simultaneously on *all* vector
element widths at *all* elwidths. bitwise OR of 0x9 onto a simultaneous
vector width of 8 and 16 makes absolutely no sense.
(In reply to Jacob Lifshay from comment #112)
> After thinking a bit, the actual issue is that element-widths are the thing
> arithmetic needs to happen on -- you can come up with whatever algorithm you
> like for deciding what full simd width to use, as long as the element widths
> take priority.

likewise, with some thought, tending to agree with you, here.

the fixed_width scenario: after creating the dictionary of Vector Element
Widths (vec_el_widths in layout()) i *think* that from that point onwards
it's ok to do sums on those widths (basic ones!)

if people really really really want to do things based on fixed_width then,
well, they declare a new SimdShape.

this is based on the (preliminary) premise that if you set a fixed_width of
say 64, and combined that with part_count of {1,2,4,8} to auto-create
vec_el_widths of {64,32,16,8}, and you add say 2 to that (to create room
for carry-in and carry-out in the ADD ALU for example), you should
reasonably expect a new vec_el_widths of {64+2,32+2,16+2,8+2}, this *even
though* you initially created the SimdShape with a fixed_width.

i updated the url btw.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/?updated
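The premise above can be sketched with plain dicts (the helper name
`vec_el_widths_from_fixed` is hypothetical, for illustration only): a
fixed_width of 64 combined with part_counts {1,2,4,8} auto-creates the
element widths, and "+ 2" then applies to the *element* widths:

```python
# Hypothetical helper: auto-compute element widths from a fixed width
# and the set of partition counts (element width = fixed_width // count).
def vec_el_widths_from_fixed(fixed_width, part_counts):
    return {count: fixed_width // count for count in part_counts}

widths = vec_el_widths_from_fixed(64, {1, 2, 4, 8})
# widths == {1: 64, 2: 32, 4: 16, 8: 8}

# "+ 2" (e.g. room for carry-in/carry-out) applies per element width,
# even though the shape was originally created with a fixed_width
widened = {count: w + 2 for count, w in widths.items()}
# widened == {1: 66, 2: 34, 4: 18, 8: 10}
```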
(In reply to Luke Kenneth Casson Leighton from comment #110)
> (In reply to Jacob Lifshay from comment #87)
> > And I agree with her there, nmigen has always been scalar,
>
> not being funny: stating that you agree with whitequark is actually very
> dangerous for the project. it will be used as ammunition to sabotage what
> we are doing.
>
> > where each Value
> > is a single value, not a list of values. Basically everybody expects
> > nmigen's Value to be a scalar. If you insist on rewriting nmigen to change
> > that fundamental assumption, you'll end up with a different language,
>
> no, false. did we design totally new Power ISA Scalar v3.0B because of
> SVP64?

nope, but SVP64 instructions *are* encoded differently (they have a svp64
prefix), which is just like the renaming i'm advocating for. By renaming,
that could be something like:

    from simd_nmigen import Signal, Module, Value, Shape

> this is the exact same thing.
>
> the entirety of the nmigen language *becomes* vectorised...
> WITHOUT CHANGING THAT LANGUAGE IN ANY WAY

This only works if *every* property of our new types that nmigen inspects
(including Shape.width and Signal's inputs being `Shape.cast`ed) behaves
exactly like how nmigen expects, which isn't the case currently since you
picked Shape.width to be full-simd-width instead of element width, among
other discrepancies.

> > Well, I'm saying you're 90% of the way there, but the last 10% of
> > abstraction is the harder part...
>
> it's actually been pretty easy, just a hell of a lot of work (and to
> be honest i am annoyed that, again, you haven't helped with Cat, Repl,
> Part, or any of the recent additions, which i asked you to help with)

well, actually, I wrote Cat and Repl implementations (in simd_signal); i
think they just got kinda ignored, or were written shortly after you wrote
Cat and Repl, i can't recall.

> there's over THIRTY operators.
> i really needed your help in implementing
> them and you have instead spent considerable time and resources fighting
> every single step of the way to *not* help. i do not know why.
> leaving that aside for you to think about

I'm trying to help, but part of helping is trying to get the API design
correct before we spend even more work baking an API into all our code
that ends up being changed because it didn't actually work as originally
designed...

> > there are two REALLY complex ones:
> > * Array (and ArrayProxy).
> > * ast.Switch aka SimdSignal.__Switch__.
>
> i did the analysis, under bug #458, and whoo, even just the analysis
> takes a hell of a lot of explaining.
>
> unfortunately because ast.Switch is the basis of m.If, m.FSM and
> m.Switch, i can't provide "full proof" until it's actually 100%
> implemented.

I said I would implement and I will...once we decide which API it'll be
based on (hence arguing in here about the API). We're 99% of the way
through deciding API.

While SimdSignal itself will probably be faster to write, actually *using*
it, as it was designed, would be waay more complex, because it does change
the semantics of how nmigen operates, such as requiring the shape input to
Signal to be an int or dict of ints, whereas all our code expects to be
able to pass Shape-castable types to Signal, example:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/fsm.py;h=1b22ca6f3f145f58e547451f496106e07bcc188d;hb=HEAD#l102

If we add some lines to layout/SimdLayout, then it automatically handles
*all* Shape-castable inputs:

    class SimdLayout(Shape):
        def __init__(self, lane_shapes=1, ...):
            # Signal() gives shape == unsigned(1), so we match that.
            # We use lane_shapes=signed(...) rather than a `signed` argument.
            ...
            if isinstance(lane_shapes, SimdLayout):
                # next two lines should probably be factored out
                # into a lane_shapes property
                s = lane_shapes.signed
                lane_shapes = SimdMap.map(lambda v: Shape(v, s),
                                          lane_shapes.lane_widths)
            else:
                lane_shapes = SimdMap.map(Shape.cast, lane_shapes)
            lane_widths = {}
            signed = None
            for i in vec_el_counts:
                shape = lane_shapes[i]
                if signed is None:
                    signed = shape.signed
                else:
                    assert signed == shape.signed
                lane_widths[i] = shape.width
            assert signed is not None
            # rest of code uses lane_widths, not lane_shapes
            ...

> if you would like to rectify that you can help implement the
> __Slice__ override.

I can work on slicing.

> > > it's easier and more consistent to just have a separate
> > API that mirrors nmigen's API, allowing unmodified nmigen to work with our
> > code, as well as making our code much clearer. This wouldn't require that
> > much effort, since we can use most of the existing PartitionedSignal code
> > with minor modifications and renaming.
>
> it results in a massive duplication and maintenance burden. please
> read the RFC, i documented clearly why what you advocate would be
> disastrous in multiple ways.

yeah, now i can see how it could have a big maintenance burden, however
imho forking nmigen may have a bigger maintenance burden.

> last section:
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

Replying to wiki concerns:

> All of these ideas, unfortunately, are extremely costly in many different
> ways:
>
> 1. Any overrides of any kind give a catastrophically fatal impression
> that the high-level language behaviour might in some tiny way be
> different, purely because of the very *existence* of an override class.

I think that won't be much of a problem if we specifically state throughout
our docs that our API is intended to be equivalent to nmigen. This is
similar to how pypy provides extra features over cpython, yet implements
the exact same language. No one thinks that pypy's existence means that
cpython is useless...

> 2.
Proponents of such override mechanisms (and there have been many) > fundamentally misunderstand the distinction between Type 1 (AST) > low-level > and Type 2 (dsl.Module) high-level nmigen language constructs, and > miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST > (Type 2) constructs. you mean implemented in AST (Type 1)? No, I understand your point about that completely, I just think that your point isn't sufficient justification for permanently forking nmigen. > 3. Wrapper classes are a maintenance nightmare. Unless included in the > library itself, any update to the library being wrapped is a huge risk > of breaking the wrapper. well, if we design the wrappers such that we only use the public API of nmigen, nmigen's API stability prevents us from needing to rewrite our code -- it is the exact same stability that prevents *all other* nmigen users from needing to rewrite their code when nmigen is updated. Wrappers, if properly designed, don't ever go in and mess with nmigen's internals, they only provide an API that looks like nmigen. > Long-term this is completely unsustainable. > Likewise monkey-patching. Monkey-patching is messing with nmigen's internals and I never advocated for it. > 4. Wrapper classes introduce huge performance degradation. Every function > requires one additional function call. Every object instantiated > requires > both the wrapper class to be instantiated and the object being wrapped, > every single time. Is this really such a performance sink? if so, it makes me think using Python was a bad choice...Rust has no issues with wrappers, in fact, it has a special type layout specifically intended for wrappers: `#[repr(transparent)]`. Rust totally lives up to the zero-cost abstractions promise. > A single wrapper class is never enough: the entire > class hierarchy, everything that is ever instantiated by a "wrapped" > class instance, also requires wrapping. 
Well, since we're only using SIMD in ALUs, we only need to wrap nmigen's core API, which is only 5 or 6 classes (ignoring the ones we're overriding anyway) -- I did go through the whole public api of nmigen.hdl and check them all. Class list (additional classes we'd need to override): Value, Const, Statement, Assign, Module, and maybe Record > This quickly > gets completely out of hand and diverts developer time into a > nightmare time-sink that has actively harmful consequences *even > once completed*. > Memory requirements double; performance degrades. I highly doubt memory requirements are anything to worry about, i'd expect it to be on the order of 100MB or 1GB at the most. > Both are unacceptable. I think that's exaggerating... > 5. "Explicit coding" is actually more costly even than for-loop > code duplication. It is "The Right Thing (tm)" in order to achieve > optimal gate-level design but requires hyper-aware and hyper-intelligent > developers, whom, if ever lost, take knowledge of the internal workings > of ultra-complex code with them. Agreed. > 6. "Low-level classes" have in fact already been implemented: over a > dozen modules utilised and combined behind PartitionedSignal exist > and are already in use. PartitionedMux was one of the very first > SIMD-aware > "Type 1" AST operators written. The problem is that it's nowhere near > enough. Conversion of 100,000 lines of readable nmigen HDL using > Type 2 (m.If/Elif) high-level constructs to **not** use those > constructs and return to AST Mux, Cat, and so on is a completely > unreasonable expectation. Agreed, though I'll admit that converting everything to use Mux was my original plan a year ago. > 7. "Replacements for dsl.Module" aside from coming with the catastrophic > implication that they provide alternative behaviour from dsl.Module > are a heavy maintenance burden. 
> Not only that but there is the risk
> that, unless actually included *in* nmigen, the underlying AST modules
> that they use might change, or be deprecated, or have bugs fixed which
> break the replacement.

I addressed those points above.

> 8. "Compilers". It is very well known that any python program that outputs
> another python program as an ASCII output may then immediately read that
> ASCII output from within the very program that created it, `eval` it,
> and then execute it. That being the case, you might as well skip the
> output to ASCII and the eval part, and create classes that
> dynamically and directly instantiate the desired target behaviour.
> This saves months of effort but is a common mistake, frequently made.

well, compilers do have the benefit of being waay easier to debug, since
you can see their whole output... the alternative (just creating classes
at runtime) is waay more opaque.

Back to bugzilla message:

> > > > whereas
> > > > SimdSignal acts like a *list* of scalar values -- aka. a Vector.
> > >
> > > yes. and therefore the entire suite of operators of SimdSignal can be
> > > *defined* to act in list-form, at every single level in the entirety
> > > of ast.* operations.
> >
> > I'm saying the existing design only does a 90% job, it covers list-form +,
> > -, *, &, ^, |, .eq, etc.
> >
> > It does *not* cover Shape.width, Signal.like, etc.
> >
> > I'm saying we should finish off getting to 100% of the API looking like a
> > transparently vectorized (aka SIMT)
>
> SIMT i keep telling you is the wrong word to use. you mean SIMD

I mean SIMT, not SIMD. I'm explicitly referring to what the code looks
like when compared to what SIMT/SIMD looks like from a SW programmer's
perspective. C++ SIMD code looks like this:

    void f(f32x4 a, f32x4 b) {
        f32x4 c = a + b; // element-wise add -- looks same, good!
        static_assert(sizeof(a) == 16); // size of whole vector -- different, bad
        u32x4 bits = (u32x4)a; // bitcast -- different, bad
        // fp->int conversion -- different, bad
        u32x4 values = __builtin_convertvector(a, u32x4);
        i32x4 gt = a > b; // result is not bool, bad
        // mux -- can't use `if`, bad
        f32x4 max = (f32x4)((i32x4)a & gt | (i32x4)b & ~gt);
    }

Notice how all types are separate wide SIMD types and operations often do
something other than the element-wise version of what scalar code would do.

C++ SIMT code looks like this:

    void f(float a, float b) {
        float c = a + b; // element-wise add
        static_assert(sizeof(a) == 4); // size of element
        uint32_t bits = bit_cast<uint32_t>(a); // bitcast
        uint32_t values = (uint32_t)a; // fp->int conversion
        bool gt = a > b;
        float max = b;
        if(gt)
            max = a;
    }

Notice how all types are element types, operations always operate
element-wise and look exactly like scalar code, and control-flow
constructs can be used.

> > > version of nmigen
> > > Shape/Signal/Value/etc.,
>
> jacob: please listen to what i am telling you. what you envisage is EXACTLY
> what is being designed.

good!

> > just with different names, since that will make it
> > much easier to use/understand. Note that Shape.width (lane width) not being
> > a plain integer comes from lanes being XLEN-width, not from being
> > vectorized. those are orthogonal unrelated changes. We could have a totally
> > scalar Signal/Value/etc. that are all XLEN-ified, that would require
> > modifying Shape.width to be multivalued.
>
> yes. that is what lane_shapes is for.

I'm saying it's specifically Shape.width that should be multi-valued,
because Shape.width is part of the nmigen API so it should be the SIMT
element type's width, not the full-SIMD-vector's width. lane_shapes,
because it's not part of the nmigen API, can be whatever we choose.
(In reply to Luke Kenneth Casson Leighton from comment #113)
> (In reply to Jacob Lifshay from comment #112)
>
> > After thinking a bit, the actual issue is that element-widths are the thing
> > arithmetic needs to happen on
>
> yes, good, you understand now.

I've always understood that part, I've just been trying to tell you that
cuz it appeared as though you were advocating for stuff to work on the full
simd width and then calculate element widths based on that.

> > -- you can come up with whatever algorithm you
> > like for deciding what full simd width to use,
> > as long as the element widths
> > take priority.
>
> (that algorithm is: layout())
>
> there are actually two separate and distinct cases.
>
> 1) the addg6s example you gave (which is the same as the rlwinm example)
>
> 2) the times where the whole SIMD Signal is defined by the register file
> width (64 bit fixed width)

I've been planning on register i/o signals just being set to XLEN, since
that gives a 64-bit simd signal when vec_el_counts is set up for 64-bit
vectors; it also automatically adapts if we want a 32-bit cpu, by setting
vec_el_counts to:

    vec_el_counts = {
        IntElWid.I32: 1,
        IntElWid.I16: 2,
        IntElWid.I8: 4,
    }

> edit: but you see now, why because the arithmetic needs to be done
> simultaneously on *all* vector element widths at *all* elwidths,
> only rudimentary arithmetic can be performed?

that's not actually the case...

> bitwise OR of 0x9 onto a simultaneous vector width of 8 and 16 makes
> absolutely no sense.

There aren't many uses, but it still makes sense for XLEN | 9:

    elwid=8:  8 | 9 = 9
    elwid=16: 16 | 9 = 25
    elwid=32: 32 | 9 = 41
    elwid=64: 64 | 9 = 73

Example use: `some_shape | 0x7` to round it up to 1 less than a multiple
of 8.

Also, when implementing __add__ and friends, don't forget cases like:
XLEN + XLEN % 2
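The element-wise bitwise-OR described above can be sketched on a plain dict
stand-in for a per-elwid width map (`elwid_or` is a hypothetical helper; the
real code would overload `__or__` on the SIMD shape class):

```python
# Hypothetical helper: apply a bitwise OR per element width,
# one result per elwid setting.
def elwid_or(widths, const):
    return {elwid: w | const for elwid, w in widths.items()}

XLEN = {8: 8, 16: 16, 32: 32, 64: 64}
assert elwid_or(XLEN, 9) == {8: 9, 16: 25, 32: 41, 64: 73}

# the `| 0x7` example: rounds each width up to 1 less than a multiple of 8
assert elwid_or(XLEN, 0x7) == {8: 15, 16: 23, 32: 39, 64: 71}
```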
Other cases where XLEN should probably be something separate from a
SimdShape:

XLEN - pop_count(...)

log2(XLEN)

ceil-div (temporarily has negative values; i expect layout to not work
with negative inputs):
a = -(-b // 5)
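Evaluated per elwid on a plain dict stand-in (hypothetical `elwise` helper,
for illustration only), the log2 and ceil-div cases above look like this --
note the ceil-div goes through a negative intermediate value, which is why
it would not survive a layout() that rejects negative widths:

```python
import math

# hypothetical element-wise-apply helper over per-elwid width dicts
def elwise(f, *maps):
    return {k: f(*(m[k] for m in maps)) for k in maps[0]}

XLEN = {8: 8, 16: 16, 32: 32, 64: 64}

log2_xlen = elwise(lambda w: int(math.log2(w)), XLEN)
# {8: 3, 16: 4, 32: 5, 64: 6}

# ceil-div by 5 via the -(-b // 5) idiom: -b is negative mid-expression
ceil_div5 = elwise(lambda w: -(-w // 5), XLEN)
# {8: 2, 16: 4, 32: 7, 64: 13}
```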
(In reply to Jacob Lifshay from comment #117)
> Other cases where XLEN should probably be something separate than a
> SimdShape:
> XLEN - pop_count(...)
>
> log2(XLEN)
>
> ceil-div:
> (temporarily has negative values, i expect layout to not work with negative
> inputs)
> a = -(-b // 5)

    exponent_bias = (1 << (fp_exponent_sizes[XLEN] - 1)) - 1
    unbiased_exponent = Signal(signed(fp_exponent_sizes[XLEN] + 1))
    biased_exponent = Signal(fp_exponent_sizes[XLEN])
    biased_exponent.eq(exponent_bias + unbiased_exponent)
(In reply to Jacob Lifshay from comment #108)
> (In reply to Luke Kenneth Casson Leighton from comment #107)
> > (In reply to Jacob Lifshay from comment #106)
> > > I read through all the new layout() code, looks mostly fine, though afaict
> > > the computation of bmask is incorrect if the object was to figure out which
> > > partitions are always padding.
> >
> > yes, i can't quite work it out. if you can fix that, these algorithmic
> > things do my head in.
>
> yeah, I can do that tomorrow (actually today, but you get what I meant).

Done. I changed bmask to be such that bmask & (1 << i) is set if partition
index i is always padding. Note that there is one more partition than the
number of ppoints, since ppoints separate partitions rather than being
partitions themselves.

Demo manually worked out in comments:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=6c32022ec750433ff5adca21f4345cfca51181e2;hb=HEAD#l324

# combined with vec_el_counts {0:1, 1:1, 2:2, 3:4} we have:
#     elwidth=0b00 1x 24-bit
#     elwidth=0b01 1x 12-bit
#     elwidth=0b10 2x 5-bit
#     elwidth=0b11 4x 6-bit
#
# bmask<--------1<----0<---------10<---0<-------1<0<----0<---0<----00<---0
# always unused:| | | || | | | | | | || |
# 1111111111000000 1111111111000000 1111111100000000 0000000000000000
# | | | || | | | | | | || |
# 0b00 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx........ ..............24|
# 0b01 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxx..........12|
# 0b10 xxxxxxxxxxxxxxxx xxxxxxxxxxx....5|xxxxxxxxxxxxxxxx xxxxxxxxxxx....5|
# 0b11 xxxxxxxxxx.....6|xxxxxxxxxx.....6|xxxxxxxxxx.....6|xxxxxxxxxx.....6|
#      ^ ^ ^^ ^ ^ ^ ^ ^ ^^
# ppoints: | | || | | | | | ||
#      | bit-48 /\ | bit-24-/ | | bit-12 /\-bit-5
#      bit-54 bit-38-/ \ bit-32 | bit-16 /
#      bit-37 bit-22 bit-6

So, in that example bmask = 0b101001000000
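The bmask semantics just described can be sketched in a few lines of plain
Python. The input format here (a dict mapping each elwidth to a
per-partition "in use" list) is hypothetical, chosen for clarity; the real
computation lives in layout_experiment.py:

```python
# Sketch: bit i of bmask is set when partition index i is *always*
# padding, i.e. unused at every elwidth setting.
def compute_bmask(used_by_elwidth):
    # used_by_elwidth: {elwidth: [bool per partition index], ...}
    npart = len(next(iter(used_by_elwidth.values())))
    bmask = 0
    for i in range(npart):
        if all(not used[i] for used in used_by_elwidth.values()):
            bmask |= 1 << i
    return bmask
```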
(In reply to Jacob Lifshay from comment #119) > Done. I changed bmask to be such that bmask & (1 << i) is set if partition > index i is always padding. Note that there is one more partition than the > number of ppoints, since ppoints separate partitions, not are partitions. > > Demo manually worked out in comments: > https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/ > layout_experiment.py;h=6c32022ec750433ff5adca21f4345cfca51181e2;hb=HEAD#l324 darn it closed the tab by mistake lost a reply. try again. this is really good. saved several days, i would have taken that long to work this out. thank you. the demo / test is really clear, shows what is going on. also i noticed you did incremental commits, really appreciated. each one took like only 30 seconds to review, really easy to understand. i am under enormous pressure at the moment so this was important. Slice and Part next. will find the bugreport. (edit: bug #716)
context: this is starting to get complicated so i am moving the discussion
back to the bugtracker
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-October/003972.html

i just put in some unused code (in advance of setting up the examples in
the URL https://libre-soc.org/3d_gpu/architecture/dynamic_simd/shape/)
https://git.libre-soc.org/?p=ieee754fpu.git;a=commitdiff;h=417f7f93bf3baf576e537cde32f118b2a50eab46

it's over 50% comments because it's... complicated (dual mathematics rules,
PRIORITY_FIXED, PRIORITY_ELWID).

basically: when both LHS and RHS are PRIORITY_FIXED (both had a lane_shapes
of None), then it should be obvious that the "Correct Thing To Do" is just
to multiply the two fixed_widths together and create a SimdShape out of
that.

why? because *both* LHS and RHS were established (created) with the
requirement that "elwidths shall be auto-computed to fill the entirety of
the partition, as determined by overall width and the vec_el_counts,
regardless of what elwid is set to".

on the other hand, when *either* LHS or RHS have had an elwidths argument
given, it is the elwidths that have to be preserved (multiplied)
individually:

    result.lane_shapes[0] = LHS.lane_shapes[0] * RHS.lane_shapes[0]
    result.lane_shapes[1] = LHS.lane_shapes[1] * RHS.lane_shapes[1]
    ...
    ...

the only case i'm not sure about is PRIORITY_BOTH, there. i don't believe
it's safe for the return result to be PRIORITY_BOTH, because you could have
LHS[0] = 1000, RHS[0] = 1, and LHS[1] = 1, RHS[1] = 999, and a fixed_width
doesn't work out, there.
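The two multiply rules described above can be sketched with plain tuples
standing in for SimdShape. The representation here -- (priority,
fixed_width, lane_shapes dict), with lane_shapes assumed available whenever
either side is PRIORITY_ELWID -- is hypothetical, a sketch of the rule
rather than the real commit:

```python
# hypothetical priority markers standing in for the real constants
PRIORITY_FIXED = "fixed"
PRIORITY_ELWID = "elwid"

def mul_shapes(lhs, rhs):
    # lhs/rhs: (priority, fixed_width, lane_shapes dict or None)
    lp, l_fixed, l_lanes = lhs
    rp, r_fixed, r_lanes = rhs
    if lp == PRIORITY_FIXED and rp == PRIORITY_FIXED:
        # both auto-computed from fixed_width: multiply the full widths
        # and let layout() recompute the element widths
        return (PRIORITY_FIXED, l_fixed * r_fixed, None)
    # otherwise the elwidths take priority: multiply lane-by-lane
    lanes = {k: l_lanes[k] * r_lanes[k] for k in l_lanes}
    return (PRIORITY_ELWID, None, lanes)
```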