Bug 132 - SIMD-like nmigen signal for partitioning
Summary: SIMD-like nmigen signal for partitioning
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU)
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/3d_gpu/architec...
Depends on: 168 176 467 575 708 718 719 720 734 735 167 171 172 173 458 565 707 709
Blocks: 115 713 48
Reported: 2019-08-14 10:40 BST by Luke Kenneth Casson Leighton
Modified: 2023-09-05 05:27 BST (History)
2 users

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 550
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 62
child tasks for budget allocation: 707 709
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2019-08-14 10:40:12 BST
an idea: create a new data type (derived from Value or Signal) in nmigen
that can be dynamically partitioned.

* #NNN     def __invert__(self):             done
* bug #172 def __neg__(self):                done
* bug #172 def __add__(self, other):         done
* bug #718 def __radd__(self, other):        TODO
* bug #172 def __sub__(self, other):         done
* bug #718 def __rsub__(self, other):        TODO
* bug #173 def __lshift__(self, other):      done
* bug #718 def __rlshift__(self, other):     TODO
* bug #173 def __rshift__(self, other):      done
* bug #718 def __rrshift__(self, other):     TODO
* #NNN     def __mul__(self, other):
* #NNN     def __rmul__(self, other):
* #NNN     def __mod__(self, other):
* #NNN     def __rmod__(self, other):
* #NNN     def __floordiv__(self, other):
* #NNN     def __rfloordiv__(self, other):
* #NNN     def __abs__(self):                TODO evaluate
* bug #719 def as_signed(self):              TODO unit test
* bug #719 def as_unsigned(self):            TODO unit test
* bug #176 def bool(self):                   done
* bug #176 def any(self):                    done
* bug #176 def all(self):                    done
* bug #176 def xor(self):                    done
* bug #171 comparison operators, __eq__, __ne__, __lt__ etc.  done
* bug #707 cat                               done
* bug #709 eq (assign)                       done
* bug #458 part, mux, cat, Switch, Repl, Assign etc. in ast.py  DOING
* bug #716 Slice and __getitem__             TODO and document #733
* bug #734 SimdScope and ElwidPartType       TODO
* bug #713 def shape(self):                  TODO
* bug #718 review __r*__ functions for noncommutativity  TODO
* bug #720 static def cast(val)              TODO
* #NNN     def matches(self, *pattern):      TODO unit testing

* bug #735 bugfix shift                      nonessential
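
The shape being aimed for, as a rough sketch (class and method bodies here are illustrative assumptions, not the actual implementation): a class that carries a Signal plus partition information, and overloads the Value operators so that code written against plain Signal keeps working:

    from nmigen import Signal

    class PartitionedSignalSketch:
        def __init__(self, partpoints, width):
            self.partpoints = partpoints  # runtime-splittable bit positions
            self.sig = Signal(width)

        def __and__(self, other):
            # bitwise ops don't communicate between bits, so this is
            # identical to the unpartitioned case (see comment #5)
            return self.sig & other.sig

        def __add__(self, other):
            # arithmetic ops must instead instantiate a partitioned
            # submodule (e.g. the carry-blocking adder in comment #9)
            raise NotImplementedError("needs a PartitionedAdder submodule")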
Comment 1 Luke Kenneth Casson Leighton 2019-08-14 10:43:01 BST
for the dynamic partitioning (SIMD ALUs) we need a way to break a signal at
runtime into parts.  we already have a "multiply" operator that can do that,
however what if we took that to the logical extreme and made *all* operators
partitionable?  add, compare, or, shift, len - everything that's in the
Value class.

the alternative is to complicate the layout of existing pipelines through
explicit creation of dynamically-partitioned "clones".
Comment 3 Luke Kenneth Casson Leighton 2019-08-14 12:03:53 BST
(In reply to Jacob Lifshay from comment #2)
> See also:
> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/
> multiply.py

yehyeh, that's the one.  if we can create operators >=, add, sub, etc.
then there's no need to explicitly confuse the entire design of a pipeline
stage.

what will be needed in the __init__ constructors is, instead of just
"self.a = Signal(width)" it will be:

self.a = PartitionedSignal(self.ctx, pspec, width)

the reason is: the context contains (will contain) not just the SIMD
partitioning information, it will contain the "cancellation mask"
which will be used to cancel *subsets* of the SIMD operation rather
than all of it.

the only thing is: multiply is *not* going to be easy, because it's
multi-stage.  so i think there, we would have to do something quite
complex inside the multiply, like:

class PartitionedSignal:

   def __mul__(self, value):
       if self.ctx.operator == FIRST_STAGE_MULTIPLY:
           # do wallace multiply step 1
       elif self.ctx.operator == SECOND_STAGE_MULTIPLY:
           # do wallace multiply step 2

or something like that.  needs thought.


here's the full list, from Value - these will be the functions that
are needed (and one from Signal as well - PartitionedSignal.like):


    def __invert__(self):
    def __neg__(self):

    def __add__(self, other):
    def __radd__(self, other):
    def __sub__(self, other):
    def __rsub__(self, other):
    def __mul__(self, other):
    def __rmul__(self, other):
    def __mod__(self, other):
    def __rmod__(self, other):
    def __div__(self, other):
    def __rdiv__(self, other):
    def __lshift__(self, other):
    def __rlshift__(self, other):
    def __rshift__(self, other):
    def __rrshift__(self, other):
    def __and__(self, other):
    def __rand__(self, other):
    def __xor__(self, other):
    def __rxor__(self, other):
    def __or__(self, other):
    def __ror__(self, other):

    def __eq__(self, other):
    def __ne__(self, other):
    def __lt__(self, other):
    def __le__(self, other):
    def __gt__(self, other):
    def __ge__(self, other):

    def __len__(self):
    def implies(premise, conclusion):

    def eq(self, value):

    def shape(self):

    def _lhs_signals(self):
    def _rhs_signals(self):
Comment 4 Luke Kenneth Casson Leighton 2019-08-14 12:12:06 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)

> the only thing is: multiply is *not* going to be easy, because it's
> multi-stage.  so i think there, we would have to do something quite
> complex inside the multiply, like:
> 
> class PartitionedSignal:
> 
>    def __mul__(self, value):
>        if self.ctx.operator == FIRST_STAGE_MULTIPLY:
>            # do wallace multiply step 1
>        elif self.ctx.operator == SECOND_STAGE_MULTIPLY:
>            # do wallace multiply step 2
> 
> or something like that.  needs thought.

all the other operators will be single-cycle so are not a problem.
Comment 5 Jacob Lifshay 2019-08-14 12:23:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > See also:
> > https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/
> > multiply.py
> 
> yehyeh, that's the one.  if we can create operators >=, add, sub, etc.
> then there's no need to explicitly confuse the entire design of a pipeline
> stage.
> 
> what will be needed in the __init__ constructors is, instead of just
> "self.a = Signal(width)" it will be:
> 
> self.a = PartitionedSignal(self.ctx, pspec, width)
> 
> the reason is: the context contains (will contain) not just the SIMD
> partitioning information, it will contain the "cancellation mask"
> which will be used to cancel *subsets* of the SIMD operation rather
> than all of it.
> 
> the only thing is: multiply is *not* going to be easy, because it's
> multi-stage.  so i think there, we would have to do something quite
> complex inside the multiply, like:
> 
> class PartitionedSignal:
> 
>    def __mul__(self, value):
>        if self.ctx.operator == FIRST_STAGE_MULTIPLY:
>            # do wallace multiply step 1
>        elif self.ctx.operator == SECOND_STAGE_MULTIPLY:
>            # do wallace multiply step 2
> 
> or something like that.  needs thought.

for the simple ops, I used a fully general partitioning: see the PartitionPoints docs

multiply is complex enough that it only supports partitioning on byte boundaries with the partitions being naturally-aligned power-of-2-sized partitions.

Note that I did also write a partitioned adder and all partitioned bitwise ops (and/or/xor/not, but not shift/rotate/anything that communicates between bits) are identical to the non-partitioned ops.

the partitioned adder should be quite easily updated to an add/subtracter.

I really think we should have a separate control pipeline that handles all the tracking of which instruction is in which partition; it just tells the data pipeline where it's partitioned, and that's it. Otherwise, we will end up having to add more complexity to the already very complex multiplier, which would change it from barely understandable to incomprehensible for whoever didn't actually write it. The multiplier is designed so that it acts like a simple pipeline where every input comes out exactly N stages later, where N is configured by the instantiating code.

To be able to handle mul-add, the multiplier will need to expose the _intermediate_output value, since that's twice as wide as the inputs and contains the full multiply result.
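
For reference, the rough shape of the PartitionPoints idea, as a minimal sketch (the class name and as_mask helper here are assumptions based on this description, not the real code): a mapping from bit-index to a 1-bit "may split here" enable, with a helper to flatten it into a mask when one is needed:

    from nmigen import Signal, Cat, Const

    class PartitionPointsSketch(dict):
        # keys: bit indices; values: 1-bit Signals
        # (1 = partition boundary active at runtime)
        def as_mask(self, width):
            # one bit per possible split position (1..width-1)
            return Cat(*[self.get(i, Const(0, 1)) for i in range(1, width)])

    # usage: a 16-bit value splittable at byte granularity only
    byte_sel = Signal()
    pp = PartitionPointsSketch({8: byte_sel})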
Comment 6 Luke Kenneth Casson Leighton 2019-08-14 16:43:03 BST
(In reply to Jacob Lifshay from comment #5)

> whoever didn't actually write it. the multiplier is designed so that it acts
> like a simple pipeline where every input comes out exactly N stages later,
> where N is configured by the instantiating code.

If the code is not doing single cycle results we cannot use it.

The Pipeline API is based on combinatorial stages, where the "register" part is handled in a data-opaque fashion.

This implies in turn that, just like div_core, parameterisation is needed where the mul code can be told, "please give me a combinatorial object that will deal with stage 0 of multiplication, which *I* will wrap with registers, NOT you".

And, "please now give me another combinatorial object, which will deal with stage 1. *I* will take responsibility for connecting the stage 0 to stage 1".

The div_core has a function which tells the caller of the API exactly how many "stages" are needed.  What was it...

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/core.py;h=cef62a343ad8da05d6c8746a30d8a276a050660f;hb=refs/heads/master#l49

DivPipeCoreConfig.n_stages.

That was it.

This is the architecture that we need: a MulPipeCoreConfig object, an n_stages function, the registers removed from the multiplier code, such that the Pipeline API can then explicitly construct combinatorial building blocks and join them together, just as was done with FPDIV.
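
To make the intent concrete, a minimal sketch of a Stage-API-conforming block, assuming the ispec/ospec/process convention described here (names are illustrative):

    from nmigen import Signal

    class PassThroughStage:
        # ispec/ospec define the data "shape"; process is the (purely
        # combinatorial) transform.  no sync anywhere: the Pipeline API
        # (a ControlBase derivative) owns the registers between stages.
        def ispec(self):
            return Signal(64)
        def ospec(self):
            return Signal(64)
        def process(self, i):
            return i + 1    # example combinatorial transform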

Otherwise, yes, it gets into a hell of a mess.

With the multipath pipeline re-routing code I am writing (trying to write, sigh) it will be possible to construct rerouters that divert 64 bit MUL requests down a separate N-stage pipe, that comes *back* to the main pipe later, where 8/16/32 goes straight down a 1-clock path.

Having that kind of complex logic *inside* the multiplier is (a) pointless, (b) a duplication of the pipeline API, (c) something else, I don't know what :)

Oh, yes: confusing, as it is a sync block, where every other ALU module is comb based (and the Pipeline API takes care of sync/registers).

Having the multiplier do multi-stage is also going to be hell, because whilst the data goes in and comes out a few cycles later, the *control* information (ctx, etc) does not...

Ah, now I see why you suggested a separate control pipe.

Yuck :)

Don't like it, at all.  Not because it wouldn't work, but because it is the only code that would need to do it.

And it is unnecessary.

If the multiplier code has an API that is very similar to that of div_core, the problem is solved.

Can you do that?
Comment 7 Luke Kenneth Casson Leighton 2019-08-14 16:44:49 BST
(In reply to Jacob Lifshay from comment #5)
> 
> Note that I did also write a partitioned adder and all partitioned bitwise
> ops (and/or/xor/not, but not shift/rotate/anything that communicates between
> bits) are identical to the non-partitioned ops.
> 
> the partitioned adder should be quite easily updated to an add/subtracter.

Great.

Also need comparators GE LE GT LT EQ NE. Can you do those as well?
Comment 8 Luke Kenneth Casson Leighton 2019-08-14 17:20:26 BST
https://forum.m-labs.hk/d/20-simd-partitionable-version-of-signal
Comment 9 Jacob Lifshay 2019-08-14 21:18:34 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> https://forum.m-labs.hk/d/20-simd-partitionable-version-of-signal

we just need to give in -- it will be massively invasive because we need to handle partitioning properly for things like multiply-by-small-constant, extract sign bit, and all the other operations where we need to handle things on a per-partition basis and don't just call add, xor, etc.

If we use the partition API where every Signal is converted to PartitionedSignal, which is conceptually a Signal with a PartitionPoints, it allows optimizing for the case where it is statically known at instantiation time that only certain bit-positions will ever be partition boundaries. Using a straight mask won't allow that, and PartitionPoints has a method to generate a mask if needed.

The partitioned adder uses the above partition information to avoid inserting the extra bits used for stopping carry propagation where they are not needed. If PartitionedAdder were just given a mask, it would have to assume every bit could be in a separate partition and would end up with a final adder twice as wide as needed.

We should make a subtype of PartitionPoints (NaturallyAlignedPartitionPoints?) that guarantees that all partitions, at both instantiation and runtime, are power-of-2-sized and naturally aligned (start at a bit index that is a multiple of their size). This is required for the current multiplier implementation, and by requiring N.A.P.P. instead of just PartitionPoints, we allow the divider and other ALUs to have a way to check at instantiation time that the partitions only partition down to the byte level and that they don't ever have to handle a 24- or 40-bit partition, for example.
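
A minimal sketch of that carry-blocking technique (simplified from this description; the real PartitionedAdder is more sophisticated): one gap bit is inserted into the adder at each possible partition point, and the pair of gap-bit values decides whether the carry crosses the boundary:

    from nmigen import Elaboratable, Module, Signal, Cat, Const

    class PartitionedAdderSketch(Elaboratable):
        def __init__(self, width, points):
            self.points = list(points)      # e.g. [8, 16, 24] for 32-bit
            self.parts = Signal(len(self.points))  # 1 = boundary active
            self.a = Signal(width)
            self.b = Signal(width)
            self.o = Signal(width)

        def elaborate(self, platform):
            m = Module()
            bounds = [0] + self.points + [len(self.a)]
            exp_a, exp_b, out = [], [], []
            pos = 0                         # position in the expanded adder
            for i in range(len(bounds) - 1):
                lo, hi = bounds[i], bounds[i + 1]
                exp_a.append(self.a[lo:hi])
                exp_b.append(self.b[lo:hi])
                out.append((pos, pos + hi - lo))
                pos += hi - lo
                if i < len(self.points):
                    # gap bit pair: (1, 0) lets the carry ripple through
                    # unchanged; (0, 0) forces the carry-out at this
                    # boundary to zero, splitting the addition at runtime
                    exp_a.append(~self.parts[i])
                    exp_b.append(Const(0, 1))
                    pos += 1
            full = Signal(pos)
            m.d.comb += full.eq(Cat(*exp_a) + Cat(*exp_b))
            m.d.comb += self.o.eq(Cat(*(full[lo:hi] for (lo, hi) in out)))
            return m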
Comment 10 Jacob Lifshay 2019-08-14 21:27:55 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #5)
> > 
> > Note that I did also write a partitioned adder and all partitioned bitwise
> > ops (and/or/xor/not, but not shift/rotate/anything that communicates between
> > bits) are identical to the non-partitioned ops.
> > 
> > the partitioned adder should be quite easily updated to an add/subtracter.
> 
> Great.
> 
> Also need comparators GE LE GT LT EQ NE. Can you do those as well?

Should be easy: they should probably be implemented by adding carry-in, carry-out, overflow-out and is-zero-out support to PartitionedAdder, and implementing the compare ops in terms of subtract-and-check-flags.

Note that branching on the compare result won't work (so bool() should raise an exception) since there are multiple results.

Mux should still work if the compare ops return all ones for true and all zeros for false, since then it's equivalent to a bitwise mux.
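
A minimal sketch of that convention, assuming statically-known partition boundaries for brevity (the function name is illustrative): each partition's 1-bit compare result is replicated across the whole partition, so a plain bitwise Mux works unchanged:

    from nmigen import Signal, Cat, Repl

    def eq_partitioned(m, a, b, bounds):
        # bounds: e.g. [0, 8, 16, 24, 32] for four byte-sized partitions
        out = Signal(bounds[-1])
        results = []
        for lo, hi in zip(bounds, bounds[1:]):
            eq = Signal(name="eq_%d_%d" % (lo, hi))
            m.d.comb += eq.eq(a[lo:hi] == b[lo:hi])
            # broadcast the 1-bit result across the partition:
            # all-ones for true, all-zeros for false
            results.append(Repl(eq, hi - lo))
        m.d.comb += out.eq(Cat(*results))
        return out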
Comment 11 Jacob Lifshay 2019-08-14 21:34:56 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> 
> > whoever didn't actually write it. the multiplier is designed so that it acts
> > like a simple pipeline where every input comes out exactly N stages later,
> > where N is configured by the instantiating code.
> 
> If the code is not doing single cycle results we cannot use it.

yes we can, we just need to tell the pipeline API "this takes 3 stages instead of one, so insert extra registers on the control signals"

> If the multiplier code has an API that is very similar to that of div_core,
> the problem is solved.
> 
> Can you do that?

Yes, but it will be invasive.
Comment 12 Luke Kenneth Casson Leighton 2019-08-14 22:17:13 BST
(In reply to Jacob Lifshay from comment #11)

> > 
> > If the code is not doing single cycle results we cannot use it.
> 
> yes we can, we just need to tell the pipeline API "this takes 3 stages
> instead of one, so insert extra registers on the control signals"

Which still does not take care of cancellation.

The multiplier code will now need to implement cancellation, which is a global mask (not a register-propagated signal).

Have a look at MaskCancellable.

The predicate mask (which is register based) and the stop mask (which is global per pipe) are opaque to the (combinatorial) Stage API.
Comment 13 Luke Kenneth Casson Leighton 2019-08-14 22:21:52 BST
I am still thinking through the implications of passing the predicate and partition spec into the PartitionedSignal thing.

Absolutely no way a full bit mask would be passed in. Just enough to specify the byte level partitioning.

One pain in the neck: the class will need a cat function and a mux function.

In a degenerate version of PartitionedSignal these would call Cat and Mux respectively.

Going to need some examples.
Comment 14 Jacob Lifshay 2019-08-14 22:33:29 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #11)
> 
> > > 
> > > If the code is not doing single cycle results we cannot use it.
> > 
> > yes we can, we just need to tell the pipeline API "this takes 3 stages
> > instead of one, so insert extra registers on the control signals"
> 
> Which still does not take care of cancellation.

it's a simple data pipe, if a particular element is canceled, that pipeline slot will just be empty, just like divpipecore. the control pipeline can keep track of which elements have valid data and which have been canceled.

> 
> The multiplier code will now need to implement cancellation, which is a
> global mask (not a register-propagated signal).
the surrounding control hardware will just set the associated control signals such that the canceled/unused data elements are ignored.

> 
> Have a look at MaskCancellable.
> 
> The predicate mask (which is register based) and the stop mask (which is
> global per pipe) are opaque to the (combinatorial) Stage API.

that's just fine. think of the multiplier as identical to a combinatorial stage except it takes more than one clock cycle for the data to get through.
Comment 15 Jacob Lifshay 2019-08-14 22:37:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> Going to need some examples.

see https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/test/test_multiply.py#L65

the second constructor arg is converted to a PartitionPoints instance internally.
Comment 16 Jacob Lifshay 2019-08-14 22:39:41 BST
(In reply to Jacob Lifshay from comment #14)

> it's a simple data pipe, if a particular element is canceled, that pipeline
> slot will just be empty, just like divpipecore. the control pipeline can
> keep track of which elements have valid data and which have been canceled.

note that the data pipe doesn't need to know which elements are empty: empty does not mean zeroed.
Comment 17 Luke Kenneth Casson Leighton 2019-08-14 23:02:52 BST
(In reply to Jacob Lifshay from comment #14)
> it's a simple data pipe, if a particular element is canceled, that pipeline
> slot will just be empty, just like divpipecore. the control pipeline can
> keep track of which elements have valid data and which have been canceled.
> 
> > 
> > The multiplier code will now need to implement cancellation, which is a
> > global mask (not a register-propagated signal).
> the surrounding control hardware will just set the associated control
> signals such that the canceled/unused data elements are ignored.
> 


Which has the knock-on ramification of underutilised hardware (stages that run empty), which either decreases the IPC count or requires more RSs to compensate.

Each RS comes with a penalty of thousands of gates, as it is an entire extra row in the FU Regs Dependency Matrix.


> > 
> > Have a look at MaskCancellable.
> > 
> > The predicate mask (which is register based) and the stop mask (which is
> > global per pipe) are opaque to the (combinatorial) Stage API.
> 
> that's just fine. think of the multiplier as identical to a combinatorial
> stage except it takes more than one clock cycle for the data to get through.

Not going to fly.

Even without the PartitionedSignal concept.
Comment 18 Jacob Lifshay 2019-08-14 23:18:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> Which has the knock-on ramification of underutilised hardware (stages that
> run empty), which either decreases the IPC count or requires more RSs to
> compensate.

It decreases IPC, which is what happens any time an instruction is canceled: the partially-completed instruction used hardware (before it was known that it was to be canceled) that could have been used to run other instructions had it known. Unfortunately, computers don't have a time machine to allow them to know ahead of time which instructions end up being canceled.

Adding more RSs is not a solution, even ignoring the extra scheduler hardware, because each instruction has to finish each stage in sequence so that the signals used by succeeding stages are calculated.

So, even if an element is canceled in a later stage, there aren't any instructions that could fill the unused slot, because none of them have progressed far enough to be able to fill it.

Basically, once the instructions are assigned to slots in the first stage, you're stuck: there isn't (normally) a way to insert more instructions later to fill unused slots.

Note that the slots I'm referring to are element time slots.
Comment 19 Luke Kenneth Casson Leighton 2019-08-14 23:40:33 BST
(In reply to Jacob Lifshay from comment #18)
> It decreases IPC, which is what happens any time an instruction is canceled:
> the partially-completed instruction used hardware (before it was known that
> it was to be canceled) that could have been used to run other instructions
> had it known.

That is correct... however by leaving the slots empty there is *yet more* penalty added.

The reason is that the stop mask is a unary representation of the binary Reservation Station Index.

If the index cannot be cleared because there are three extra clock delays until it clears out the end of the pipeline, that is three Reservation Stations that cannot be used.

When there are no Reservation Stations available the ENTIRE instruction issue engine has to freeze, solid.

So there are real, severe consequences.

And it is not conforming to a consistent API.

Or it hugely complicates the API (for no good reason).

The multiply code needs the same code structure as div_core.
Comment 20 Luke Kenneth Casson Leighton 2019-08-14 23:43:58 BST
(In reply to Jacob Lifshay from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> > Going to need some examples.
> 
> see
> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/test/
> test_multiply.py#L65
> 
> the second constructor arg is converted to a PartitionPoints instance
> internally.

Ok, can those two signals (nibble, byte) be made runtime selectors? Or are they, already?

So only two bits perform the partitioning, not 64 bits.

Will take a closer look at the mul code later today after some rest.
Comment 21 Jacob Lifshay 2019-08-15 03:41:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Jacob Lifshay from comment #15)
> > (In reply to Luke Kenneth Casson Leighton from comment #13)
> > > Going to need some examples.
> > 
> > see
> > https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/test/
> > test_multiply.py#L65
> > 
> > the second constructor arg is converted to a PartitionPoints instance
> > internally.
> 
> Ok, can those two signals (nibble, byte) be made runtime selectors? Or are
> they, already?
They are already.

> So only two bits perform the partitioning, not 64 bits.

actually, in that example, 3 bits perform the partitioning, one is just defined to be the OR of the other two.

Any single-bit nmigen Value will work.
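
For illustration, a sketch of what those runtime selectors might look like for a 16-bit value partitionable on nibble boundaries (an assumed arrangement following this description, not the actual test code): two control signals generate three partition-point bits, the middle one being the OR of the other two:

    from nmigen import Signal

    nibble_sel = Signal()   # 1 = split the 16-bit value into 4-bit lanes
    byte_sel = Signal()     # 1 = split the 16-bit value into 8-bit lanes
    partition_points = {
        4:  nibble_sel,                 # only a nibble split opens this
        8:  nibble_sel | byte_sel,      # the OR'd middle point
        12: nibble_sel,
    }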
Comment 22 Jacob Lifshay 2019-08-15 04:10:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #19)
> That is correct... however by leaving the slots empty there is *yet more*
> penalty added.
> 
> The reason is that the stop mask is a unary representation of the binary
> Reservation Station Index.
> 
> If the index cannot be cleared because there are three extra clock delays
> until it clears out the end of the pipeline, that is three Reservation
> Stations that cannot be used.

The index can be cleared right away and the Reservation Stations reused. All that is needed is to have a muxid of 0 (or some other appropriate value or a muxid_valid signal) mean that there is no instruction there so the output should be ignored.

Since the Reservation Stations can be reused right away, the only part that is underutilized is the canceled pipeline slots, which is unavoidable, since those slots can't be reused due to instructions needing to go through the pipeline stages in sequence.

> The multiply code needs the same code structure as div_core.

I agree, however, I think that the div_core is unnecessarily complicated by all the extra wrapping for each stage and that it should be more like the multiplier and have the data registers internally.

To allow generic registers, the constructor could have a function passed in that creates a register for a particular stage. That would allow stuff like passing in a function that creates registers for every other stage (to handle 2 computational stages per pipeline stage) or handles creating the muxes and registers needed for inserting registers only when the processor is in high-speed mode (as discussed earlier).

The whole idea is that the computation code should be simple and easy to read without all the control signals being required to be in the same Module.

That might also have benefits if the synthesizer lays out logic according to which module it's in, allowing all the control pipeline to be moved away from the data pipeline when the data pipeline doesn't need to know what the control pipeline is doing at every stage.

Note that the multiplier won't specify how many pipeline stages it can take, since that number can vary anywhere from 1 (entirely combinatorial) to about 18 and it is totally up to the instantiating module.
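
A minimal sketch of that register-factory idea (names entirely hypothetical): the instantiating code decides, per stage, whether a register is inserted:

    from nmigen import Signal

    def reg_every_other_stage(m, stage_index, sig):
        # register only after odd-numbered stages: two computational
        # stages per pipeline register
        if stage_index % 2 == 0:
            return sig      # combinatorial pass-through
        reg = Signal.like(sig, name="stage%d_reg" % stage_index)
        m.d.sync += reg.eq(sig)
        return reg

    # a multiplier built this way would call the factory between its
    # internal stages: value = make_reg(m, i, value)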
Comment 23 Luke Kenneth Casson Leighton 2019-08-15 06:55:24 BST
(In reply to Jacob Lifshay from comment #21)

> actually, in that example, 3 bits perform the partitioning, one is just
> defined to be the OR of the other two.
> 
> Any single-bit nmigen Value will work.

I took a quick look late last night, it looks great.  Actually, the partitioning mask on its own is perfect, and the vectorisation system would perform the decode/allocation.

Remember that to do 8-bit vectors without a read-modify-write at the 64-bit regfile level we need *byte*-level WEn signals.

That means EIGHT WEn signals to write a 64 bit register value.

To avoid overloading things, the regfile is split HI/LO 32-bit, odd/even: 4 banks.

So most FP32 vector operations will be only 4 WEn wires wide.

So the fact that the PartitionPoints can be at the byte level is perfect.

I am still uncertain whether - or how - to do 64 bit "sharing" of ALUs across 2 Reservation Stations.

The PartitionedSignal idea, if set at the 64 bit level, might actually break the planned 4-bank 32-bit HI/LO odd/even regfile idea.

It might still work by having two RSs per PartitionedALU.  32 bit input operands would split the PartitionPoint down the middle, and as the two halves would be independent, there is no timing issue.

Heck it might even work to have *byte* level RSs, as originally envisaged back in... November / December 2018.

(I scrapped that idea because the cost of the FURegs DM gate count is far too high. 32 bit is just barely tolerable).

The Vector issue is thus responsible for keeping an eye on resource allocation.  A pair of "unused" 16 bit partitions can, with the predicate masks, be allocated an operation.

This however needs a lot of thought. The DMs were initially supposed to help at this fine grain level, one DM to cover 32 bit and another to cover 8 bit, then pairs of 32 ReservationStations collaborate to cover 64 bit and pairs of 8 bit RSs collaborate to cover 16 bit.

It still *might* be ok to have a tiny (narrow) 8 bit FURegs Dependency Matrix. Only 8 entries wide.  Have to see.
Comment 24 Luke Kenneth Casson Leighton 2019-08-15 08:17:10 BST
(In reply to Jacob Lifshay from comment #22)
> > 
> > If the index cannot be cleared because there are three extra clock delays
> > until it clears out the end of the pipeline, that is three Reservation
> > Stations that cannot be used.
> 
> The index can be cleared right away and the Reservation Stations reused. All
> that is needed is to have a muxid of 0 (or some other appropriate value or a
> muxid_valid signal) mean that there is no instruction there so the output
> should be ignored.
>

The unary version of the mask is exactly that "muxid_valid" signal, and no, it cannot be used in the way that you envisage, because the muxid and the masks travel in sync with the data.

They do not "emerge" from the pipeline the instant that the [global] stop mask signal is raised.

What would be needed instead would be to use the scheme deployed by IIT Madras.  They had a muxid that was HIGHER than the number of RSs, and did "counting". Each issued operation got a unique index.

The maximum count is twice the length of the pipeline and is a rolling count.

By having a count that is twice the pipeline depth it becomes possible to distinguish cancelled operations from uncancelled ones.

The problem is: it's a *binary* system. Adding multi issue on top of a *binary* counter is hell.

We are critically relying on multi issue capability.

Multi issue on a unary system is, believe it or not, trivial.

Mitch explained that all you do is transitively accumulate read and write dependency hazards - multiple bits in the unary masks/regmaps - and the DMs take care of the rest.

Incredibly, that's *it*!  We get to take on Intel and ARM, thanks to Mitch's input.

If we deploy binary counters, we literally destroy the underpinnings of the entire architecture.


> Since the Reservation Stations can be reused right away,

No, they can't. Not without having the cancellation logic and the global stop mask.

I did think of a solution.

The Pipeline API was designed to be sync *and* comb based.

The Stage API is DEFINITELY intended to be combinatorial based ONLY.

The Pipeline API was NEVER designed to allow Stages to be sync based, and I strongly disagree that it should be extended to permit such.

That said, there is a way it can be done. Bear with me, because whilst it can be done, there is still no point doing it.

The way it can be done without completely trashing the Pipeline API and the entire design strategy behind it is to create "fake" data that doesn't actually get set.

Pipeline Stage 1 (a ControlBase derivative) would be given an object that we KNEW in advance was hardwired to pass its data to Pipeline Stage 3.

The stage instance conforms to the Stage API, its ospec however returns an EMPTY OBJECT that has an eq function that does ABSOLUTELY NOTHING.

You can probably see where that is going.

So whilst the ControlBase instances (MaskCancellables) handle the control logic, the data goes through the pipeline and emerges back at the 3rd stage.

There are several problems with this approach.

(1) Deploying the FIFO based Pipeline handlers is out of the question.

(2) With the data going through an "unmanaged" line, anything that requires stalling or other delay mechanisms (ready being DEASSERTED) is fundamentally in trouble.

(3) early-in and early-out are also eliminated from consideration.

To "fix" these problems and regain the functionality, the ENTIRE PIPELINE API MUST BE DUPLICATED WITHIN THE MULTIPLY BLOCK.

A duplicated early-in and early-out system must be added to the multiplier code.

A duplicated FIFO system must be added to the multiplier code.

A ready/valid signalling system must be added to the multiplier code.

The simplest and quickest way to achieve that: conform to the *already written* Pipeline and Stage APIs.


> the only part that
> is underutilized is the canceled pipeline slots, which is unavoidable, since
> those slots can't be reused due to instructions needing to go through the
> pipeline stages in sequence.
> 
> > The multiply code needs the same code structure as div_core.
> 
> I agree, however, I think that the div_core is unnecessarily complicated by
> all the extra wrapping for each stage and that it should be more like the
> multiplier and have the data registers internally.

No. The suggestion is very alarming, for several reasons.

Firstly, the graphviz images are *really* clean, right now. I highly recommend taking a look at eg the fprsqrt16 ILANG, walking through from "show top", then alu, then pipediv3.

At pipediv3 you see the combinatorial chain of divpipecorecalculate5 (worth also looking at), divpipecorefinal, and my ineptly named "div1" which I will fix some time.

Above that you have a suite of "IO Management" signals that handle ready/valid and mask/stop.

Those "handle" the registers and in NO WAY interact with the actual data.

When we get to doing the INT/FP thing, div1 will be on a bypass path, involving earlyin/out.

(2) carrying on from the above, div1 cleanly handles the INT to FP conversion. div0 likewise.

These modules are NOT superfluous to requirements.

They provide CLEAN and SIMPLE subdivision that makes the design easy and clear to understand.

(3) The subdivision into simple modules has a knock on simplification effect on the layout / routing.

With wires being specifically routed through blocks that handle them, the job of doing the actual layout becomes much easier.

Running a global autorouter requires insane computational resources and produces a hellish rat's nest that is indistinguishable from TV "snow".

(4) Major structural redesigns, whilst they are something that is going to be needed, *really* should not be attempted unless absolutely necessary.

"Plan for 3 revisions because that's what you will be doing anyway".

1st version: you don't know what you want, and you don't know how to do it.

2nd version: by previously implementing something that *fails* to have the features you want, you now implement what you *do* want, but still have no idea how to do it.

3rd version: you already knew what you wanted, but by failing to know how to do it in the 2nd version, you now *know* how to do it better, so now you do it.

Agile is supposed to blur those lines.

I am barely keeping track, and, more than that, doing major redesigns - moving to a 2nd or 3rd revision - is going to cause us to fail to meet the milestones within the budget that we have been given.

(5) We spent 3 months on discussion of the Pipeline API: it is now just about within the threshold of understandability. Adding mask/stop is making a huge mess of the code already, and I *REALLY* do not want to complicate it further with extra "features" in what is a 1st-revision concept.

*next* funding round.
Not this one.
Please.


> To allow generic registers, the constructor could have a function passed in
> that creates a register for a particular stage. That would allow stuff like
> passing in a function that creates registers for every other stage (to
> handle 2 computational stages per pipeline stage) or handles creating the
> muxes and registers needed for inserting registers only when the processor
> is in high-speed mode (as discussed earlier).

I loved that idea, however liking it and implementing it are two radically different things.

It needs to go on the "enhancements" list, shelved for another funding round.

Unfortunately.  That said, it *might* be possible to use the Pipeline/Stage API as-is, with very little in the way of modification.

I just do not think it is a wise thing to explore _right now_ when we have limited time, people and funds.

> The whole idea is that the computation code should be simple and easy to
> read without all the control signals being required to be in the same Module.

Yes! This is what the Pipeline API and Stage API do! *Please*, look at the graphviz diagram for divpipecalculate5 and you will see that it is precisely and exactly that!

> That might also have benefits if the synthesizer lays out logic according to
> which module it's in, allowing all the control pipeline to be moved away
> from the data pipeline when the data pipeline doesn't need to know what the
> control pipeline is doing at every stage.

That is *precisely* and *exactly* what we have *right now* and the multiply unit *does not conform to it*.

> Note that the multiplier won't specify how many pipeline stages it can take,
> since that number can vary anywhere from 1 (entirely combinatorial) to about
> 18 and it is totally up to the instantiating module.

We cannot do early-in/early-out, we cannot add stalls (for in-order implementors); I even had an FSM system working at one point, *all using the Stage API and reusing the exact same combinatorial modules*.

All of those options and many more are DESTROYED by not having the multiplier code conform to the Stage API.

Please understand that it is critical that the code conform to the Stage API, and that means that the use of sync must be prohibited, with sync *only* permitted to be deployed by a class that derives from ControlBase.

Unit tests doing sync are absolutely fine. I noticed in test_core.py that the code you put together is a near-duplication of ControlBase.connect, and that's absolutely fine.

However, for production purposes, the consequences of not conforming to the Pipeline API are really severely limiting, cause far more complications than appear at first glance, and could jeopardise our chances of success.

I've tried to outline as many of those things as I can, above. But please, *look* at the graphviz diagrams, it is critically important that you understand how the PipelineAPI works.

I tried to encourage you to do the FPDIV code so that you would begin to understand how it works: you did not take that opportunity, and so do not yet fully understand it or accept it.  This is causing problems which we do not need.
Comment 25 Luke Kenneth Casson Leighton 2019-08-15 08:56:30 BST
Btw one very important thing to distinguish/discern, the PartitionedSignal idea is not for use *in* the multiply unit, it's the other way round: PartitionedSignal.__mul__ would *use* the multiply code.
Comment 26 Luke Kenneth Casson Leighton 2019-08-15 11:03:27 BST
So the takeaway summary is: anything that violates the Stage API (uses sync) can no longer use any of the (many) derivatives from ControlBase.

The multiply unit is not conforming to the API and that severely curtails and restricts its usefulness, and *will* require *massive* duplication of already existing proven and tested pipeline classes.

I will do some diagrams and a video later, to show how MaskCancellable and MultiIn and MultiOut are the core basis for doing the entire ALU: early-in, early-out, loopbacking, predication, speculative execution, everything.

Also if we have partitionable Signals, we may even be able to *split* early in and early out partial results (route part of a mask down one path and the rest down another) then recombine them back in a MultiIn.

Have to think that one through carefully though. It may be the case that we have to keep partitioned data together, no matter what.
Comment 27 Luke Kenneth Casson Leighton 2019-08-17 05:34:09 BST
I renamed the FPDIV classes and the submodule, calling them FPDivPreFPAdjust and FPDivPostToFPFormat.

Also the transparent latch idea is apparently something that has been done before, successfully, and was quite easy to add, see bug #133.

The Pipeline / Stage API supports FSMs, combinatorial chaining, actual pipeline registers, now also flexible (transparent) dynamic registers, and anything else that a *future* user of this API can envisage.

Use of sync terminates any and all possibility of using this extremely flexible, straightforward API, and attempting to do so requires special cases and/or massive complications.

We do not have time for that, and it is not desirable anyway.
Comment 28 Luke Kenneth Casson Leighton 2019-08-17 06:45:22 BST
Looking at test_multiply, it is quite hard to track things; it would appear that the argument "register_levels" is the key.

Is it the case that, if register_levels is set to contain one single value, code to create one "stage" of a pipeline will be created?

Is it then the case that if an EXTERNAL for loop was put together, with each iteration of that loop creating an instance of a Mul_8_16_32_64 with one and only one "level" in each, and these "levels" were then manually connected together, that a full pipelined Mul would be created?

If so, there is very little actual work needed to be done to get this code to conform to the Stage API.

In addition, the graphviz diagrams would become much clearer, as each cycle would be under its own module.

The last time I checked the graphviz it took xdot about 30 seconds @ 100% CPU to process (a bad sign), the framerate on display was about 3 seconds per frame (not 3 frames per second), and it was impossible to make out the gates at full screen, there were just far too many connections and blocks.

So it has to be split up anyway.

A parameter can be added to request that sync *not* be used, with an assert to check that len(register_levels) == 1 if that is the case.
Comment 29 Luke Kenneth Casson Leighton 2019-08-24 20:05:10 BST
I managed to rework the multiplier code, will write a test tomorrow.

Whilst the ALU parts of this experiment are easy enough, control and decision-making are not. Mux can be made a class member of the partitioned signal; however m.If and m.Case cannot.

Fortunately I think Case is never used except in things like global operands; however m.If and Else are used all over the place.

I haven't yet fully thought through how m.If, when made partitionable, would even work.

It would need to hide a set of *adaptive* boolean tests, where the objects within the context would also need the same partitioning.

It is actually quite fascinating.
Comment 30 Jacob Lifshay 2019-08-25 07:24:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)
> I managed to rework the multiplier code, will write a test tomorrow.
> 
> Whilst the ALU parts of this experiment are easy enough, control and
> decision making is not. Mux can be made a class member of the partitioned
> signal, however m.If and m.Case cannot.
> 
> Fortunately i think Case is never used except in things like global
> operands, however m.If and Else are used all over the place.
> 
> I haven't yet fully thought through how m.If when made partitionable would
> even work.
> 
> It would need to hide a set of *adaptive* boolean tests, where the objects
> within the context would also need the same partitioning.
> 
> It is actually quite fascinating.

I actually think it may be better to refactor the code to use Mux instead of m.If and m.Else:

1. we won't have to modify nmigen's Module

2. it appears as though we have been subconsciously thinking that m.If somehow makes the logic inside the m.If less expensive when the condition is dynamically disabled. Having an explicit Mux may fix that.

3. we will need to review and, if needed, rewrite code anyway when converting to partitioned form. rewriting to Mux may help us avoid missing any code.

for part 3, we will need PartitionedValue to not be derived from nmigen's Value to allow nmigen to catch mixups.

It may also be a good idea to use a slightly different name for Mux since they are different operations, how about MuxP for Mux-partitioned? that would allow using a free function/class instead of a member.
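
A minimal sketch of what MuxP could look like under the all-ones/all-zeros convention from comment #10 (the function body is an assumption, not an agreed design):

    def MuxP(sel, val1, val0):
        # purely bitwise: no information crosses partition boundaries.
        # relies on sel being all-ones / all-zeros within each partition
        return (sel & val1) | (~sel & val0)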
Comment 31 Luke Kenneth Casson Leighton 2019-08-25 08:41:24 BST
(In reply to Jacob Lifshay from comment #30)

> I actually think it may be better to refactor the code to use Mux instead of
> m.If and m.Else:

interesting. why? because, ultimately, yosys "proc" and "opt" end up replacing
m.If/Else with Muxes anyway.

> 1. we won't have to modify nmigen's Module

i was thinking along the lines of a derived class (or a wrapper that takes
Module as an argument).

> 2. it appears as though we have been subconsciously thinking that m.If
> somehow makes the logic inside the m.If less expensive when the condition is
> dynamically disabled. Having an explicit Mux may fix that.

i wasn't thinking :)

here's a simple example:

-        with m.If(self.i.product[-1]):
-            comb += p.eq(self.i.product)
-        with m.Else():
-            # get 1 bit of extra accuracy if the mantissa top bit is zero
-            comb += p.eq(self.i.product<<1)
-            comb += self.o.z.e.eq(self.i.z.e-1)
+        msb = Signal(reset_less=True)
+        e = self.o.z.e
+        comb += msb.eq(self.i.product[-1])
+        comb += p.eq(Mux(msb, self.i.product, self.i.product<<1))
+        comb += e.eq(Mux(msb, self.i.z.e, self.i.z.e-1))

that would do alright, wouldn't it? 

my only concern is: where logically, the actions used to be "grouped"
together (the "If" things are grouped together), now they're separated.

> 3. we will need to review and, if needed, rewrite code anyway when
> converting to partitioned form. rewriting to Mux may help us avoid missing
> any code.

yes, as "grep Mux" would provide the identification-points.

for example, the msb above now clearly needs to be a "multi-partitioned"
Signal.

> for part 3, we will need PartitionedValue to not be derived from nmigen's
> Value to allow nmigen to catch mixups.

hmmm hm hm hm... maaybe.

> It may also be a good idea to use a slightly different name for Mux since
> they are different operations, how about MuxP for Mux-partitioned? that
> would allow using a free function/class instead of a member.

yehyeh, that's a good idea.

sigh this is a big change, which can't be tested (live) without actually
committing entirely to it.

can you think of a way to gain some confidence that this is going to work,
*before* going ahead with the whole lot?
Comment 32 Jacob Lifshay 2019-08-25 09:08:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #31)
> (In reply to Jacob Lifshay from comment #30)
> 
> > I actually think it may be better to refactor the code to use Mux instead of
> > m.If and m.Else:
> 
> interesting. why? because, ultimately, yosys "proc" and "opt" end up
> replacing
> m.If/Else with Muxes anyway.
> 
> > 1. we won't have to modify nmigen's Module
> 
> i was thinking along the lines of a derived class (or a wrapper that takes
> Module as an argument).

I think wrapping or deriving from Module may make it unnecessarily extremely complex.

> 
> > 2. it appears as though we have been subconsciously thinking that m.If
> > somehow makes the logic inside the m.If less expensive when the condition is
> > dynamically disabled. Having an explicit Mux may fix that.
> 
> i wasn't thinking :)
> 
> here's a simple example:
> 
> -        with m.If(self.i.product[-1]):
> -            comb += p.eq(self.i.product)
> -        with m.Else():
> -            # get 1 bit of extra accuracy if the mantissa top bit is zero
> -            comb += p.eq(self.i.product<<1)
> -            comb += self.o.z.e.eq(self.i.z.e-1)
> +        msb = Signal(reset_less=True)
> +        e = self.o.z.e
> +        comb += msb.eq(self.i.product[-1])
> +        comb += p.eq(Mux(msb, self.i.product, self.i.product<<1))
> +        comb += e.eq(Mux(msb, self.i.z.e, self.i.z.e-1))
> 
> that would do alright, wouldn't it? 

that looks fine to me.

> 
> my only concern is: where logically, the actions used to be "grouped"
> together (the "If" things are grouped together), now they're separated.

I think that most of the places where there's a large amount of code in a If block (can't be replaced with a few Muxs), the code should have probably been refactored anyway. the rest of the cases can be replaced with a simple mux signal/variable like you did above.

> 
> > 3. we will need to review and, if needed, rewrite code anyway when
> > converting to partitioned form. rewriting to Mux may help us avoid missing
> > any code.
> 
> yes, as "grep Mux" would provide the identification-points.

grep Mux won't provide all the identification points since there is probably some code that would need to be refactored that doesn't contain Mux or If.

> 
> for example, the msb above now clearly needs to be a "multi-partitioned"
> Signal.
> 
> > for part 3, we will need PartitionedValue to not be derived from nmigen's
> > Value to allow nmigen to catch mixups.

another reason to have PartitionedValue and Value separate is that we need scalar (non-partitioned) as well as simd (partitioned) values in the same code.

> 
> hmmm hm hm hm... maaybe.
> 
> > It may also be a good idea to use a slightly different name for Mux since
> > they are different operations, how about MuxP for Mux-partitioned? that
> > would allow using a free function/class instead of a member.
> 
> yehyeh, that's a good idea.
> 
> sigh this is a big change, which can't be tested (live) without actually
> committing entirely to it.
> 
> can you think of a way to gain some confidence that this is going to work,
> *before* going ahead with the whole lot?

try it on divcore, which is relatively small but is complex enough to fully utilise the operations we will need on partitioned signals.

the divcore tests can be adapted to work with partitioned signals relatively simply.
Comment 33 Jacob Lifshay 2019-08-25 09:12:45 BST
(In reply to Jacob Lifshay from comment #32)
> (In reply to Luke Kenneth Casson Leighton from comment #31)
> > 
> > can you think of a way to gain some confidence that this is going to work,
> > *before* going ahead with the whole lot?
> 
> try it on divcore, which is relatively small but is complex enough to fully
> utilise the operations we will need on partitioned signals.
> 
> the divcore tests can be adapted to work with partitioned signals relatively
> simply.

maybe copy divcore and divcore tests to a new directory and work on it there?

that way we can keep the working code separate in case we fail the first time.

if making a PartitionedValue class doesn't work, we could fall back on the method used in the partitioned multiplier -- write the code explicitly and just accept that it will be complex.
Comment 34 Luke Kenneth Casson Leighton 2019-08-25 09:38:37 BST
(In reply to Jacob Lifshay from comment #32)

> > here's a simple example:
> > 
> > -        with m.If(self.i.product[-1]):
> > -            comb += p.eq(self.i.product)
> > -        with m.Else():
> > -            # get 1 bit of extra accuracy if the mantissa top bit is zero
> > -            comb += p.eq(self.i.product<<1)
> > -            comb += self.o.z.e.eq(self.i.z.e-1)
> > +        msb = Signal(reset_less=True)
> > +        e = self.o.z.e
> > +        comb += msb.eq(self.i.product[-1])
> > +        comb += p.eq(Mux(msb, self.i.product, self.i.product<<1))
> > +        comb += e.eq(Mux(msb, self.i.z.e, self.i.z.e-1))
> > 
> > that would do alright, wouldn't it? 
> 
> that looks fine to me.

yeh.  oh, except i realised: arrays.  the next bit of the code is as follows:

        mw = self.o.z.m_width
        comb += [
            self.o.z.m.eq(p[mw+2:]),            # mantissa
            self.o.of.m0.eq(p[mw+2]),           # copy of LSB
            self.o.of.guard.eq(p[mw+1]),        # guard
            self.o.of.round_bit.eq(p[mw]),      # round
            self.o.of.sticky.eq(p[0:mw].bool()) # sticky
        ]

the assumption here is that the mantissa width is a fixed parameter.

we'll need a P-variant of "Part" which takes a *list* of partition-selectable
widths/offsets - is that making any sense?

this is going to be quite intrusive, however, ultimately, i think if we
tried to write an "explicit" partitioned version of this code (with for-loops
etc.) it would quickly become not just unreadable but also apparent that
the process is extremely regular and monotonous.
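
A minimal sketch of what such a P-variant of "Part" might look like (hypothetical name PartP; a real version would take the partition context into account rather than an explicit selector):

    from nmigen import Mux

    def PartP(value, cases, sel):
        # cases: one (offset, width) pair per element-width mode;
        # sel: which mode is active (a small runtime Signal)
        result = None
        for i, (offset, width) in enumerate(cases):
            piece = value[offset:offset + width]
            result = piece if result is None else Mux(sel == i, piece, result)
        return result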
Comment 35 Jacob Lifshay 2019-08-25 09:47:05 BST
(In reply to Luke Kenneth Casson Leighton from comment #34)
> (In reply to Jacob Lifshay from comment #32)
> 
> > > here's a simple example:
> > > 
> > > -        with m.If(self.i.product[-1]):
> > > -            comb += p.eq(self.i.product)
> > > -        with m.Else():
> > > -            # get 1 bit of extra accuracy if the mantissa top bit is zero
> > > -            comb += p.eq(self.i.product<<1)
> > > -            comb += self.o.z.e.eq(self.i.z.e-1)
> > > +        msb = Signal(reset_less=True)
> > > +        e = self.o.z.e
> > > +        comb += msb.eq(self.i.product[-1])
> > > +        comb += p.eq(Mux(msb, self.i.product, self.i.product<<1))
> > > +        comb += e.eq(Mux(msb, self.i.z.e, self.i.z.e-1))
> > > 
> > > that would do alright, wouldn't it? 
> > 
> > that looks fine to me.
> 
> yeh.  oh, except i realised: arrays.  the next bit of the code is as follows:
> 
>         mw = self.o.z.m_width
>         comb += [
>             self.o.z.m.eq(p[mw+2:]),            # mantissa
>             self.o.of.m0.eq(p[mw+2]),           # copy of LSB
>             self.o.of.guard.eq(p[mw+1]),        # guard
>             self.o.of.round_bit.eq(p[mw]),      # round
>             self.o.of.sticky.eq(p[0:mw].bool()) # sticky
>         ]
> 
> the assumption here is that the mantissa width is a fixed parameter.
> 
> we'll need a P-variant of "Part" which takes a *list* of partition-selectable
> widths/offsets - is that making any sense?

yup, makes sense to me.

> 
> this is going to be quite intrusive; ultimately, however, i think if we
> tried to write an "explicit" partitioned version of this code (with for-loops
> etc.) it would quickly become not just unreadable, it would also become
> apparent that the process is extremely regular and monotonous.

yeah. it's only good as a backup option, since it would still work despite the verbosity.
Comment 36 Luke Kenneth Casson Leighton 2019-08-25 21:52:34 BST
ok so except for fcvt and fclass, the removal of m.If/Elif/Else has been
done, replacing them with Mux.  specialcases are now encoded in an
exceptionally annoying, weird way: none of them are particularly obvious.
i used a cumulative (unbalanced) mux-tree.
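
to illustrate the cumulative mux-tree idea (the signal names here are made
up, not the actual specialcases code): each case overrides the result
accumulated so far, so the highest-priority case is muxed last and the
tree ends up unbalanced:

    from nmigen import Module, Signal, Mux

    m = Module()
    a_nan, b_nan, a_inf = Signal(), Signal(), Signal()
    nan_r, inf_r, normal_r = Signal(32), Signal(32), Signal(32)
    z = Signal(32)

    # instead of m.If(...) / m.Elif(...) / m.Else(...):
    r = normal_r                      # default (the m.Else case)
    r = Mux(a_inf, inf_r, r)          # lower-priority special case
    r = Mux(a_nan | b_nan, nan_r, r)  # highest-priority case, muxed last
    m.d.comb += z.eq(r)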

fpbase is going to become quite weird: it's where all the magic constants
exist, and where things like NaN, Inf and denormalisation are detected.
i may just write a partitioned version of FPNumBaseRecord that has the
same API as FPNumBaseRecord.
Comment 37 Luke Kenneth Casson Leighton 2019-08-28 07:26:32 BST
pipeline unit test on Mul_8_16_32_64 works.
Comment 39 Luke Kenneth Casson Leighton 2020-01-07 07:56:10 GMT

On Tuesday, January 7, 2020, Immanuel, Yehowshua U <yimmanuel3@gatech.edu> wrote:
> exactly how that's done given that Signal cannot at the moment be
> derived from (due to whitequark trying to early-optimise through use
> of "type", and not properly respect python coding conventions and
> warnings *sigh*) can i leave it to you to investigate?

Yes. I will add this to the pipeline.
Shouldn’t be too hard to come up with an inheritable structure for Partitioned signals to be used with FMAs.

superb.

the ideal situation is this:

class Foo(Elaboratable):
    def __init__(self, SigKls):
        self.a = SigKls(32)   # SigKls: Signal *or* PartitionedSignal
        self.o = SigKls(32)

    def elaborate(self, platform):
        m = Module()
        ...
        m.d.comb += self.o.eq(self.a + Const(5))
        return m


where we literally pass in Signal *or* PartitionedSignal as a parameter, and regardless of which, the code will still work.

that means that PartitionedSignal.__add__ etc. return code fragments that perform the partitioning.

actually, the first priority might just be to do a straight "passthrough" class which pretty much only wraps Signal's operators and constructor.

doesn't do any partitioning, just acts exactly like Signal, initially.
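
something like this, perhaps (a minimal sketch only: the class name and
helper logic are hypothetical): wrap a Signal, delegate the operators to
it, and later substitute partitioned implementations behind the same
interface:

    from nmigen import Signal

    class PassthroughSignal:
        # acts exactly like Signal for now; __add__ etc. are the hook
        # points where partitioned implementations get swapped in later.
        def __init__(self, width):
            self.sig = Signal(width)

        def __add__(self, other):
            o = other.sig if isinstance(other, PassthroughSignal) else other
            return self.sig + o

        def eq(self, other):
            o = other.sig if isinstance(other, PassthroughSignal) else other
            return self.sig.eq(o)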
Comment 40 Luke Kenneth Casson Leighton 2020-01-24 12:20:59 GMT
i've started on the PartitionedSignal class, and have plumbed in PartitionedAdder
and confirmed that it works:

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/partsig.py
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/test/test_partsig.py

it's relatively straightforward (mainly thanks to jacob already having
created PartitionedAdder).  am currently doing "eq" which will be an
override of Signal.__eq__.

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/equal.py

this is slightly tricky because the response changes depending on the
partitioning, i.e. it represents either a BOOL (single-value) or an *array*
of BOOLs (SIMD comparisons).

this kinda has to be defined, otherwise we get into a bit of a mess.
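
to illustrate the shape of the problem (the representation is not yet
decided, these are just the possibilities): with 32-bit operands and
byte-level partitioning, the "result" of a == b is one of:

    unpartitioned (1x32):  a single bool
    2x16 SIMD:             2 independent bools
    4x8  SIMD:             4 independent bools

and the same hardware has to produce all of these depending on the runtime
partition bits.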
Comment 41 Jacob Lifshay 2020-01-24 15:11:00 GMT
(In reply to Luke Kenneth Casson Leighton from comment #40)
> https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/
> equal.py
> 
> this is slightly tricky because the response changes depending on the
> partitioning, i.e. represents either a BOOL (single-value) or an *array*
> of BOOLs (SIMD comparisons).
> 
> this kinda has to be defined otherwise we get into a bit of a mess.

Probably a good idea to think of Signal as a scalar and PartitionedSignal as always a vector (maybe we should rename to VectorSignal?). That will make it much easier to keep track of.
Comment 42 Luke Kenneth Casson Leighton 2020-01-26 12:18:49 GMT
in the work i did at Aspex, this concept was called "Array Strings", because they were partitionable strings of arrays of ALUs - 1-bit ALUs initially, although they went to 2-bit later.

in my mind "vector" is kinda clinical and could also easily be confused with toplevel ISA vectors.  imagine doing a grep on the codebase looking for vector processing and finding VectorSignal, thinking it was part of the ISA.
Comment 43 Luke Kenneth Casson Leighton 2020-01-26 12:26:51 GMT
https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/eq/

i had to write out the boolean truth table. it appears to basically follow the pattern of predicated findfirst (a vector opcode commonly found in vector ISAs).

for bit zero of the output, on the first zero of the partition, starting from the MSB, the eqs are &-daisychained, stopping when a "1" in the partition bits is encountered.

for bit 1 of the output, if the previous (higher-numbered) partition bit was zero, then so is the output bit; otherwise it starts a new daisychain of &s.

this is the "findfirst" aspect of the table.

i can do a crude version of this fairly easily; it will look awful, be horrendously suboptimal, and critically rely on yosys boolean algebraic optimisation.

it would however be nice for this to be a leetle bit cleaner.
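
as a sanity-check, here is a quick behavioural model of what partitioned eq
is supposed to compute (pure python, not the gate-level daisychain; the
names are made up).  parts[i] set means a partition break between chunk i
and chunk i+1:

    # returns one bool per partition, lowest partition first.
    def ref_partitioned_eq(a_chunks, b_chunks, parts):
        out = []
        eq_so_far = True
        for i in range(len(a_chunks)):
            eq_so_far = eq_so_far and (a_chunks[i] == b_chunks[i])
            last = (i == len(a_chunks) - 1)
            if last or parts[i]:         # a break closes the current run
                out.append(eq_so_far)
                eq_so_far = True
        return out

    # e.g. 4 byte-chunks split 2x16 (parts = [0, 1, 0]):
    assert ref_partitioned_eq([1, 2, 3, 4],
                              [1, 2, 9, 4], [0, 1, 0]) == [True, False]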
Comment 44 Luke Kenneth Casson Leighton 2020-01-30 18:53:26 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_cmp/equal.py;h=2fdaabf46bde5284237ee6b0c0ff0efb00222f31;hb=4977861f9fbf1f48c4e5aa4362a56aff58efe7a8

hooray! working eq function!  stunning, i'm amazed.  the graphviz is a
total dog's dinner; however, actually having something working is really
critical.

some help reviewing this to see if there's a better way to do it would be
really appreciated.
Comment 45 Luke Kenneth Casson Leighton 2020-02-05 16:33:52 GMT
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-February/003820.html

link to discussion on alternative eq and gt/lt
Comment 46 Luke Kenneth Casson Leighton 2020-02-05 17:40:30 GMT
michael, a quick test added to this file:
"ieee754/part_cmp/eq_gt_ge.py"

shows that the output order is inverted compared to what
test_partsig is expecting.
Comment 47 Luke Kenneth Casson Leighton 2020-02-06 14:30:37 GMT
ha!  cool!  nicely done, michael, that's test_partsig.py working well.
i just added a GT test, cut/paste style (plenty of code-duplication); will
add GE as well, then drop in LT, LE, NE on top of those.
Comment 48 Luke Kenneth Casson Leighton 2020-02-06 14:58:28 GMT
ok that's done, that's the six operators, le, lt, ne, gt, ge, eq.
cool!
Comment 49 Luke Kenneth Casson Leighton 2021-10-02 17:10:30 BST
nggh, i just realised: the reversed functions __radd__ etc. need to be a lot
more sophisticated, because they can take a Signal (or Const, or any AST
Expression) as the other operand, but the return result must be a
PartitionedSignal, otherwise what is the thing doing??

which means that where the underlying code (PartitionedShift) in some cases
assumes that it is the *thing being shifted* that may be scalar, here
we have a situation where it is the *shift amount* that is scalar.

or maybe that is the other way round.

either way this needs to be looked at and taken into consideration.
bugreport underway....
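
to illustrate what "more sophisticated" means (a sketch only; the
_broadcast helper is hypothetical): the reflected operator receives the
scalar as "other", which sat on the *left* of the original expression,
and operand order matters for the non-commutative ops:

    # Const(5) - partitioned_sig calls partitioned_sig.__rsub__(Const(5))
    def __rsub__(self, other):
        other = self._broadcast(other)  # hypothetical: scalar -> partitioned
        return other - self             # note: other - self, NOT self - other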


the other one (already raised) is __getitem__ and Slice, bug #716, which
also need overriding.  maybe.  __getitem__ definitely, Part definitely;
Slice perhaps not, because it is, i believe, supposed to take integers
(compile-time) only.  dynamic selection uses Array and we are not
tackling that yet.  too complex.
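
for reference, the distinction being relied on here (plain nmigen; the
signal names are made up):

    from nmigen import Signal, Array

    s = Signal(32)
    lo = s[0:8]        # __getitem__ -> Slice: indices are compile-time ints

    idx = Signal(2)    # runtime-dynamic selection has to go through Array
    byte = Array(s[i*8:(i+1)*8] for i in range(4))[idx]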