Bug 60 - N-stage 64-bit multiplier pipeline needed (signed/unsigned)
Summary: N-stage 64-bit multiplier pipeline needed (signed/unsigned)
Status: PAYMENTPENDING FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version: unspecified
Hardware: All All
: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on: 84
Blocks: 49 77
  Show dependency treegraph
 
Reported: 2019-04-13 11:49 BST by Luke Kenneth Casson Leighton
Modified: 2020-12-06 14:02 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 2500
budget (EUR) for this task, excluding subtasks' budget: 2500
parent task for budget allocation: 77
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
"jacob"={amount=2500, paid=2019-06-04}


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2019-04-13 11:49:50 BST
a 2-stage 64-bit integer multiplier is needed,
with capability to support:
* signed x signed
* signed x unsigned 
* unsigned x unsigned
Comment 1 Jacob Lifshay 2019-04-15 08:36:58 BST
I'm planning to design it to have integer fma capabilities, because adding 1 more add into the carry-save adders has less latency than a separate adder. I'm also planning to design it to be splittable to 1x64, 2x32, 4x16, and 8x8-bits, for use in the SIMD FPU/ALU.

I had been planning on multiplication taking 3 or 4 cycles, so, for the FPU, the multiplier will take part of the first cycle (leaving space for shifting mantissas into place) and part of the last cycle (leaving space to normalize) and all of the intermediate cycles.

If we plan on multiplication taking only 2 cycles, I'm worried that the latency would be high enough that we can't clock it very fast (I'd like to design the latency to allow us to go to at least 1.5GHz or so if we ignore our power limits).

In the barrel processor design, I had already designed for a 3-stage multiplier, though that can be relatively easily changed at this point.
Comment 2 Luke Kenneth Casson Leighton 2019-04-15 08:42:38 BST
3-4 it is.
Comment 3 Luke Kenneth Casson Leighton 2019-04-15 10:36:57 BST
the shakti group i know did the 64-bit multiply with a series of 8-bit
multipliers, then added up the intermediate results.  mitch alsup mentioned
on comp.arch that there is a modern algorithm for extremely efficient
integer-multiply, where we should i feel assume that below a certain
granularity that we have access to such.  not least, for FPGA demos,
FPGAs have DSP blocks.

in the IEEE754 FPU code it is extremely easy to instantiate extra units,
and creating a SIMD pipeline will be a very very straightforward matter
of declaring an array of pipelines, and using Cat() to split the incoming
data and push it to the associated unit then reassemble on exit.  some
synchronisation will be required (if we allow early-out), however that
would not be difficult to achieve.

so the primary need then will be flexibility (parameterisation) on a
single block (as opposed to designing a parallel/SIMD engine): being
able to specify the two key parameters of that single block, namely:

* the required pipeline depth
* the bit-width

i say that because if a third parameter is provided (amount of parallelism)
it may actually become inconvenient to deploy in the IEEE754 FPU, as the
parallelism would be achieved at a level way above the actual integer
multiplier component.

for an integer-based ALU, on the other hand, that *will* need parallelism,
however i feel that such a "parallelising" or "array" capability is still
a separate task, needed for many many other purposes i.e. SIMD-ifiying in
general, for ADD, SHIFT, FMAC... everything in fact :)
Comment 4 Luke Kenneth Casson Leighton 2019-04-29 04:36:24 BST
cooooool, found the src/multiplier.py code - recursive code-generation
(AddReduce), cooool :)

        self.partition_points = dict((
            (index, Value.wrap(value))
            for index, value in partition_points.items()))

can be done as:

        for idx, value in self.partition_points.items():
            self.partition_points[idx] = Value.wrap(value)

there's probably a... "nicer" way to do it, just can't think of it.
Comment 5 Luke Kenneth Casson Leighton 2019-05-01 18:49:36 BST
    def eq(self, rhs: 'PartitionPoints') -> Iterable[Assign]:
        if set(self.keys()) != set(rhs.keys()):
            raise ValueError("incompatible point set")
        for point, enabled in self.items():
            yield enabled.eq(rhs[enabled])


dict.items returns key, value - so that will need to be
yiield enabled.eq(rhs[point])

using the words "key, value" - as in the commonly-used phrase
"key-value pairs" - helps make it clear.

yield value.eq(rhs[key])
Comment 6 Jacob Lifshay 2019-05-01 18:55:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
>     def eq(self, rhs: 'PartitionPoints') -> Iterable[Assign]:
>         if set(self.keys()) != set(rhs.keys()):
>             raise ValueError("incompatible point set")
>         for point, enabled in self.items():
>             yield enabled.eq(rhs[enabled])
> 
> 
> dict.items returns key, value - so that will need to be
> yiield enabled.eq(rhs[point])
> 
> using the words "key, value" - as in the commonly-used phrase
> "key-value pairs" - helps make it clear.
> 
> yield value.eq(rhs[key])

oops, fixed
Comment 7 Jacob Lifshay 2019-05-17 07:20:18 BST
I implemented the multiplier, it passes all tests.
I don't have any tests for having pipeline registers in the multiplier yet.
I also didn't yet add the adder input for FMA, that should be relatively trivial.
I also need to add additional documentation.

The multiplier supports every combination of aligned 8/16/32/64 that fits in 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh, mulhsu, and mulhu.

Review and comments welcome.

The code is at https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
The tests are at https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/test/test_multiply.py
Comment 8 Luke Kenneth Casson Leighton 2019-05-17 08:27:04 BST
ran the unit tests:

.E......
======================================================================
ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
----------------------------------------------------------------------
TypeError: get_test_cases() missing 2 required positional arguments: 'lanes' and 'keys'

----------------------------------------------------------------------
Ran 8 tests in 57.919s

FAILED (errors=1)
Comment 9 Jacob Lifshay 2019-05-17 08:35:47 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> ran the unit tests:
> 
> .E......
> ======================================================================
> ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
> ----------------------------------------------------------------------
> TypeError: get_test_cases() missing 2 required positional arguments: 'lanes'
> and 'keys'
> 
> ----------------------------------------------------------------------
> Ran 8 tests in 57.919s
> 
> FAILED (errors=1)

get_test_cases is not a unit test case, it's a utility function. I think it's getting called by python's unittest package because you're using an older (maybe newer?) version of python that checks for the string "test" anywhere in the function name rather than just at the beginning.

Also, I find it takes a lot less time using pypy3 rather than python3. I can get it to run all the tests in around 15-20s.
Comment 10 Luke Kenneth Casson Leighton 2019-05-17 08:47:20 BST
(In reply to Jacob Lifshay from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > ran the unit tests:
> > 
> > .E......
> > ======================================================================
> > ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
> > ----------------------------------------------------------------------
> > TypeError: get_test_cases() missing 2 required positional arguments: 'lanes'
> > and 'keys'
> > 
> > ----------------------------------------------------------------------
> > Ran 8 tests in 57.919s
> > 
> > FAILED (errors=1)
> 
> get_test_cases is not a unit test case, it's a utility function. I think
> it's getting called by python's unittest package because you're using an
> older (maybe newer?) version of python that checks for the string "test"
> anywhere in the function name rather than just at the beginning.

 ah i ran nosetests3, that'll be why.

> Also, I find it takes a lot less time using pypy3 rather than python3. I can
> get it to run all the tests in around 15-20s.

 oo that's interesting.  and *sigh* of course it doesn't respect the
 python3 library paths properly or in full *sigh*.  i'll work that out
 later.
Comment 11 Jacob Lifshay 2019-05-17 08:58:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #9)
> > Also, I find it takes a lot less time using pypy3 rather than python3. I can
> > get it to run all the tests in around 15-20s.
> 
>  oo that's interesting.  and *sigh* of course it doesn't respect the
>  python3 library paths properly or in full *sigh*.  i'll work that out
>  later.

I use pypy3 v7.1.1-beta from https://pypy.org/download.html
I just treat it as a completely separate python installation, so I used "pypy3 -m pip" to install everything.
Comment 12 Luke Kenneth Casson Leighton 2019-05-17 09:22:19 BST
(In reply to Jacob Lifshay from comment #7)
> I implemented the multiplier, it passes all tests.

 fantastic.

> I don't have any tests for having pipeline registers in the multiplier yet.

 ok, so this is for multi-stage multiply?

> I also didn't yet add the adder input for FMA, that should be relatively
> trivial.

 yes.

> I also need to add additional documentation.
> 
> The multiplier supports every combination of aligned 8/16/32/64 that fits in
> 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh,
> mulhsu, and mulhu.

 excellent.  it likely qualifies as "completed milestone".  we can move
 more extensive unit tests to a separate milestone.


> Review and comments welcome.

 the formatting is a little... weird.

 it's generally good practice to use vertical alignment to make it easier
 for the reader to spot errors and differences vertically:

        lanes = [SIMDMulLane(True,
                             True,
                             8,
                             True),
                 SIMDMulLane(False,
                             False,
                             8,
                             True),
                 SIMDMulLane(True,
                             True,
                             16,
                             False),
                 SIMDMulLane(True,
                             False,
                             32,
                             True)]

 is almost impossible to see what the differences are between the 4 cases.

        lanes = [SIMDMulLane(True, True, 8, True),
                 SIMDMulLane(False, False, 8, True),
                 SIMDMulLane(True, True, 16, False),
                 SIMDMulLane(True, False, 32, True)]

 is trivial to see what the differences are

        lanes = [SIMDMulLane(True,  True,  8,  True),
                 SIMDMulLane(False, False, 8,  True),
                 SIMDMulLane(True,  True,  16, False),
                 SIMDMulLane(True,  False, 32, True)]

 is extremely clear what the differences between the four cases are.


----

            def async_process() -> AsyncProcessGenerator:
                for a_signed in False, True:

 is so heavily indented (due to it being a nested function) that again,
 readability is compromised.

                yield from self.subtest_mul8_16_32_64([SIMDMulLane(False,
                                                                   False,
                                                                   32,
                                                                   True),

 moving it to a global function would reduce the indentation by a whopping
 3 levels, allowing the code size to be hugely reduced in the process


----

 all of these:

        self._intermediate_output = Signal(128)
        self._part_8 = [Signal(name=f"_part_8_{i}") for i in range(8)]
        self._part_16 = [Signal(name=f"_part_16_{i}") for i in range(4)]
        self._part_32 = [Signal(name=f"_part_32_{i}") for i in range(2)]
        self._part_64 = Signal()
        self._output_64 = Signal(64)
        self._output_32 = Signal(64)
        self._output_16 = Signal(64)
        self._output_8 = Signal(64)
        self._a_signed = [Signal(name=f"_a_signed_{i}") for i in range(8)]
        self._b_signed = [Signal(name=f"_b_signed_{i}") for i in range(8)]
        self._not_a_term_8 = Signal(128)
        self._neg_lsb_a_term_8 = Signal(128)
        self._not_b_term_8 = Signal(128)
        self._neg_lsb_b_term_8 = Signal(128)
        self._not_a_term_16 = Signal(128)
        self._neg_lsb_a_term_16 = Signal(128)
        self._not_b_term_16 = Signal(128)
        self._neg_lsb_b_term_16 = Signal(128)
        self._not_a_term_32 = Signal(128)
        self._neg_lsb_a_term_32 = Signal(128)
        self._not_b_term_32 = Signal(128)
        self._neg_lsb_b_term_32 = Signal(128)
        self._not_a_term_64 = Signal(128)
        self._neg_lsb_a_term_64 = Signal(128)
        self._not_b_term_64 = Signal(128)
        self._neg_lsb_b_term_64 = Signal(128)

 can be moved *into* the elaborate function, the self. removed *and* the
 underscore.

 they're entirely local to the function named elaborate: therefore there
 is absolutely no need to expose them (at all) to the public API.

 that they're in the __init__ means that it is almost impossible to
 work out what the inputs and outputs of the function are.

 unless there is an intention to derive from Mul8_16_32_64 at some point
 in the future?

 commenting and grouping the inputs and outputs is pretty essential.

        # are these inputs, are they outputs? or intermediaries?
        # it's impossible to tell
        self.part_pts = PartitionPoints()
        for i in range(8, 64, 8):
            self.part_pts[i] = Signal(name=f"part_pts_{i}")
        self.part_op = [Signal(2, name=f"part_op_{i}") for i in range(8)]

        # input operand
        self.a = Signal(64)
        self.b = Signal(64)

        # output operand
        self.output = Signal(64)

----
 

 add_term is small enough (in this case) to keep as a locally-scoped function.
 the use of typing however is still interfering with readability, by forcing
 the unnecessary single-line-per-function-argument that most python developers
 are used to.

 
-----

    def __init__(self, partition_points: PartitionPointsIn = {}):
        super().__init__()

 no!  it is *absolutely essential* never to declare a dictionary as an
 argument to a function!

 this is the pattern for creating a *singleton*!

 any modification of that argument, partition_points, will result in *all*
 uses of that class seeing the *modified* partition_points!

 this is down to dictionaries being modifiable.  if it was a tuple () it
 would be ok, because tuples are not modifiable after creation.

 replace with this:

    def __init__(self, partition_points: PartitionPointsIn = None):
        super().__init__()
        if partition_points is None:
            partition_points = {}


----

        for not_a_term, \
            neg_lsb_a_term, \
            not_b_term, \
            neg_lsb_b_term, \
            parts in [
                (self._not_a_term_8, 
                 self._neg_lsb_a_term_8,
                 self._not_b_term_8,
                 self._neg_lsb_b_term_8,
                 self._part_8),
                (self._not_a_term_16,
                 self._neg_lsb_a_term_16,
                 self._not_b_term_16,
                 self._neg_lsb_b_term_16,
                 self._part_16),
                (self._not_a_term_32,
                 self._neg_lsb_a_term_32,
                 self._not_b_term_32,
                 self._neg_lsb_b_term_32,
                 self._part_32),
                (self._not_a_term_64,
                 self._neg_lsb_a_term_64,
                 self._not_b_term_64,
                 self._neg_lsb_b_term_64,
                 [self._part_64]),
                ]:


 yuck :)

 and unfortunately, the names of those variables are too long (even when
 self._ is removed) to include on one line.

 reducing them to "notaterm_8"  or removing the word "term" - not_a_8,
 neg_lsb_a_32, etc. would allow that to happen.

 this would improve readability greatly and reduce code length (again,
 increasing readability)


---- (continuing from above)

 the fact that they are "terms" may be commented above the block
 (which has a comment missing explaining what the for-loop is for)

        # something about this for-loop, which is impossible to
        # determine from the contents of the loop itself as it's
        # too code-dense, which deliberately includes the word "term"
        # to make it clear that the 4 things are "terms"

        for not_a, neg_lsb_a, not_b, neg_lsb_b, parts in [
           (not_a_8, neg_lsb_a_8, not_b_8, neg_lsb_b_8, part_8),
           ....
           ....

----

 generally, there are comments missing which explain each code block
 and allow the reader to understand what is going on.  in particular,
 any code-heavily-dense block it is essential to have an introductory
 explanation as to what that block is for.


----

        groups = self.full_adder_groups()
        if len(groups) == 0:
            if len(self.inputs) == 0:
                m.d.comb += self.output.eq(0)
            elif len(self.inputs) == 1:
                m.d.comb += self.output.eq(self._resized_inputs[0])
            else:
                assert len(self.inputs) == 2
                adder = PartitionedAdder(len(self.output),
                                         self._reg_partition_points)
                m.submodules.final_adder = adder
                m.d.comb += adder.a.eq(self._resized_inputs[0])
                m.d.comb += adder.b.eq(self._resized_inputs[1])
                m.d.comb += self.output.eq(adder.output)
            return m

 a comment is needed to divide this block from the code above it.
 also, why does the module return early if len(groups) == 0?
Comment 13 Luke Kenneth Casson Leighton 2019-05-17 09:23:16 BST
(In reply to Jacob Lifshay from comment #11)

> I use pypy3 v7.1.1-beta from https://pypy.org/download.html
> I just treat it as a completely separate python installation, so I used
> "pypy3 -m pip" to install everything.

 ok that explains it.  it's identical to just installing a new minor version
 of python3 (3.5, 3.6, 3.7 etc.)
Comment 14 Jacob Lifshay 2019-05-18 03:35:50 BST
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #7)
> > I don't have any tests for having pipeline registers in the multiplier yet.
> 
>  ok, so this is for multi-stage multiply?
it has an input that allows you to specify which levels of the carry-save adder tree have registers and which levels are wired straight through.
> 
> > I also didn't yet add the adder input for FMA, that should be relatively
> > trivial.
> 
>  yes.
> 
> > I also need to add additional documentation.
> > 
> > The multiplier supports every combination of aligned 8/16/32/64 that fits in
> > 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh,
> > mulhsu, and mulhu.
> 
>  excellent.  it likely qualifies as "completed milestone".  we can move
>  more extensive unit tests to a separate milestone.
> 
> 
> > Review and comments welcome.
> 
>  the formatting is a little... weird.
> 
>  it's generally good practice to use vertical alignment to make it easier
>  for the reader to spot errors and differences vertically:
> 
>         lanes = [SIMDMulLane(True,
>                              True,
>                              8,
>                              True),
>                  SIMDMulLane(False,
>                              False,
>                              8,
>                              True),
>                  SIMDMulLane(True,
>                              True,
>                              16,
>                              False),
>                  SIMDMulLane(True,
>                              False,
>                              32,
>                              True)]
> 
>  is almost impossible to see what the differences are between the 4 cases.
> 
>         lanes = [SIMDMulLane(True, True, 8, True),
>                  SIMDMulLane(False, False, 8, True),
>                  SIMDMulLane(True, True, 16, False),
>                  SIMDMulLane(True, False, 32, True)]
> 
>  is trivial to see what the differences are
> 
>         lanes = [SIMDMulLane(True,  True,  8,  True),
>                  SIMDMulLane(False, False, 8,  True),
>                  SIMDMulLane(True,  True,  16, False),
>                  SIMDMulLane(True,  False, 32, True)]
> 
>  is extremely clear what the differences between the four cases are.
yeah, I'll fix that.
> 
> 
> ----
> 
>             def async_process() -> AsyncProcessGenerator:
>                 for a_signed in False, True:
> 
>  is so heavily indented (due to it being a nested function) that again,
>  readability is compromised.
> 
>                 yield from self.subtest_mul8_16_32_64([SIMDMulLane(False,
>                                                                    False,
>                                                                    32,
>                                                                    True),
> 
>  moving it to a global function would reduce the indentation by a whopping
>  3 levels, allowing the code size to be hugely reduced in the process
unfortunately it can't be a global function since it uses self from the surrounding function but it can't take any arguments
> 
> 
> ----
> 
>  all of these:
> 
>         self._intermediate_output = Signal(128)
>         self._part_8 = [Signal(name=f"_part_8_{i}") for i in range(8)]
...
>         self._neg_lsb_b_term_64 = Signal(128)
> 
>  can be moved *into* the elaborate function, the self. removed *and* the
>  underscore.
> 
>  they're entirely local to the function named elaborate: therefore there
>  is absolutely no need to expose them (at all) to the public API.
They're exposed so the test framework can access them, otherwise they are private.
> 
>  that they're in the __init__ means that it is almost impossible to
>  work out what the inputs and outputs of the function are.
> 
>  unless there is an intention to derive from Mul8_16_32_64 at some point
>  in the future?
> 
>  commenting and grouping the inputs and outputs is pretty essential.
> 
>         # are these inputs, are they outputs? or intermediaries?
>         # it's impossible to tell
>         self.part_pts = PartitionPoints()
>         for i in range(8, 64, 8):
>             self.part_pts[i] = Signal(name=f"part_pts_{i}")
>         self.part_op = [Signal(2, name=f"part_op_{i}") for i in range(8)]
> 
>         # input operand
>         self.a = Signal(64)
>         self.b = Signal(64)
> 
>         # output operand
>         self.output = Signal(64)
all of part_pts, part_op, a, and b are inputs, output is an output.
> 
> ----
>  
> 
>  add_term is small enough (in this case) to keep as a locally-scoped
> function.
>  the use of typing however is still interfering with readability, by forcing
>  the unnecessary single-line-per-function-argument that most python
> developers
>  are used to.
> 
>  
> -----
> 
>     def __init__(self, partition_points: PartitionPointsIn = {}):
>         super().__init__()
> 
>  no!  it is *absolutely essential* never to declare a dictionary as an
>  argument to a function!
> 
>  this is the pattern for creating a *singleton*!
> 
I get that, however, partition_points is not modified or stored in an attribute, so having a singleton default argument doesn't hurt and actually might save memory, since an extra dict doesn't need to be allocated each time.

> ----
> 
>         for not_a_term, \
>             neg_lsb_a_term, \
>             not_b_term, \
>             neg_lsb_b_term, \
>             parts in [
>                 (self._not_a_term_8, 
>                  self._neg_lsb_a_term_8,
>                  self._not_b_term_8,
>                  self._neg_lsb_b_term_8,
>                  self._part_8),
>                 (self._not_a_term_16,
>                  self._neg_lsb_a_term_16,
>                  self._not_b_term_16,
>                  self._neg_lsb_b_term_16,
>                  self._part_16),
>                 (self._not_a_term_32,
>                  self._neg_lsb_a_term_32,
>                  self._not_b_term_32,
>                  self._neg_lsb_b_term_32,
>                  self._part_32),
>                 (self._not_a_term_64,
>                  self._neg_lsb_a_term_64,
>                  self._not_b_term_64,
>                  self._neg_lsb_b_term_64,
>                  [self._part_64]),
>                 ]:
> 
> 
>  yuck :)
I should probably change them to arrays or dicts, so there would be a single self._not_a_term initialized to {8: Signal(...), 16: Signal(...), ...}.
similarly for _neg_lsb_a_term, _not_b_term, _neg_lsb_b_term, and _part (which I might name parts)
> 
>  and unfortunately, the names of those variables are too long (even when
>  self._ is removed) to include on one line.
> 
>  reducing them to "notaterm_8"  or removing the word "term" - not_a_8,
>  neg_lsb_a_32, etc. would allow that to happen.
> 
>  this would improve readability greatly and reduce code length (again,
>  increasing readability)
> 
> 
> ---- (continuing from above)
> 
>  the fact that they are "terms" may be commented above the block
>  (which has a comment missing explaining what the for-loop is for)
> 
>         # something about this for-loop, which is impossible to
>         # determine from the contents of the loop itself as it's
>         # too code-dense, which deliberately includes the word "term"
>         # to make it clear that the 4 things are "terms"
> 
>         for not_a, neg_lsb_a, not_b, neg_lsb_b, parts in [
>            (not_a_8, neg_lsb_a_8, not_b_8, neg_lsb_b_8, part_8),
>            ....
>            ....
> 
> ----
> 
>  generally, there are comments missing which explain each code block
>  and allow the reader to understand what is going on.  in particular,
>  any code-heavily-dense block it is essential to have an introductory
>  explanation as to what that block is for.
> 
yeah, I was just writing code to get it to pass the tests, rather than writing lots of comments. I'll be fixing that.
> 
> ----
> 
>         groups = self.full_adder_groups()
>         if len(groups) == 0:
>             if len(self.inputs) == 0:
>                 m.d.comb += self.output.eq(0)
>             elif len(self.inputs) == 1:
>                 m.d.comb += self.output.eq(self._resized_inputs[0])
>             else:
>                 assert len(self.inputs) == 2
>                 adder = PartitionedAdder(len(self.output),
>                                          self._reg_partition_points)
>                 m.submodules.final_adder = adder
>                 m.d.comb += adder.a.eq(self._resized_inputs[0])
>                 m.d.comb += adder.b.eq(self._resized_inputs[1])
>                 m.d.comb += self.output.eq(adder.output)
>             return m
> 
>  a comment is needed to divide this block from the code above it.
>  also, why does the module return early if len(groups) == 0?
it returns early because this is the base case in a recursive module instantiation. if it didn't, there would be a stack overflow at elaborate-time.

basically, each invocation converts the input terms to 2 * (len(inputs) // 3) + len(inputs) % 3 terms for the next recursive call using len(inputs) // 3 carry-save adders, where len(groups) is len(inputs) // 3.

Once len(inputs) < 3, then it executes the base case, which adds the last 2 inputs together using a traditional adder.

I'll be adding that or similar to a comment.
Comment 15 Jacob Lifshay 2019-05-18 03:37:46 BST
I was thinking it might be a good idea to write an update for croudsupply describing the implementation of the multiplier and how I arrived at the code I have now from initial principles.
Comment 16 Luke Kenneth Casson Leighton 2019-05-18 05:25:02 BST
(In reply to Jacob Lifshay from comment #15)
> I was thinking it might be a good idea to write an update for croudsupply
> describing the implementation of the multiplier and how I arrived at the
> code I have now from initial principles.

that's a really good idea.  you have access to the repo.
https://git.libre-riscv.org/?p=crowdsupply.git;a=summary

good responses, good justifications for what you did (on previous comment),
use the answers you gave to create comments.

in particular make sure that the recursive case is commented, as it wasn't
clear that the function *is* used recursively! (no comments to that effect,
i didn't notice).

the singleton thing is so unsafe, it causes so many problems and wastes
so much time trying to track it down, it's just not worth the hassle of
trying to save 1 lines of code.  please use the pattern that i provided
instead.
Comment 17 Luke Kenneth Casson Leighton 2019-05-24 17:14:09 BST
jacob, the simpler way to do what is done below is:
    def __init__(self, partition_points: PartitionPointsIn = None):
        if partition_points is None:
             partition_points = {}
        for ....

there is no need to have the entire block under a conditional "is not None".

if you absolutely must have the entire block under a conditional "is not None"
test, it is better to do this instead:

        if partition_points is None:
            return
        for .....



@@ -34,18 +38,27 @@ class PartitionPoints(Dict[int, Value]):
         * bits 10 <= ``i`` < 16
     """
 
-    def __init__(self, partition_points: PartitionPointsIn = {}):
+    def __init__(self, partition_points: Optional[PartitionPointsIn] = None):
+        """Create a new ``PartitionPoints``.
+
+        :param partition_points: the input partition points to values mapping.
+        """
         super().__init__()
-        for point, enabled in partition_points.items():
-            if not isinstance(point, int):
-                raise TypeError("point must be a non-negative integer")
-            if point < 0:
-                raise ValueError("point must be a non-negative integer")
-            self[point] = Value.wrap(enabled)
+        if partition_points is not None:
+            for point, enabled in partition_points.items():
+                if not isinstance(point, int):
+                    raise TypeError("point must be a non-negative integer")
+                if point < 0:
+                    raise ValueError("point must be a non-negative integer")
+                self[point] = Value.wrap(enabled)
Comment 18 Luke Kenneth Casson Leighton 2019-05-24 17:19:52 BST
the convention for docstrings is:
+        """ Assign ``PartitionPoints`` using ``Signal.eq``.
+        """


+        """ Create a bit-mask from `self`.
+
+            Each bit in the returned mask is clear only if the partition point
+            at the same bit-index is enabled.
+
+            :param width: the bit width of the resulting mask
+        """

* one space after the introductory """
* alignment of ALL text starting lined up with the first word NOT with the """
* ending """ on its own blank line

reasons are as follow:

* space after the introductory """ makes it clear
* alignment of the text vertically makes for increased readability
* the alignmnent """ SPACE happens coincidentally to be a multiple of 4.
  this matches python's 4-space per indentation pep8 rule.
* the use of """ at the end on its own line provides for a single break
  between the __init__ part and the start of the code


indent one space between """ and A

+        """Assign ``PartitionPoints`` using ``Signal.eq``."""

start ending """ on new line

         if set(self.keys()) != set(rhs.keys()):
             raise ValueError("incompatible point set")
         for point, enabled in self.items():
             yield enabled.eq(rhs[point])
 
     def as_mask(self, width: int) -> Value:

indent one space between """ and A

+        """Create a bit-mask from `self`.
+
+        Each bit in the returned mask is clear only if the partition point at
+        the same bit-index is enabled.

indent these by four spaces

+        :param width: the bit width of the resulting mask

indent this by four spaces.

+        """
Comment 19 Luke Kenneth Casson Leighton 2019-05-24 17:33:20 BST
also, having the comment text at the same level as the """ gives the false
impression that the comment is somehow executable (part of the function).

by indenting it, a clear vertical indicator is given that the start and
end """ string-markers can be used to help a reader very rapidly locate
where the comment begins.

if that convention is not followed, the reader has to spend far more time
than is strictly necessary, parsing the text and wondering where the heck
the comment actually ends.

vertical alignment is far more important in python than in other languages,
not just for the code itself, there are conventions that dramatically increase
readability.
Comment 20 Jacob Lifshay 2019-05-24 20:17:01 BST
(In reply to Luke Kenneth Casson Leighton from comment #18)
> the convention for docstrings is:
I'm planning on changing the docs style to match numpy.

For most of the complaints, I had changed the style to match what pydocstyle expects:
http://www.pydocstyle.org/en/stable/error_codes.html
pydocstyle is based on pep257, which follows the same style of having no space after the opening quotes, having the multi-line docstring indented to the same depth as the first line, and having the closing quotes on the same line as the opening quotes for single-line docstrings.
Comment 21 Luke Kenneth Casson Leighton 2019-05-24 21:07:58 BST
(In reply to Jacob Lifshay from comment #20)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > the convention for docstrings is:
> I'm planning on changing the docs style to match numpy.
> 
> For most of the complaints, I had changed the style to match what pydocstyle
> expects:
> http://www.pydocstyle.org/en/stable/error_codes.html
> pydocstyle is based on pep257, which follows the same style of having no
> space after the opening quotes, having the multi-line docstring indented to
> the same depth as the first line, and having the closing quotes on the same
> line as the opening quotes for single-line docstrings.

https://github.com/numpy/numpy/blob/master/numpy/core/code_generators/ufunc_docstrings.py

The style is... tolerable. Barely. Placing the oneline summary *under* the opening """ makes for readability by highlighting the function definition. It is still not as good, I feel, as indenting by 8 spaces, which has the advantage of leaving the function def free from visual interference, allowing the reader's brain to utilise less energy to find where the actual function code begins.

Note that by not having a space between the summary and the opening """ this is a numpy style violation.

Must look up pydocstyle, I think there is an autostyle converter for it, just like autopep8.
Comment 22 Luke Kenneth Casson Leighton 2019-05-26 00:57:17 BST
https://github.com/numpy/numpy/blob/master/numpy/core/code_generators/generate_umath.py#L34

I am really shocked that such a major project would have made a decision to smash the comment start hard against the summary text.  It looks absolutely terrible, and the only reason they get away with it is because the numpy project is so large.

Argh noo, the main cpython source does it as well, arrrgh

https://github.com/python/cpython/blob/master/Lib/cmd.py

sigh.
I will have to investigate a bit more with pydocstyle, it is a debian package that autochecks code.

This is kinda important as the documentation generators critically depend on docstrings.

Must raise separate bugreport.
Comment 23 Luke Kenneth Casson Leighton 2019-08-17 11:52:45 BST
i've put multiply.py and test_multiply.py here:
https://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/part_mul_add;h=af3634054ac530f4fdc7b2877fbf370b9b1bb594;hb=43de8ed4528ea7627cfe2cbac9374074b22687ae

i've started the process of creating modules and using Cat() instead of
individual bit-wise assignment, plus identified some invariants and assigned
them to Signals (_part_byte() is a good example) rather than compute them
multiple times.

even with a (new) module called Term, the graphviz diagram on Mul_8_16_32_64
is innnnsaaaaane.

also, there's 8x8 intermediate product results of 16-bits each behind
Muxes: to save power i believe it should be the case that a and b go
through the Muxes (one each)... *then* the results of that go through
the multiply.

otherwise, the multiply is always active (using power) and there's 64 of them.
that would mean (i believe) that on an 8x 8-bit SIMD, 64 multiplications
would still be performed.
Comment 24 Luke Kenneth Casson Leighton 2019-08-17 18:26:32 BST
finally got to the point where multiply.py is subdivided into top-level
interconnected modules where there is *zero* "logic" at the top level,
only names (and renaming associated with module inputs / outputs).

it's still completely mad, however at least it's understandable.