a 2-stage 64-bit integer multiplier is needed, with capability to support:

* signed x signed
* signed x unsigned
* unsigned x unsigned
I'm planning to design it to have integer FMA capability, because adding one more addend into the carry-save adders has less latency than a separate adder. I'm also planning to design it to be splittable into 1x64, 2x32, 4x16, and 8x8 bits, for use in the SIMD FPU/ALU.

I had been planning on multiplication taking 3 or 4 cycles, so, for the FPU, the multiplier will take part of the first cycle (leaving space for shifting mantissas into place), part of the last cycle (leaving space to normalize), and all of the intermediate cycles. If we plan on multiplication taking only 2 cycles, I'm worried that the latency would be high enough that we can't clock it very fast (I'd like to design the latency to allow us to go to at least 1.5GHz or so if we ignore our power limits).

In the barrel processor design, I had already designed for a 3-stage multiplier, though that can be relatively easily changed at this point.
3-4 it is.
the shakti group, i know, did the 64-bit multiply with a series of 8-bit multipliers, then added up the intermediate results. mitch alsup mentioned on comp.arch that there is a modern algorithm for extremely efficient integer multiply, where i feel we should assume that below a certain granularity we have access to such. not least, for FPGA demos, FPGAs have DSP blocks.

in the IEEE754 FPU code it is extremely easy to instantiate extra units, and creating a SIMD pipeline will be a very very straightforward matter of declaring an array of pipelines, using Cat() to split the incoming data, pushing it to the associated unit, then reassembling on exit. some synchronisation will be required (if we allow early-out), however that would not be difficult to achieve.

so the primary need then will be flexibility (parameterisation) on a single block (as opposed to designing a parallel/SIMD engine): being able to specify the two key parameters of that single block, namely:

* the required pipeline depth
* the bit-width

i say that because if a third parameter is provided (amount of parallelism) it may actually become inconvenient to deploy in the IEEE754 FPU, as the parallelism would be achieved at a level way above the actual integer multiplier component.

for an integer-based ALU, on the other hand, that *will* need parallelism, however i feel that such a "parallelising" or "array" capability is still a separate task, needed for many many other purposes i.e. SIMD-ifying in general, for ADD, SHIFT, FMAC... everything in fact :)
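purely as an illustration of the byte-level partial-product idea mentioned at the top of this comment (not the shakti code, just a hand-written python reference sketch):

    def mul64_from_8bit_parts(a, b):
        """reference sketch: 64x64-bit unsigned multiply built from 8x8-bit
        partial products, each shifted into place and accumulated"""
        assert 0 <= a < 1 << 64 and 0 <= b < 1 << 64
        result = 0
        for i in range(8):                     # byte index into a
            a_byte = (a >> (8 * i)) & 0xff
            for j in range(8):                 # byte index into b
                b_byte = (b >> (8 * j)) & 0xff
                result += (a_byte * b_byte) << (8 * (i + j))
        return result                          # full 128-bit product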
cooooool, found the src/multiplier.py code - recursive code-generation (AddReduce), cooool :)

    self.partition_points = dict((
        (index, Value.wrap(value))
        for index, value in partition_points.items()))

can be done as:

    for idx, value in self.partition_points.items():
        self.partition_points[idx] = Value.wrap(value)

there's probably a... "nicer" way to do it, just can't think of it.
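(the "nicer" way is probably just a dict comprehension - sketch only, assuming Value.wrap behaves the same here:)

    self.partition_points = {index: Value.wrap(value)
                             for index, value in partition_points.items()}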
    def eq(self, rhs: 'PartitionPoints') -> Iterable[Assign]:
        if set(self.keys()) != set(rhs.keys()):
            raise ValueError("incompatible point set")
        for point, enabled in self.items():
            yield enabled.eq(rhs[enabled])

dict.items returns key, value - so that will need to be

        yield enabled.eq(rhs[point])

using the words "key, value" - as in the commonly-used phrase "key-value pairs" - helps make it clear:

        yield value.eq(rhs[key])
(In reply to Luke Kenneth Casson Leighton from comment #5)
>     def eq(self, rhs: 'PartitionPoints') -> Iterable[Assign]:
>         if set(self.keys()) != set(rhs.keys()):
>             raise ValueError("incompatible point set")
>         for point, enabled in self.items():
>             yield enabled.eq(rhs[enabled])
>
> dict.items returns key, value - so that will need to be
>
>         yield enabled.eq(rhs[point])
>
> using the words "key, value" - as in the commonly-used phrase
> "key-value pairs" - helps make it clear:
>
>         yield value.eq(rhs[key])

oops, fixed
I implemented the multiplier, it passes all tests. I don't have any tests for having pipeline registers in the multiplier yet. I also didn't yet add the adder input for FMA, that should be relatively trivial. I also need to add additional documentation.

The multiplier supports every combination of aligned 8/16/32/64 that fits in 64 bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh, mulhsu, and mulhu.

Review and comments welcome.

The code is at
https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
The tests are at
https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/test/test_multiply.py
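For reference, the per-lane semantics of those four RISC-V ops behave roughly like this (a hand-written Python illustration, not code from multiply.py):

    def ref_mul_ops(a, b, width=8):
        """Reference model of the four RISC-V multiply ops on one
        'width'-bit lane (illustration only)."""
        mask = (1 << width) - 1

        def signed(x):
            # reinterpret the low 'width' bits as a two's-complement value
            x &= mask
            return x - (1 << width) if x & (1 << (width - 1)) else x

        ss = signed(a) * signed(b)       # signed   x signed
        su = signed(a) * (b & mask)      # signed   x unsigned
        uu = (a & mask) * (b & mask)     # unsigned x unsigned
        return {
            "mul":    ss & mask,              # low half (same for all three)
            "mulh":   (ss >> width) & mask,   # high half, signed x signed
            "mulhsu": (su >> width) & mask,   # high half, signed x unsigned
            "mulhu":  (uu >> width) & mask,   # high half, unsigned x unsigned
        }

    # e.g. ref_mul_ops(0xff, 0x02) ->
    #      {'mul': 0xfe, 'mulh': 0xff, 'mulhsu': 0xff, 'mulhu': 0x01}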
ran the unit tests:

.E......
======================================================================
ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
----------------------------------------------------------------------
TypeError: get_test_cases() missing 2 required positional arguments: 'lanes' and 'keys'

----------------------------------------------------------------------
Ran 8 tests in 57.919s

FAILED (errors=1)
(In reply to Luke Kenneth Casson Leighton from comment #8)
> ran the unit tests:
>
> .E......
> ======================================================================
> ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
> ----------------------------------------------------------------------
> TypeError: get_test_cases() missing 2 required positional arguments: 'lanes'
> and 'keys'
>
> ----------------------------------------------------------------------
> Ran 8 tests in 57.919s
>
> FAILED (errors=1)

get_test_cases is not a unit test case, it's a utility function. I think it's
getting called by python's unittest package because you're using an older
(maybe newer?) version of python that checks for the string "test" anywhere in
the function name rather than just at the beginning.

Also, I find it takes a lot less time using pypy3 rather than python3. I can
get it to run all the tests in around 15-20s.
(In reply to Jacob Lifshay from comment #9)
> (In reply to Luke Kenneth Casson Leighton from comment #8)
> > ran the unit tests:
> >
> > .E......
> > ======================================================================
> > ERROR: get_test_cases (test.test_multiply.TestMul8_16_32_64)
> > ----------------------------------------------------------------------
> > TypeError: get_test_cases() missing 2 required positional arguments:
> > 'lanes' and 'keys'
> >
> > ----------------------------------------------------------------------
> > Ran 8 tests in 57.919s
> >
> > FAILED (errors=1)
>
> get_test_cases is not a unit test case, it's a utility function. I think
> it's getting called by python's unittest package because you're using an
> older (maybe newer?) version of python that checks for the string "test"
> anywhere in the function name rather than just at the beginning.

ah i ran nosetests3, that'll be why.

> Also, I find it takes a lot less time using pypy3 rather than python3. I can
> get it to run all the tests in around 15-20s.

oo that's interesting. and *sigh* of course it doesn't respect the
python3 library paths properly or in full *sigh*. i'll work that out
later.
(In reply to Luke Kenneth Casson Leighton from comment #10)
> (In reply to Jacob Lifshay from comment #9)
> > Also, I find it takes a lot less time using pypy3 rather than python3. I can
> > get it to run all the tests in around 15-20s.
>
> oo that's interesting. and *sigh* of course it doesn't respect the
> python3 library paths properly or in full *sigh*. i'll work that out
> later.

I use pypy3 v7.1.1-beta from https://pypy.org/download.html

I just treat it as a completely separate python installation, so I used
"pypy3 -m pip" to install everything.
(In reply to Jacob Lifshay from comment #7)
> I implemented the multiplier, it passes all tests.

fantastic.

> I don't have any tests for having pipeline registers in the multiplier yet.

ok, so this is for multi-stage multiply?

> I also didn't yet add the adder input for FMA, that should be relatively
> trivial.

yes.

> I also need to add additional documentation.
>
> The multiplier supports every combination of aligned 8/16/32/64 that fits in
> 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh,
> mulhsu, and mulhu.

excellent. it likely qualifies as "completed milestone". we can move
more extensive unit tests to a separate milestone.

> Review and comments welcome.

the formatting is a little... weird.

it's generally good practice to use vertical alignment to make it easier
for the reader to spot errors and differences vertically:

    lanes = [SIMDMulLane(True,
                         True,
                         8,
                         True),
             SIMDMulLane(False,
                         False,
                         8,
                         True),
             SIMDMulLane(True,
                         True,
                         16,
                         False),
             SIMDMulLane(True,
                         False,
                         32,
                         True)]

is almost impossible to see what the differences are between the 4 cases.

    lanes = [SIMDMulLane(True, True, 8, True),
        SIMDMulLane(False, False, 8, True),
        SIMDMulLane(True, True, 16, False),
        SIMDMulLane(True, False, 32, True)]

is trivial to see what the differences are

    lanes = [SIMDMulLane(True, True, 8, True),
             SIMDMulLane(False, False, 8, True),
             SIMDMulLane(True, True, 16, False),
             SIMDMulLane(True, False, 32, True)]

is extremely clear what the differences between the four cases are.

----

    def async_process() -> AsyncProcessGenerator:
        for a_signed in False, True:

is so heavily indented (due to it being a nested function) that again,
readability is compromised.

            yield from self.subtest_mul8_16_32_64([SIMDMulLane(False,
                                                               False,
                                                               32,
                                                               True),

moving it to a global function would reduce the indentation by a whopping
3 levels, allowing the code size to be hugely reduced in the process

----

all of these:

    self._intermediate_output = Signal(128)
    self._part_8 = [Signal(name=f"_part_8_{i}") for i in range(8)]
    self._part_16 = [Signal(name=f"_part_16_{i}") for i in range(4)]
    self._part_32 = [Signal(name=f"_part_32_{i}") for i in range(2)]
    self._part_64 = Signal()
    self._output_64 = Signal(64)
    self._output_32 = Signal(64)
    self._output_16 = Signal(64)
    self._output_8 = Signal(64)
    self._a_signed = [Signal(name=f"_a_signed_{i}") for i in range(8)]
    self._b_signed = [Signal(name=f"_b_signed_{i}") for i in range(8)]
    self._not_a_term_8 = Signal(128)
    self._neg_lsb_a_term_8 = Signal(128)
    self._not_b_term_8 = Signal(128)
    self._neg_lsb_b_term_8 = Signal(128)
    self._not_a_term_16 = Signal(128)
    self._neg_lsb_a_term_16 = Signal(128)
    self._not_b_term_16 = Signal(128)
    self._neg_lsb_b_term_16 = Signal(128)
    self._not_a_term_32 = Signal(128)
    self._neg_lsb_a_term_32 = Signal(128)
    self._not_b_term_32 = Signal(128)
    self._neg_lsb_b_term_32 = Signal(128)
    self._not_a_term_64 = Signal(128)
    self._neg_lsb_a_term_64 = Signal(128)
    self._not_b_term_64 = Signal(128)
    self._neg_lsb_b_term_64 = Signal(128)

can be moved *into* the elaborate function, the self. removed *and* the
underscore.

they're entirely local to the function named elaborate: therefore there
is absolutely no need to expose them (at all) to the public API.

that they're in the __init__ means that it is almost impossible to
work out what the inputs and outputs of the function are.

unless there is an intention to derive from Mul8_16_32_64 at some point
in the future?

commenting and grouping the inputs and outputs is pretty essential.

    # are these inputs, are they outputs? or intermediaries?
    # it's impossible to tell
    self.part_pts = PartitionPoints()
    for i in range(8, 64, 8):
        self.part_pts[i] = Signal(name=f"part_pts_{i}")
    self.part_op = [Signal(2, name=f"part_op_{i}") for i in range(8)]

    # input operand
    self.a = Signal(64)
    self.b = Signal(64)

    # output operand
    self.output = Signal(64)

----

add_term is small enough (in this case) to keep as a locally-scoped function.
the use of typing however is still interfering with readability, by forcing
the unnecessary single-line-per-function-argument that most python developers
are used to.

-----

    def __init__(self, partition_points: PartitionPointsIn = {}):
        super().__init__()

no! it is *absolutely essential* never to declare a dictionary as an
argument to a function!

this is the pattern for creating a *singleton*! any modification of that
argument, partition_points, will result in *all* uses of that class seeing
the *modified* partition_points! this is down to dictionaries being
modifiable. if it was a tuple () it would be ok, because tuples are not
modifiable after creation.

replace with this:

    def __init__(self, partition_points: PartitionPointsIn = None):
        super().__init__()
        if partition_points is None:
            partition_points = {}

----

    for not_a_term, \
        neg_lsb_a_term, \
        not_b_term, \
        neg_lsb_b_term, \
        parts in [
            (self._not_a_term_8,
             self._neg_lsb_a_term_8,
             self._not_b_term_8,
             self._neg_lsb_b_term_8,
             self._part_8),
            (self._not_a_term_16,
             self._neg_lsb_a_term_16,
             self._not_b_term_16,
             self._neg_lsb_b_term_16,
             self._part_16),
            (self._not_a_term_32,
             self._neg_lsb_a_term_32,
             self._not_b_term_32,
             self._neg_lsb_b_term_32,
             self._part_32),
            (self._not_a_term_64,
             self._neg_lsb_a_term_64,
             self._not_b_term_64,
             self._neg_lsb_b_term_64,
             [self._part_64]),
            ]:

yuck :)

and unfortunately, the names of those variables are too long (even when
self._ is removed) to include on one line.

reducing them to "notaterm_8" or removing the word "term" - not_a_8,
neg_lsb_a_32, etc. would allow that to happen.

this would improve readability greatly and reduce code length (again,
increasing readability)

---- (continuing from above)

the fact that they are "terms" may be commented above the block
(which has a comment missing explaining what the for-loop is for)

    # something about this for-loop, which is impossible to
    # determine from the contents of the loop itself as it's
    # too code-dense, which deliberately includes the word "term"
    # to make it clear that the 4 things are "terms"
    for not_a, neg_lsb_a, not_b, neg_lsb_b, parts in [
            (not_a_8, neg_lsb_a_8, not_b_8, neg_lsb_b_8, part_8),
            ....
            ....

----

generally, there are comments missing which explain each code block
and allow the reader to understand what is going on. in particular,
any code-heavily-dense block it is essential to have an introductory
explanation as to what that block is for.

----

    groups = self.full_adder_groups()
    if len(groups) == 0:
        if len(self.inputs) == 0:
            m.d.comb += self.output.eq(0)
        elif len(self.inputs) == 1:
            m.d.comb += self.output.eq(self._resized_inputs[0])
        else:
            assert len(self.inputs) == 2
            adder = PartitionedAdder(len(self.output),
                                     self._reg_partition_points)
            m.submodules.final_adder = adder
            m.d.comb += adder.a.eq(self._resized_inputs[0])
            m.d.comb += adder.b.eq(self._resized_inputs[1])
            m.d.comb += self.output.eq(adder.output)
        return m

a comment is needed to divide this block from the code above it.
also, why does the module return early if len(groups) == 0?
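to illustrate the hazard (a generic python demonstration, nothing to do with multiply.py itself):

    # generic demonstration of the mutable-default-argument hazard
    def add_point(point, points={}):   # the {} is created ONCE, at "def" time
        points[point] = True
        return points

    first = add_point(8)
    second = add_point(16)
    print(first is second)   # True  - both calls share the very same dict
    print(first)             # {8: True, 16: True} - "independent" calls interfere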
(In reply to Jacob Lifshay from comment #11)
> I use pypy3 v7.1.1-beta from https://pypy.org/download.html
> I just treat it as a completely separate python installation, so I used
> "pypy3 -m pip" to install everything.

ok that explains it. it's identical to just installing a new minor version of
python3 (3.5, 3.6, 3.7 etc.)
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #7)
> > I don't have any tests for having pipeline registers in the multiplier yet.
>
> ok, so this is for multi-stage multiply?

it has an input that allows you to specify which levels of the carry-save
adder tree have registers and which levels are wired straight through.

>
> > I also didn't yet add the adder input for FMA, that should be relatively
> > trivial.
>
> yes.
>
> > I also need to add additional documentation.
> >
> > The multiplier supports every combination of aligned 8/16/32/64 that fits in
> > 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh,
> > mulhsu, and mulhu.
>
> excellent. it likely qualifies as "completed milestone". we can move
> more extensive unit tests to a separate milestone.
>
> > Review and comments welcome.
>
> the formatting is a little... weird.
>
> it's generally good practice to use vertical alignment to make it easier
> for the reader to spot errors and differences vertically:
>
>     lanes = [SIMDMulLane(True,
>                          True,
>                          8,
>                          True),
>              SIMDMulLane(False,
>                          False,
>                          8,
>                          True),
>              SIMDMulLane(True,
>                          True,
>                          16,
>                          False),
>              SIMDMulLane(True,
>                          False,
>                          32,
>                          True)]
>
> is almost impossible to see what the differences are between the 4 cases.
>
>     lanes = [SIMDMulLane(True, True, 8, True),
>         SIMDMulLane(False, False, 8, True),
>         SIMDMulLane(True, True, 16, False),
>         SIMDMulLane(True, False, 32, True)]
>
> is trivial to see what the differences are
>
>     lanes = [SIMDMulLane(True, True, 8, True),
>              SIMDMulLane(False, False, 8, True),
>              SIMDMulLane(True, True, 16, False),
>              SIMDMulLane(True, False, 32, True)]
>
> is extremely clear what the differences between the four cases are.

yeah, I'll fix that.

>
> ----
>
>     def async_process() -> AsyncProcessGenerator:
>         for a_signed in False, True:
>
> is so heavily indented (due to it being a nested function) that again,
> readability is compromised.
>
>             yield from self.subtest_mul8_16_32_64([SIMDMulLane(False,
>                                                                False,
>                                                                32,
>                                                                True),
>
> moving it to a global function would reduce the indentation by a whopping
> 3 levels, allowing the code size to be hugely reduced in the process

unfortunately it can't be a global function, since it uses self from the
surrounding function but can't take any arguments itself.

>
> ----
>
> all of these:
>
>     self._intermediate_output = Signal(128)
>     self._part_8 = [Signal(name=f"_part_8_{i}") for i in range(8)]
...
>     self._neg_lsb_b_term_64 = Signal(128)
>
> can be moved *into* the elaborate function, the self. removed *and* the
> underscore.
>
> they're entirely local to the function named elaborate: therefore there
> is absolutely no need to expose them (at all) to the public API.

They're exposed so the test framework can access them, otherwise they are
private.

>
> that they're in the __init__ means that it is almost impossible to
> work out what the inputs and outputs of the function are.
>
> unless there is an intention to derive from Mul8_16_32_64 at some point
> in the future?
>
> commenting and grouping the inputs and outputs is pretty essential.
>
>     # are these inputs, are they outputs? or intermediaries?
>     # it's impossible to tell
>     self.part_pts = PartitionPoints()
>     for i in range(8, 64, 8):
>         self.part_pts[i] = Signal(name=f"part_pts_{i}")
>     self.part_op = [Signal(2, name=f"part_op_{i}") for i in range(8)]
>
>     # input operand
>     self.a = Signal(64)
>     self.b = Signal(64)
>
>     # output operand
>     self.output = Signal(64)

all of part_pts, part_op, a, and b are inputs; output is an output.

>
> ----
>
> add_term is small enough (in this case) to keep as a locally-scoped
> function.
> the use of typing however is still interfering with readability, by forcing
> the unnecessary single-line-per-function-argument that most python
> developers are used to.
>
> -----
>
>     def __init__(self, partition_points: PartitionPointsIn = {}):
>         super().__init__()
>
> no! it is *absolutely essential* never to declare a dictionary as an
> argument to a function!
>
> this is the pattern for creating a *singleton*!
>

I get that, however, partition_points is not modified or stored in an
attribute, so having a singleton default argument doesn't hurt and actually
might save memory, since an extra dict doesn't need to be allocated each time.

> ----
>
>     for not_a_term, \
>         neg_lsb_a_term, \
>         not_b_term, \
>         neg_lsb_b_term, \
>         parts in [
>             (self._not_a_term_8,
>              self._neg_lsb_a_term_8,
>              self._not_b_term_8,
>              self._neg_lsb_b_term_8,
>              self._part_8),
>             (self._not_a_term_16,
>              self._neg_lsb_a_term_16,
>              self._not_b_term_16,
>              self._neg_lsb_b_term_16,
>              self._part_16),
>             (self._not_a_term_32,
>              self._neg_lsb_a_term_32,
>              self._not_b_term_32,
>              self._neg_lsb_b_term_32,
>              self._part_32),
>             (self._not_a_term_64,
>              self._neg_lsb_a_term_64,
>              self._not_b_term_64,
>              self._neg_lsb_b_term_64,
>              [self._part_64]),
>             ]:
>
> yuck :)

I should probably change them to arrays or dicts, so there would be a single
self._not_a_term initialized to {8: Signal(...), 16: Signal(...), ...}.
similarly for _neg_lsb_a_term, _not_b_term, _neg_lsb_b_term, and _part
(which I might name parts)

>
> and unfortunately, the names of those variables are too long (even when
> self._ is removed) to include on one line.
>
> reducing them to "notaterm_8" or removing the word "term" - not_a_8,
> neg_lsb_a_32, etc. would allow that to happen.
>
> this would improve readability greatly and reduce code length (again,
> increasing readability)
>
> ---- (continuing from above)
>
> the fact that they are "terms" may be commented above the block
> (which has a comment missing explaining what the for-loop is for)
>
>     # something about this for-loop, which is impossible to
>     # determine from the contents of the loop itself as it's
>     # too code-dense, which deliberately includes the word "term"
>     # to make it clear that the 4 things are "terms"
>     for not_a, neg_lsb_a, not_b, neg_lsb_b, parts in [
>             (not_a_8, neg_lsb_a_8, not_b_8, neg_lsb_b_8, part_8),
>             ....
>             ....
>
> ----
>
> generally, there are comments missing which explain each code block
> and allow the reader to understand what is going on. in particular,
> any code-heavily-dense block it is essential to have an introductory
> explanation as to what that block is for.

yeah, I was just writing code to get it to pass the tests, rather than
writing lots of comments. I'll be fixing that.
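A rough sketch of what that dict-keyed version might look like (hypothetical, not yet in multiply.py; names and widths illustrative only):

    # hypothetical sketch: one dict per term kind, keyed by partition width,
    # replacing the 16 individual attributes
    self._not_a_term = {w: Signal(128, name=f"_not_a_term_{w}")
                        for w in (8, 16, 32, 64)}
    self._parts = {8: self._part_8, 16: self._part_16,
                   32: self._part_32, 64: [self._part_64]}

    # ... the for-loop in elaborate() then just walks the widths:
    for w in (8, 16, 32, 64):
        not_a, parts = self._not_a_term[w], self._parts[w]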
>
> ----
>
>     groups = self.full_adder_groups()
>     if len(groups) == 0:
>         if len(self.inputs) == 0:
>             m.d.comb += self.output.eq(0)
>         elif len(self.inputs) == 1:
>             m.d.comb += self.output.eq(self._resized_inputs[0])
>         else:
>             assert len(self.inputs) == 2
>             adder = PartitionedAdder(len(self.output),
>                                      self._reg_partition_points)
>             m.submodules.final_adder = adder
>             m.d.comb += adder.a.eq(self._resized_inputs[0])
>             m.d.comb += adder.b.eq(self._resized_inputs[1])
>             m.d.comb += self.output.eq(adder.output)
>         return m
>
> a comment is needed to divide this block from the code above it.
> also, why does the module return early if len(groups) == 0?

it returns early because this is the base case in a recursive module
instantiation. if it didn't, there would be a stack overflow at
elaborate-time.

basically, each invocation converts the input terms to
2 * (len(inputs) // 3) + len(inputs) % 3 terms for the next recursive call,
using len(inputs) // 3 carry-save adders, where len(groups) is
len(inputs) // 3. Once len(inputs) < 3, it executes the base case, which
adds the last 2 inputs together using a traditional adder.

I'll be adding that or similar to a comment.
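A quick pure-Python model of that term-count recursion (illustrative only; the real AddReduce works on Signals, not ints):

    def reduction_levels(n_inputs):
        """Count how many AddReduce levels run before the <3-input base case,
        using the 2*(n//3) + n%3 rule described above (model only)."""
        levels = 0
        while n_inputs > 2:
            n_inputs = 2 * (n_inputs // 3) + n_inputs % 3
            levels += 1
        return levels

    # e.g. 64 terms -> 43 -> 29 -> 20 -> 14 -> 10 -> 7 -> 5 -> 4 -> 3 -> 2,
    # so reduction_levels(64) == 10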
I was thinking it might be a good idea to write an update for crowdsupply describing the implementation of the multiplier and how I arrived at the code I have now from initial principles.
(In reply to Jacob Lifshay from comment #15)
> I was thinking it might be a good idea to write an update for crowdsupply
> describing the implementation of the multiplier and how I arrived at the
> code I have now from initial principles.

that's a really good idea. you have access to the repo:
https://git.libre-riscv.org/?p=crowdsupply.git;a=summary

good responses, good justifications for what you did (in the previous
comment): use the answers you gave to create comments. in particular make
sure that the recursive case is commented, as it wasn't clear that the
function *is* used recursively! (no comments to that effect, i didn't
notice).

the singleton thing is so unsafe, it causes so many problems and wastes so
much time trying to track it down, it's just not worth the hassle of trying
to save 1 line of code. please use the pattern that i provided instead.
jacob, the simpler way to do what is done below is:

    def __init__(self, partition_points: PartitionPointsIn = None):
        if partition_points is None:
            partition_points = {}
        for ....

there is no need to have the entire block under a conditional "is not None".

if you absolutely must have the entire block under a conditional
"is not None" test, it is better to do this instead:

        if partition_points is None:
            return
        for .....

@@ -34,18 +38,27 @@ class PartitionPoints(Dict[int, Value]):
          * bits 10 <= ``i`` < 16
      """
 
-    def __init__(self, partition_points: PartitionPointsIn = {}):
+    def __init__(self, partition_points: Optional[PartitionPointsIn] = None):
+        """Create a new ``PartitionPoints``.
+
+        :param partition_points: the input partition points to values mapping.
+        """
         super().__init__()
-        for point, enabled in partition_points.items():
-            if not isinstance(point, int):
-                raise TypeError("point must be a non-negative integer")
-            if point < 0:
-                raise ValueError("point must be a non-negative integer")
-            self[point] = Value.wrap(enabled)
+        if partition_points is not None:
+            for point, enabled in partition_points.items():
+                if not isinstance(point, int):
+                    raise TypeError("point must be a non-negative integer")
+                if point < 0:
+                    raise ValueError("point must be a non-negative integer")
+                self[point] = Value.wrap(enabled)
the convention for docstrings is:

    +    """ Assign ``PartitionPoints`` using ``Signal.eq``.
    +    """

    +    """ Create a bit-mask from `self`.
    +
    +        Each bit in the returned mask is clear only if the partition point
    +        at the same bit-index is enabled.
    +
    +        :param width: the bit width of the resulting mask
    +    """

* one space after the introductory """
* alignment of ALL text starting lined up with the first word, NOT with the """
* ending """ on its own blank line

reasons are as follows:

* space after the introductory """ makes it clear
* alignment of the text vertically makes for increased readability
* the alignment """ SPACE happens coincidentally to be a multiple of 4.
  this matches python's 4-space-per-indentation pep8 rule.
* the use of """ at the end on its own line provides for a single break
  between the __init__ part and the start of the code

indent one space between """ and A:

    +        """Assign ``PartitionPoints`` using ``Signal.eq``."""

start ending """ on new line:

            if set(self.keys()) != set(rhs.keys()):
                raise ValueError("incompatible point set")
            for point, enabled in self.items():
                yield enabled.eq(rhs[point])

        def as_mask(self, width: int) -> Value:

indent one space between """ and C:

    +        """Create a bit-mask from `self`.
    +
    +        Each bit in the returned mask is clear only if the partition point at
    +        the same bit-index is enabled.

indent these by four spaces:

    +        :param width: the bit width of the resulting mask

indent this by four spaces:

    +        """
also, having the comment text at the same level as the """ gives the false impression that the comment is somehow executable (part of the function). by indenting it, a clear vertical indicator is given, so that the start and end """ string-markers can be used to help a reader very rapidly locate where the comment begins.

if that convention is not followed, the reader has to spend far more time than is strictly necessary parsing the text and wondering where the heck the comment actually ends.

vertical alignment is far more important in python than in other languages, and not just for the code itself: there are conventions that dramatically increase readability.
(In reply to Luke Kenneth Casson Leighton from comment #18)
> the convention for docstrings is:

I'm planning on changing the docs style to match numpy.

For most of the complaints, I had changed the style to match what pydocstyle
expects:
http://www.pydocstyle.org/en/stable/error_codes.html

pydocstyle is based on pep257, which follows the same style of having no
space after the opening quotes, having the multi-line docstring indented to
the same depth as the first line, and having the closing quotes on the same
line as the opening quotes for single-line docstrings.
(In reply to Jacob Lifshay from comment #20)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > the convention for docstrings is:
> I'm planning on changing the docs style to match numpy.
>
> For most of the complaints, I had changed the style to match what pydocstyle
> expects:
> http://www.pydocstyle.org/en/stable/error_codes.html
> pydocstyle is based on pep257, which follows the same style of having no
> space after the opening quotes, having the multi-line docstring indented to
> the same depth as the first line, and having the closing quotes on the same
> line as the opening quotes for single-line docstrings.

https://github.com/numpy/numpy/blob/master/numpy/core/code_generators/ufunc_docstrings.py

The style is... tolerable. Barely. Placing the one-line summary *under* the
opening """ makes for readability by highlighting the function definition.

It is still not as good, I feel, as indenting by 8 spaces, which has the
advantage of leaving the function def free from visual interference, allowing
the reader's brain to utilise less energy to find where the actual function
code begins.

Note that by not having a space between the summary and the opening """ this
is a numpy style violation.

Must look up pydocstyle, I think there is an autostyle converter for it, just
like autopep8.
https://github.com/numpy/numpy/blob/master/numpy/core/code_generators/generate_umath.py#L34

I am really shocked that such a major project would have made a decision to
smash the comment start hard against the summary text. It looks absolutely
terrible, and the only reason they get away with it is because the numpy
project is so large.

Argh noo, the main cpython source does it as well, arrrgh:
https://github.com/python/cpython/blob/master/Lib/cmd.py

sigh. I will have to investigate a bit more with pydocstyle, it is a debian
package that autochecks code. This is kinda important as the documentation
generators critically depend on docstrings. Must raise separate bugreport.
i've put multiply.py and test_multiply.py here:
https://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/part_mul_add;h=af3634054ac530f4fdc7b2877fbf370b9b1bb594;hb=43de8ed4528ea7627cfe2cbac9374074b22687ae

i've started the process of creating modules and using Cat() instead of
individual bit-wise assignment, plus identified some invariants and assigned
them to Signals (_part_byte() is a good example) rather than compute them
multiple times.

even with a (new) module called Term, the graphviz diagram on Mul_8_16_32_64
is innnnsaaaaane.

also, there's 8x8 intermediate product results of 16-bits each behind Muxes:
to save power i believe it should be the case that a and b go through the
Muxes (one each)... *then* the results of that go through the multiply.
otherwise, the multiply is always active (using power), and there's 64 of
them: that would mean (i believe) that on an 8x 8-bit SIMD, 64 multiplications
would still be performed.
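something along these lines is what i mean - a hand-written nmigen sketch of the principle (two lanes, one shared multiplier, all names hypothetical), not the actual Term module:

    from nmigen import Elaboratable, Module, Mux, Signal

    class MuxedMul8(Elaboratable):
        """sketch only: the operands are muxed *before* the single 8x8
        multiplier, so the unselected byte never reaches the multiplier
        (instead of computing every 8x8 product and muxing the results)"""
        def __init__(self):
            self.a = Signal(16)        # two 8-bit lanes packed together
            self.b = Signal(16)
            self.sel_hi = Signal()     # which lane is wanted this cycle
            self.product = Signal(16)

        def elaborate(self, platform):
            m = Module()
            a_sel = Signal(8)
            b_sel = Signal(8)
            # mux the inputs, then multiply - not multiply-everything-then-mux
            m.d.comb += [
                a_sel.eq(Mux(self.sel_hi, self.a[8:16], self.a[0:8])),
                b_sel.eq(Mux(self.sel_hi, self.b[8:16], self.b[0:8])),
                self.product.eq(a_sel * b_sel),
            ]
            return m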
finally got to the point where multiply.py is subdivided into top-level interconnected modules, where there is *zero* "logic" at the top level, only names (and renaming associated with module inputs/outputs).

it's still completely mad, however at least it's understandable.