Research into carry-less multiply and other GF(2^N)) operations is needed, as these are commonly used in cryptographic applications. Modules have been written in nmigen with unit tests and Formal Correctness Proofs --- Instructions: * clmul, clmulh, clmulr * cldiv, clrem (maybe) -- decide what divide by zero does * clmadd (maybe) * cltmadd twin for fft (maybe) Using budget estimation factor of 2.70 EUR per line of code from bug #1025 Steps (edit as needed): * basic adaptable modules * CLMulAdd * DONE: module in nmigen-gf * DONE: unit test * DONE: formal * DONE: fix reference to nmutil.openpower_sv_bitmanip_in_wiki.* * commit c38fbbbea24085ce0d719f59bc5ec9e9d329b7eb * nmigen-gf.git/src/nmigen_gf/hdl/clmul.py 77 lines * nmigen-gf.git/src/nmigen_gf/hdl/test/test_clmul.py 110 lines * EUR 500 * CLDivRem * DONE: see if algorithm linked in comment #34 is what's wanted no, it's not: comment #36 * DONE: module in nmigen-gf * DONE: unit test * TODO: formal * commit c38fbbbea24085ce0d719f59bc5ec9e9d329b7eb * nmigen-gf.git/src/nmigen_gf/hdl/cldivrem.py 382 lines * nmigen-gf.git/src/nmigen_gf/hdl/test/test_cldivrem.py 294 lines * about half copy-paste, so only counting 150 lines * estimate formal test at 200 lines * EUR 2000 -- EUR 1500 without formal test the following tasks have a merged budget estimate of EUR 500: * nothing to do: add encoding of cl* to SVP64Asm class (as a 32bit op) * rendered unnecessary thanks to insndb * TODO: Instruction Encodings * estimate at 20 lines of code/text * TODO: add cl* to TBD pipe(s) * estimated equivalent to 50 lines since the rest is mostly just copying * TODO: fu unit tests * estimated equivalent to 50 lines since the rest is mostly just copying * TODO: fu formal * just for clmul since cldivrem is most likely only possible to formally prove for small word sizes -- 64-bits is likely too big * estimated equivalent to 50 lines since the rest is mostly just copying

I added CLMulAdd, a combinatorial carry-less multiply-add unit that uses tree-reduction to do all the XORs in the multiply add, that way, a 64x64 clmul should have a gate delay of 7 (1 layer of and gates and 6-deep trees of xor gates). https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=e681da8ca9d8c9ff461eba9f3ff045e40f249dc2 Here's a nice 4x4 clmul that I generated: https://ftp.libre-soc.org/clmul_4x4.svg Commands: python src/nmutil/test/test_clmul.py TestCLMulAdd.test_4x4 yosys <<'EOF' read_rtlil sim_test_out/__main__.TestCLMulAdd.test_4x4/0.il flatten synth ;;; show -stretch -colors 5 -format svg -prefix clmul_4x4 EOF

A 64x64+64 clmuladd gives: === top === Number of wires: 8076 Number of wire bits: 17022 Number of public wires: 74 Number of public wire bits: 9020 Number of memories: 0 Number of memory bits: 0 Number of processes: 0 Number of cells: 8129 $_AND_ 2731 $_NAND_ 1365 $_XNOR_ 63 $_XOR_ 3970 yosys likes alternating between and/nand and xor/xnor apparently.

I started adding CLDivRem, but ran into a yosys bug: https://github.com/YosysHQ/yosys/issues/3268 I added TestEqualLeadingZeroCount which will be used in the cldivrem algorithm for testing if two GF(2) polynomials have equal degrees. It should have a similar complexity to a binary addition (it uses binary addition internally, the extraneous xor gates from addition are optimized out by yosys). https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482;hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0

(In reply to Jacob Lifshay from comment #3) > I started adding CLDivRem, but ran into a yosys bug: > https://github.com/YosysHQ/yosys/issues/3268 intriguing. relying on complex switch/case, not so hot, eh? :) > I added TestEqualLeadingZeroCount which will be used in the cldivrem > algorithm for testing if two GF(2) polynomials have equal degrees. hmm hmm i was looking at this and thought, actually, finding the minimum cntlz of two things might be more useful? https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=gf_reference/gfpinv.py;h=45b6dbbdd4cb19bdbfa0538fb9dfcb79981655fe;hb=926798b6e232f09a36773be07de2d90e3fd431a4 > It should > have a similar complexity to a binary addition (it uses binary addition > internally, the extraneous xor gates from addition are optimized out by > yosys). are or should be? relying on complex low-level capability of yosys is not... wise. > https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/ > cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482; > hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0 nextpnr-ecp5 doesn't have carry-out, and both symbiflow and nextpnr-xilinx's use of CARRY4 are borked at the moment. also the balancing of carry propagation is not very good in yosys. spot the common theme here... :)

73 for i in range(self.width): 74 with m.Switch(Cat(self.a[i], self.b[i])): 75 with m.Case('11'): 76 # both have no leading zeros so far, so set carry 77 m.d.comb += [ 78 addend1[i].eq(1), 79 addend2[i].eq(1), 80 ] 81 with m.Case('01', '10'): 82 # different number of leading zeros, so clear carry 83 m.d.comb += [ 84 addend1[i].eq(0), 85 addend2[i].eq(0), 86 ] 87 with m.Case('00'): 88 # propagate results from lower bits 89 m.d.comb += [ 90 addend1[i].eq(1), 91 addend2[i].eq(0), 92 ] that's just: comb += addend1.eq(a ^ b) comb += addend2.eq(a | b)

my intuition tells me that this may prove useful, to replace the use of an adder. https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD relying on yosys to spot that certain gates are redundant is not sensible.

(In reply to Jacob Lifshay from comment #1) > I added CLMulAdd, a combinatorial carry-less multiply-add unit nice. > that uses > tree-reduction to do all the XORs in the multiply add, +class BitwiseXorReduce(Elaboratable): + """Bitwise Xor lots of stuff together by using tree-reduction on each bit. + + Properties: if it is the same can you please replace that with the tree_reduce function i created 18 months ago? https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/util.py;h=6f22179eff5ae8078d45f2a0b001a4f6f2045401;hb=e681da8ca9d8c9ff461eba9f3ff045e40f249dc2#l20 code duplication turns into a maintenance nightmare very quickly. also that's not actual tree-reduction, it's a chain that assumes yosys will sort it. i have said multiple times for over 18 months not to do that. > Here's a nice 4x4 clmul that I generated: > https://ftp.libre-soc.org/clmul_4x4.svg it's so prettyyy, fascinating how it creates a mix of NAND AND and XOR.

19 class BitwiseXorReduce(Elaboratable): 29 def __init__(self, input_values): 30 self.input_values = tuple(map(Value.cast, input_values)) this is confusing as hell and so non-standard it will freak people out. 89 def __reduce_inputs(self): 90 for shift in range(self.factor_width): 91 mask = Repl(self.factor2[shift], self.factor_width) 92 yield (self.factor1 & mask) << shift 93 yield from self.terms 94 95 def elaborate(self, platform): 96 m = Module() 97 xor_reduce = BitwiseXorReduce(self.__reduce_inputs()) yeah don't use non-standard styles like this please. have BitwiseXorReduce either explicitly set up Signals as inputs, pass in its list of Signals, or pass in a list of Signal lengths. or, better, just a width and an array quantity. then the zero-extension becomes irrelevant / redundant and can be removed. the assignment to the inputs can be done explicitly by enumerating self.reduce_inputs() and comb assigning thrm one by one to the xor_reduce.input_values using a straightforward zip() for i1, i2 in zip(xor_reduce.input_values, self.reduce_imputs()): comb += i2.eq(i2) also add a docstring to reduce_inputs to let people know it's doing a shift-and-mask, but explain it in terms of why factor2 was set up rather thab repeat what the code actually does.

95 def elaborate(self, platform): 96 m = Module() 97 xor_reduce = BitwiseXorReduce(self.__reduce_inputs()) 98 m.submodules.xor_reduce = xor_reduce 99 m.d.comb += self.output.eq(xor_reduce.output) 100 return m yeah looking at the comment in line 26 that can be entirely replaced with from nmutil.util imprt treereduce comb += treereduce(self.reduce_inputs(), operator.xor) please delete BitwiseXorReduce entirely. if the HDL generated is complex (unreadable in show top) then have self.reduce_inputs() store its yielded terms in intermediary Signals (pass m as a parameter to reduce_inputs in order to do so)

(In reply to Luke Kenneth Casson Leighton from comment #6) > my intuition tells me that this may prove useful, to replace the use > of an adder. > > https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD actually that ripple logic is exactly what I was trying to avoid...it would generate a 64-deep chain of gates. binary addition is optimized by yosys to use carry look-ahead (they can be depended on not to break that since it would make all binary arithmetic have such a huge latency that basically all cpus using yosys would no longer meet timing). > > relying on yosys to spot that certain gates are redundant is not sensible. it's perfectly sensible since the xor gates have unused outputs therefore they are trivially removed by `opt` (after the addition is converted to individual gates). `abc` (run by synth -- abc is a totally different project that couldn't realistically be borked by yosys's developers too) also has passes to remove unused gates.

(In reply to Luke Kenneth Casson Leighton from comment #4) > (In reply to Jacob Lifshay from comment #3) > > I started adding CLDivRem, but ran into a yosys bug: > > https://github.com/YosysHQ/yosys/issues/3268 > > intriguing. relying on complex switch/case, not so hot, eh? :) the complex switch case came from the dumb-and-obvious method i use in the formal proof in TestEqualLeadingZeroCount. apparently yosys tries to create a hashmap with an entry for each possible input value -- exponential in input size. > > I added TestEqualLeadingZeroCount which will be used in the cldivrem > > algorithm for testing if two GF(2) polynomials have equal degrees. > > hmm hmm i was looking at this and thought, actually, finding the > minimum cntlz of two things might be more useful? that would be useful (or actually the count trailing zeros variant), but not for cldivrem. the minimum cnttz is simply cnttz(a | b), so we don't need a separate implementation of that. > https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=gf_reference/gfpinv.py; > h=45b6dbbdd4cb19bdbfa0538fb9dfcb79981655fe; > hb=926798b6e232f09a36773be07de2d90e3fd431a4 > > > It should > > have a similar complexity to a binary addition (it uses binary addition > > internally, the extraneous xor gates from addition are optimized out by > > yosys). > > are or should be? relying on complex low-level capability of yosys is > not... wise. > > > https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/ > > cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482; > > hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0 > > nextpnr-ecp5 doesn't have carry-out, and both symbiflow and > nextpnr-xilinx's use of CARRY4 are borked at the moment. > also the balancing of carry propagation is not very good in > yosys. it's waay better than a 64-deep chain of gates, otherwise all cpus using yosys would fail timing.

(In reply to Luke Kenneth Casson Leighton from comment #9) > comb += treereduce(self.reduce_inputs(), operator.xor) yeah, that's simpler.

(In reply to Luke Kenneth Casson Leighton from comment #5) > 73 for i in range(self.width): > 74 with m.Switch(Cat(self.a[i], self.b[i])): > 75 with m.Case('11'): > 76 # both have no leading zeros so far, so set carry > 77 m.d.comb += [ > 78 addend1[i].eq(1), > 79 addend2[i].eq(1), > 80 ] > 81 with m.Case('01', '10'): > 82 # different number of leading zeros, so clear carry > 83 m.d.comb += [ > 84 addend1[i].eq(0), > 85 addend2[i].eq(0), > 86 ] > 87 with m.Case('00'): > 88 # propagate results from lower bits > 89 m.d.comb += [ > 90 addend1[i].eq(1), > 91 addend2[i].eq(0), > 92 ] > > that's just: > > comb += addend1.eq(a ^ b) > comb += addend2.eq(a | b) I explicitly have it more complex so I can explain all the cases and to match 1:1 with the reference algorithm.

Luke, you broke the code: https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438 you keep complaining I never run the tests...apparently I'm not the only one who forgets.

(In reply to Jacob Lifshay from comment #14) > Luke, you broke the code: > https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438 > > you keep complaining I never run the tests...apparently I'm not the only one > who forgets. i didn't forget: running anything (at all) fails due to the local imports. $ python3 gf_reference/test_cl_gfb_gfp.py Traceback (most recent call last): File "gf_reference/test_cl_gfb_gfp.py", line 1, in <module> from .state import ST ModuleNotFoundError: No module named '__main__.state'; '__main__' is not a package

(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Jacob Lifshay from comment #14) > > Luke, you broke the code: > > https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438 > > > > you keep complaining I never run the tests...apparently I'm not the only one > > who forgets. > > i didn't forget: running anything (at all) fails due to the > local imports. > > $ python3 gf_reference/test_cl_gfb_gfp.py > Traceback (most recent call last): > File "gf_reference/test_cl_gfb_gfp.py", line 1, in <module> > from .state import ST > ModuleNotFoundError: No module named '__main__.state'; '__main__' is not a > package run it using: pytest -n auto src just like in .gitlab-ci.yml or, if you don't want to use pytest: python3 -m unittest src/nmigen_gf/reference/test_cl_gfb_gfp.py

(In reply to Jacob Lifshay from comment #10) > (In reply to Luke Kenneth Casson Leighton from comment #6) > > my intuition tells me that this may prove useful, to replace the use > > of an adder. > > > > https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD > > actually that ripple logic is exactly what I was trying to avoid...it would > generate a 64-deep chain of gates. binary addition is optimized by yosys to > use carry look-ahead (they can be depended on not to break that since it > would make all binary arithmetic have such a huge latency that basically all > cpus using yosys would no longer meet timing). that's sound logical reasoning, i like it. then that makes Ripple() borked (i checked: it's borked), which is quite serious as it's intended for use extensively in PartitionedSIMD operators and for the Vector opcodes "set-before-first" (etc). https://libre-soc.org/irclog/%23libre-soc.2022-04-05.log.html#t2022-04-05T20:34:16 can you fix Ripple() then use it here? > > relying on yosys to spot that certain gates are redundant is not sensible. > > it's perfectly sensible since the xor gates have unused outputs therefore > they are trivially removed by `opt` (after the addition is converted to > individual gates). yep. looks sound logic to me, there, too.

(In reply to Jacob Lifshay from comment #13) > (In reply to Luke Kenneth Casson Leighton from comment #5) > > comb += addend1.eq(a ^ b) > > comb += addend2.eq(a | b) > > I explicitly have it more complex so I can explain all the cases and to > match 1:1 with the reference algorithm. that could be achieved with a comment (not too big! the code's really elegant and small without the case statement. something like: """ for i in range(self.width): for each bit a[i] b[i] set addend1[i], addend2[i] case 1,1 both have no leading zeros so far, so set carry 1,1 case 0/1 1/0 different number of leading zeros, so clear carry 0,0 case 0,0 propagate results from lower bits, so set 1,0 this is basically equivalent to: addend1 = a^b and addend2 = a|b """ does that look reasonably obvious?

(In reply to Luke Kenneth Casson Leighton from comment #18) > > that could be achieved with a comment (not too big! the code's really > elegant and small without the case statement. something like: > > """ > for i in range(self.width): > for each bit a[i] b[i] set addend1[i], addend2[i] > case 1,1 both have no leading zeros so far, so set carry 1,1 > case 0/1 1/0 different number of leading zeros, so clear carry 0,0 > case 0,0 propagate results from lower bits, so set 1,0 > > this is basically equivalent to: addend1 = a^b and addend2 = a|b > """ > > does that look reasonably obvious? yes, except that the correct ops are XNOR and AND, not XOR and OR

(In reply to Jacob Lifshay from comment #19) > yes, except that the correct ops are XNOR and AND, not XOR and OR Turns out the correct ops are XOR and AND...take the average of both our goofed expressions and you get the right answer :) https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=4ce94ce7618b826ff0641c73bee187a4f2f1aa95;hb=fbb65119f40b7c1c10c05e7fa023972f924fa99e#l76

much better, really short/concise. missing couple of comments though 66 part_prods = [] 67 for shift in range(self.factor_width): 68 part_prod = Signal(self.output.width, name=f"part_prod_{shift}") # what's going on, perhaps an ASCII diagram? small one! 69 mask = Repl(self.factor2[shift], self.factor_width) 70 m.d.comb += part_prod.eq((self.factor1 & mask) << shift) 71 part_prods.append(part_prod) 72 # looks like an actual addition. explain with "calculate pp # [0] ^ pp1 .... 73 output = treereduce(part_prods + self.terms, operator.xor) 74 m.d.comb += self.output.eq(output)

(In reply to Jacob Lifshay from comment #20) > Turns out the correct ops are XOR and AND...take the average of both our > goofed expressions and you get the right answer :) doh :) welcome to my world, where i write a unit test which checks for correctness then keep guessing until it passes...

Luke, in the commit: https://git.libre-soc.org/?p=nmigen-gf.git;a=commitdiff;h=5e94f55a42807721e4dfe56e0c65d9a1583c2794 the changes you made totally mess up the logic...bits in the new `different` variable are now set when the `a` and `b` inputs are the *same*. please either revert the commit (preferred) or both rename the variable to something like `same` and fix the comments. imho reverting is probably best since that way it better matches the reference algorithm.

not quite, but you are right about the name, let's strip back the diff to the bare minimum so you can see the logic is good. you need to look at each individual commit to see it rather than a combined diff because a combined diff throws too many changes together at you: - different = self.a ^ self.b - m.d.comb += addend2.eq(~different) + m.d.comb += different.eq(~(self.a ^ self.b)) so the two operations have been preserved (XOR and NOT) but the name "different" kept, which i didn't spot. what i'll do is, move the NOT into the add-carry, that way the name "different" doesn't change and it keeps to the ref code?

(In reply to Luke Kenneth Casson Leighton from comment #24) > not quite, but you are right about the name, let's strip back > the diff to the bare minimum so you can see the logic is good. > you need to look at each individual commit i did... > to see it rather > than a combined diff because a combined diff throws too many > changes together at you: > what i'll do is, move the NOT into the add-carry, that way the > name "different" doesn't change and it keeps to the ref code? sounds good. the comments may still need to be rephrased to be easier to understand after all the changes.

(In reply to Jacob Lifshay from comment #25) > (In reply to Luke Kenneth Casson Leighton from comment #24) > > not quite, but you are right about the name, let's strip back > > the diff to the bare minimum so you can see the logic is good. the logic is fine if you ignore comments and naming (like the python interpreter mostly does)...hence why the tests passed. > > you need to look at each individual commit > > i did... that's how i posted that link to the one particular commit, not to just a general diff.

(In reply to Jacob Lifshay from comment #26) > that's how i posted that link to the one particular commit, not to just a > general diff. doh. clearly i didn't read it for minimising, sorry. ~ moved into the carrysum.

(In reply to Luke Kenneth Casson Leighton from comment #27) > ~ moved into the carrysum. lgtm except iirc ~different doesn't need parenthesis -- though that doesn't matter much. just occurred to me that: > carry_in = 1 > csum.eq(both_ones + (~different) + carry_in) > self.out.eq(csum[self.width]) is equivalent to: > csum.eq(both_ones - different) > self.out.eq(csum[self.width]) is equivalent to: > self.out.eq(both_ones < different) # unsigned compare that said, imho we should leave it as the addition version

(In reply to Jacob Lifshay from comment #28) > just occurred to me that: > is equivalent to: > > self.out.eq(both_ones < different) # unsigned compare that's hilarious. i Grok that from the Power ISA ops, from Paul and Anton's microwatt work. > that said, imho we should leave it as the addition version hm hm as a "<" it intuitively makes sense to me where carry wasn't. and yosys (etc) may have an easier time. leave it with you to decide, i'm happy either way.

(In reply to Luke Kenneth Casson Leighton from comment #29) > that's hilarious. i Grok that from the Power ISA ops, > from Paul and Anton's microwatt work. :) > > that said, imho we should leave it as the addition version > > hm hm as a "<" it intuitively makes sense to me where carry > wasn't. guess an explanation of how the addition version works is in order... i was thinking in terms of a carry-look-ahead circuit where it generates G (Generate a carry) and P (Propagate a carry) signals from the input addends, and then uses a tree to compute the carry propagation -- G is exactly `both_ones`, and P is exactly `~different` (In reply to Jacob Lifshay from comment #28) > (In reply to Luke Kenneth Casson Leighton from comment #27) > > ~ moved into the carrysum. > > lgtm except iirc ~different doesn't need parenthesis -- though that doesn't > matter much. I totally forgot to check if you fixed the comment, which you didn't...merging the variables totally made the comment incomprehensible, please change all mentions of `both_ones[i]` and `different[i]` (but not the un-indexed variants) in the comment in elaborate() back to talking about how the two inputs to the add operation are set, rather than those variables.

(In reply to Jacob Lifshay from comment #30) > I totally forgot to check if you fixed the comment, which you > didn't...merging the variables totally made the comment incomprehensible, > please change all mentions of `both_ones[i]` and `different[i]` (but not the > un-indexed variants) in the comment in elaborate() back to talking about how > the two inputs to the add operation are set, rather than those variables. honestly i'm likely to mess it up, i can handle code-morph and simplification but don't know enough about the algorithm to comment it correctly.

(In reply to Luke Kenneth Casson Leighton from comment #31) > honestly i'm likely to mess it up, i can handle code-morph and simplification > but don't know enough about the algorithm to comment it correctly. ok, i'll fix it in the morning.

(In reply to Jacob Lifshay from comment #32) > (In reply to Luke Kenneth Casson Leighton from comment #31) > > honestly i'm likely to mess it up, i can handle code-morph and simplification > > but don't know enough about the algorithm to comment it correctly. > > ok, i'll fix it in the morning. fixed: https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=f6ca4e83fe85887aae425ad4b523f9801c41500b;hb=ea368079151a9844c93691df3a7256c55cc22e8b#l64

I found a fast algorithm for division of polynomials with finite-field coefficients assuming we have fast multiplication: https://people.csail.mit.edu/madhu/ST12/scribe/lect06.pdf Found it here: https://math.stackexchange.com/q/1475206/204761 this could let cldiv/clrem be expressed with a few clmul ops -- much faster than iterating over each bit of the quotient serially.

*** Bug 134 has been marked as a duplicate of this bug. ***

(In reply to Jacob Lifshay from comment #34) > I found a fast algorithm for division of polynomials with finite-field > coefficients assuming we have fast multiplication: > https://people.csail.mit.edu/madhu/ST12/scribe/lect06.pdf turns out it probably won't work for cldiv/clrem because it gains its speed through hensel lifting -- an algorithm for increasing the modulus. cldiv/clmul requires the modulus to be fixed at exactly 2 and not to vary.

I implemented a FSM for cldivrem: https://git.libre-soc.org/?p=nmigen-gf.git;a=commitdiff;h=49f1c51b676e692f1aa6964fa6e97f6af3464932 it has a parameter so you can select how many stages of the division algorithm it does per clock cycle...allowing you to do something like set that to 4 and have a 64-bit division take 16 clock cycles instead of 64.

I implemented a more efficient algorithm that has a dynamic shift at the beginning and end rather than a EqualLeadingZeroCount every step. This reduces latency to 1 layer of xor gates per step (assuming the step counter is optimized by yosys to just do 1 add per clock), rather than a full 64-bit adder per step -- allowing 8 or maybe 16 steps per clock cycle to be reasonable. I split the shifts out into their own clock cycles at the beginning and end for the FSM (they overlap with reading inputs and writing outputs), so the xor gate layers have a full clock cycle to propagate, allowing increasing the number of xor gates. I ran the comparison by running: python src/nmigen_gf/hdl/test/test_cldivrem.py -k test_64_step_8 yosys <<'EOF' read_rtlil sim_test_out/__main__.TestCLDivRemFSM.test_64_step_8/0.il flatten synth ;;; stat EOF Old algorithm: https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=2b87659b26e3063103274eb21a149a92b664a51a modified to add 64-bit 8 steps per clock cycle test to TestCLDivRemFSM: def test_64_step_8(self): self.tst(CLDivRemShape(width=64, n_width=64), full=False, steps_per_clock=8) Number of cells: 10743 $_ANDNOT_ 1960 $_AND_ 187 $_MUX_ 3185 $_NAND_ 457 $_NOR_ 903 $_NOT_ 458 $_ORNOT_ 416 $_OR_ 1824 $_SDFF_PP0_ 320 $_SDFF_PP1_ 1 $_XNOR_ 469 $_XOR_ 563 New algorithm: https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=e59b7ccf6c066fc0ecac6410e9b6447d5af77533 Number of cells: 5046 $_ANDNOT_ 302 $_AND_ 67 $_MUX_ 3283 $_NAND_ 12 $_NOR_ 77 $_NOT_ 323 $_ORNOT_ 61 $_OR_ 58 $_SDFFE_PP0P_ 72 $_SDFF_PP0_ 197 $_SDFF_PP1_ 1 $_XNOR_ 149 $_XOR_ 444 I don't know of a fast and easy way to get yosys to output latency numbers, so I'm not testing that.

(In reply to Jacob Lifshay from comment #38) > Number of cells: 5046 woo, halved. > I don't know of a fast and easy way to get yosys to output latency numbers, > so I'm not testing that. there's a command ltp (longest path?) http://yosyshq.net/yosys/cmd_ltp.html

(In reply to Luke Kenneth Casson Leighton from comment #39) > (In reply to Jacob Lifshay from comment #38) > > > Number of cells: 5046 > > woo, halved. > > > I don't know of a fast and easy way to get yosys to output latency numbers, > > so I'm not testing that. > > there's a command ltp (longest path?) > http://yosyshq.net/yosys/cmd_ltp.html ah, missed that. apparently yosys isn't optimizing the adder chain to just 1 add per clock cycle (maybe because it's in a feedback loop so it's not easy to deduce that the low bits of the step counter will always be zeros). I have an idea of how to fix that...split the step counter into two parts: 1 that counts number of clock cycles ahd 1 that counts steps within a clock cycle (not necessarily a power of 2). the within-a-clock-cycle step counter will not be stored in the saved_state register since it's known to always be zero (formal proof will assert that). the clock cycle counter should be trivially optimizable by yosys to 1 adder.

(In reply to Jacob Lifshay from comment #40) > > ah, missed that. > > apparently yosys isn't optimizing the adder chain to just 1 add per clock > cycle fixed that: commit 4ee201e6cae8475621647f7b3b9839292ed0b46f (HEAD -> master, origin/master) Author: Jacob Lifshay <programmerjake@gmail.com> Date: Thu May 5 20:10:32 2022 -0700 split step counter into clock and substep this allows substep to be completely optimized away by yosys for CLDivRemFSMStage I ran the comparison by running: python src/nmigen_gf/hdl/test/test_cldivrem.py -k test_64_step_8 yosys <<'EOF' read_rtlil sim_test_out/__main__.TestCLDivRemFSM.test_64_step_8/0.il flatten synth ;;; stat ltp -noff EOF old unsplit step algorithm: https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=e59b7ccf6c066fc0ecac6410e9b6447d5af77533 Number of cells: 5046 $_ANDNOT_ 302 $_AND_ 67 $_MUX_ 3283 $_NAND_ 12 $_NOR_ 77 $_NOT_ 323 $_ORNOT_ 61 $_OR_ 58 $_SDFFE_PP0P_ 72 $_SDFF_PP0_ 197 $_SDFF_PP1_ 1 $_XNOR_ 149 $_XOR_ 444 Longest topological path in top (length=36): New algorithm where step is split into clock and substep: https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=4ee201e6cae8475621647f7b3b9839292ed0b46f Number of cells: 4818 $_ANDNOT_ 290 $_AND_ 45 $_MUX_ 3297 $_NAND_ 6 $_NOR_ 68 $_NOT_ 196 $_ORNOT_ 41 $_OR_ 62 $_SDFFE_PP0N_ 6 $_SDFFE_PP0P_ 70 $_SDFF_PP0_ 190 $_SDFF_PP1_ 1 $_XNOR_ 136 $_XOR_ 410 Longest topological path in top (length=31):

forgot to mention I added budget estimates to comment #0

i'm going to declare this one done, as adding the instructions themselves is quite a bit more work. intel's equivalent is PCLMULQDQ. apparently power isa has "polynomial multiply". there is a 64-bit universal hash function very fast that uses it, one for a future grant to investigate. https://arxiv.org/pdf/1503.03465