Bug 784 - Implement cl* instructions for carry-less operations
Summary: Implement cl* instructions for carry-less operations
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's second ASIC
Classification: Unclassified
Component: source code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Jacob Lifshay
URL:
: 134 (view as bug list)
Depends on: 787
Blocks: 782 785
  Show dependency treegraph
 
Reported: 2022-03-21 02:58 GMT by Jacob Lifshay
Modified: 2023-11-14 01:36 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 3000
budget (EUR) for this task, excluding subtasks' budget: 3000
parent task for budget allocation: 782
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl = { amount = 1200, submitted = 2023-09-29, paid = 2023-10-04 } [jacob] amount = 1800 submitted = 2023-10-15 paid = 2023-11-10


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jacob Lifshay 2022-03-21 02:58:22 GMT
Research into carry-less multiply and other GF(2^N)) operations is needed,
as these are commonly used in cryptographic applications.
Modules have been written in nmigen with unit tests and Formal
Correctness Proofs

---

Instructions:
* clmul, clmulh, clmulr
* cldiv, clrem (maybe) -- decide what divide by zero does
* clmadd (maybe)
* cltmadd twin for fft (maybe)

Using budget estimation factor of 2.70 EUR per line of code from bug #1025

Steps (edit as needed):
* basic adaptable modules
  * CLMulAdd
    * DONE: module in nmigen-gf
    * DONE: unit test
    * DONE: formal
    * DONE: fix reference to nmutil.openpower_sv_bitmanip_in_wiki.*
    * commit c38fbbbea24085ce0d719f59bc5ec9e9d329b7eb
      * nmigen-gf.git/src/nmigen_gf/hdl/clmul.py 77 lines
      * nmigen-gf.git/src/nmigen_gf/hdl/test/test_clmul.py 110 lines
    * EUR 500
  * CLDivRem
    * DONE: see if algorithm linked in comment #34 is what's wanted
        no, it's not: comment #36
    * DONE: module in nmigen-gf
    * DONE: unit test
    * TODO: formal
    * commit c38fbbbea24085ce0d719f59bc5ec9e9d329b7eb
      * nmigen-gf.git/src/nmigen_gf/hdl/cldivrem.py 382 lines
      * nmigen-gf.git/src/nmigen_gf/hdl/test/test_cldivrem.py 294 lines
        * about half copy-paste, so only counting 150 lines
      * estimate formal test at 200 lines
    * EUR 2000 -- EUR 1500 without formal test

the following tasks have a merged budget estimate of EUR 500:
* nothing to do: add encoding of cl* to SVP64Asm class (as a 32bit op)
  * rendered unnecessary thanks to insndb
* TODO: Instruction Encodings
  * estimate at 20 lines of code/text
* TODO: add cl* to TBD pipe(s)
  * estimated equivalent to 50 lines since the rest is mostly just copying
* TODO: fu unit tests
  * estimated equivalent to 50 lines since the rest is mostly just copying
* TODO: fu formal
  * just for clmul since cldivrem is most likely only possible
    to formally prove for small word sizes -- 64-bits is likely too big
  * estimated equivalent to 50 lines since the rest is mostly just copying
Comment 1 Jacob Lifshay 2022-03-23 06:07:23 GMT
I added CLMulAdd, a combinatorial carry-less multiply-add unit that uses tree-reduction to do all the XORs in the multiply add, that way, a 64x64 clmul should have a gate delay of 7 (1 layer of and gates and 6-deep trees of xor gates).

https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=e681da8ca9d8c9ff461eba9f3ff045e40f249dc2

Here's a nice 4x4 clmul that I generated:
https://ftp.libre-soc.org/clmul_4x4.svg

Commands:
python src/nmutil/test/test_clmul.py TestCLMulAdd.test_4x4
yosys <<'EOF'
read_rtlil sim_test_out/__main__.TestCLMulAdd.test_4x4/0.il
flatten
synth
;;;
show -stretch -colors 5 -format svg -prefix clmul_4x4
EOF
Comment 2 Jacob Lifshay 2022-03-23 06:11:26 GMT
A 64x64+64 clmuladd gives:
=== top ===

   Number of wires:               8076
   Number of wire bits:          17022
   Number of public wires:          74
   Number of public wire bits:    9020
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:               8129
     $_AND_                       2731
     $_NAND_                      1365
     $_XNOR_                        63
     $_XOR_                       3970

yosys likes alternating between and/nand and xor/xnor apparently.
Comment 3 Jacob Lifshay 2022-04-05 06:03:57 BST
I started adding CLDivRem, but ran into a yosys bug:
https://github.com/YosysHQ/yosys/issues/3268

I added TestEqualLeadingZeroCount which will be used in the cldivrem algorithm for testing if two GF(2) polynomials have equal degrees. It should have a similar complexity to a binary addition (it uses binary addition internally, the extraneous xor gates from addition are optimized out by yosys).

https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482;hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0
Comment 4 Luke Kenneth Casson Leighton 2022-04-05 10:43:56 BST
(In reply to Jacob Lifshay from comment #3)
> I started adding CLDivRem, but ran into a yosys bug:
> https://github.com/YosysHQ/yosys/issues/3268

intriguing.  relying on complex switch/case, not so hot, eh? :)

> I added TestEqualLeadingZeroCount which will be used in the cldivrem
> algorithm for testing if two GF(2) polynomials have equal degrees. 

hmm hmm i was looking at this and thought, actually, finding the
minimum cntlz of two things might be more useful?

https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=gf_reference/gfpinv.py;h=45b6dbbdd4cb19bdbfa0538fb9dfcb79981655fe;hb=926798b6e232f09a36773be07de2d90e3fd431a4

> It should
> have a similar complexity to a binary addition (it uses binary addition
> internally, the extraneous xor gates from addition are optimized out by
> yosys).

are or should be? relying on complex low-level capability of yosys is
not... wise.

> https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/
> cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482;
> hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0

nextpnr-ecp5 doesn't have carry-out, and both symbiflow and
nextpnr-xilinx's use of CARRY4 are borked at the moment.
also the balancing of carry propagation is not very good in
yosys.

spot the common theme here... :)
Comment 5 Luke Kenneth Casson Leighton 2022-04-05 10:50:10 BST
  73         for i in range(self.width):
  74             with m.Switch(Cat(self.a[i], self.b[i])):
  75                 with m.Case('11'):
  76                     # both have no leading zeros so far, so set carry
  77                     m.d.comb += [
  78                         addend1[i].eq(1),
  79                         addend2[i].eq(1),
  80                     ]
  81                 with m.Case('01', '10'):
  82                     # different number of leading zeros, so clear carry
  83                     m.d.comb += [
  84                         addend1[i].eq(0),
  85                         addend2[i].eq(0),
  86                     ]
  87                 with m.Case('00'):
  88                     # propagate results from lower bits
  89                     m.d.comb += [
  90                         addend1[i].eq(1),
  91                         addend2[i].eq(0),
  92                     ]

that's just:

   comb += addend1.eq(a ^ b)
   comb += addend2.eq(a | b)
Comment 6 Luke Kenneth Casson Leighton 2022-04-05 11:42:50 BST
my intuition tells me that this may prove useful, to replace the use
of an adder.

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD

relying on yosys to spot that certain gates are redundant is not sensible.
Comment 7 Luke Kenneth Casson Leighton 2022-04-05 12:24:08 BST
(In reply to Jacob Lifshay from comment #1)
> I added CLMulAdd, a combinatorial carry-less multiply-add unit 

nice.

> that uses
> tree-reduction to do all the XORs in the multiply add, 

+class BitwiseXorReduce(Elaboratable):
+    """Bitwise Xor lots of stuff together by using tree-reduction on each bit.
+
+    Properties:

if it is the same can you please replace that with the tree_reduce function i 
created 18 months ago?

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/util.py;h=6f22179eff5ae8078d45f2a0b001a4f6f2045401;hb=e681da8ca9d8c9ff461eba9f3ff045e40f249dc2#l20

code duplication turns into a maintenance nightmare very quickly.

also that's not actual tree-reduction, it's a chain that assumes yosys will
sort it.  i have said multiple times for over 18 months not to do that.

> Here's a nice 4x4 clmul that I generated:
> https://ftp.libre-soc.org/clmul_4x4.svg

it's so prettyyy, fascinating how it creates a mix of NAND AND and XOR.
Comment 8 Luke Kenneth Casson Leighton 2022-04-05 12:40:35 BST
  19 class BitwiseXorReduce(Elaboratable):
  29     def __init__(self, input_values):
  30         self.input_values = tuple(map(Value.cast, input_values))

this is confusing as hell and so non-standard it will freak people out.

  89     def __reduce_inputs(self):
  90         for shift in range(self.factor_width):
  91             mask = Repl(self.factor2[shift], self.factor_width)
  92             yield (self.factor1 & mask) << shift
  93         yield from self.terms
  94 
  95     def elaborate(self, platform):
  96         m = Module()
  97         xor_reduce = BitwiseXorReduce(self.__reduce_inputs())

yeah don't use non-standard styles like this please.  have BitwiseXorReduce
either explicitly set up Signals as inputs, pass in its list of Signals,
or pass in a list of Signal lengths.

or, better, just a width and an array quantity.  then the zero-extension
becomes irrelevant / redundant and can be removed.

the assignment to the inputs can be done explicitly by enumerating
self.reduce_inputs() and comb assigning thrm one by one to
the xor_reduce.input_values using a straightforward zip()

for i1, i2 in zip(xor_reduce.input_values, self.reduce_imputs()):
    comb += i2.eq(i2)

also add a docstring to reduce_inputs to let people know it's doing
a shift-and-mask, but explain it in terms of why factor2 was set up
rather thab repeat what the code actually does.
Comment 9 Luke Kenneth Casson Leighton 2022-04-05 12:48:59 BST
  95     def elaborate(self, platform):
  96         m = Module()
  97         xor_reduce = BitwiseXorReduce(self.__reduce_inputs())
  98         m.submodules.xor_reduce = xor_reduce
  99         m.d.comb += self.output.eq(xor_reduce.output)
 100         return m

yeah looking at the comment in line 26 that can be entirely
replaced with

from nmutil.util imprt treereduce

     comb += treereduce(self.reduce_inputs(), operator.xor)

please delete BitwiseXorReduce entirely.  if the HDL generated
is complex (unreadable in show top)
then have self.reduce_inputs() store its yielded terms
in intermediary Signals (pass m as a parameter to reduce_inputs
in order to do so)
Comment 10 Jacob Lifshay 2022-04-05 18:33:15 BST
(In reply to Luke Kenneth Casson Leighton from comment #6)
> my intuition tells me that this may prove useful, to replace the use
> of an adder.
> 
> https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD

actually that ripple logic is exactly what I was trying to avoid...it would generate a 64-deep chain of gates. binary addition is optimized by yosys to use carry look-ahead (they can be depended on not to break that since it would make all binary arithmetic have such a huge latency that basically all cpus using yosys would no longer meet timing).

> 
> relying on yosys to spot that certain gates are redundant is not sensible.

it's perfectly sensible since the xor gates have unused outputs therefore they are trivially removed by `opt` (after the addition is converted to individual gates). `abc` (run by synth -- abc is a totally different project that couldn't realistically be borked by yosys's developers too) also has passes to remove unused gates.
Comment 11 Jacob Lifshay 2022-04-05 18:50:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
> > I started adding CLDivRem, but ran into a yosys bug:
> > https://github.com/YosysHQ/yosys/issues/3268
> 
> intriguing.  relying on complex switch/case, not so hot, eh? :)

the complex switch case came from the dumb-and-obvious method i use in the formal proof in TestEqualLeadingZeroCount. apparently yosys tries to create a hashmap with an entry for each possible input value -- exponential in input size.

> > I added TestEqualLeadingZeroCount which will be used in the cldivrem
> > algorithm for testing if two GF(2) polynomials have equal degrees. 
> 
> hmm hmm i was looking at this and thought, actually, finding the
> minimum cntlz of two things might be more useful?

that would be useful (or actually the count trailing zeros variant), but not for cldivrem. the minimum cnttz is simply cnttz(a | b), so we don't need a separate implementation of that.

> https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=gf_reference/gfpinv.py;
> h=45b6dbbdd4cb19bdbfa0538fb9dfcb79981655fe;
> hb=926798b6e232f09a36773be07de2d90e3fd431a4
> 
> > It should
> > have a similar complexity to a binary addition (it uses binary addition
> > internally, the extraneous xor gates from addition are optimized out by
> > yosys).
> 
> are or should be? relying on complex low-level capability of yosys is
> not... wise.
> 
> > https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/
> > cldivrem.py;h=9a89c43046e2535fb272f47efefe51fb256db482;
> > hb=423b3a3279e3637614a614ec7038ebaa7cabd6c0
> 
> nextpnr-ecp5 doesn't have carry-out, and both symbiflow and
> nextpnr-xilinx's use of CARRY4 are borked at the moment.
> also the balancing of carry propagation is not very good in
> yosys.

it's waay better than a 64-deep chain of gates, otherwise all cpus using yosys would fail timing.
Comment 12 Jacob Lifshay 2022-04-05 18:51:58 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
>      comb += treereduce(self.reduce_inputs(), operator.xor)

yeah, that's simpler.
Comment 13 Jacob Lifshay 2022-04-05 18:53:37 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
>   73         for i in range(self.width):
>   74             with m.Switch(Cat(self.a[i], self.b[i])):
>   75                 with m.Case('11'):
>   76                     # both have no leading zeros so far, so set carry
>   77                     m.d.comb += [
>   78                         addend1[i].eq(1),
>   79                         addend2[i].eq(1),
>   80                     ]
>   81                 with m.Case('01', '10'):
>   82                     # different number of leading zeros, so clear carry
>   83                     m.d.comb += [
>   84                         addend1[i].eq(0),
>   85                         addend2[i].eq(0),
>   86                     ]
>   87                 with m.Case('00'):
>   88                     # propagate results from lower bits
>   89                     m.d.comb += [
>   90                         addend1[i].eq(1),
>   91                         addend2[i].eq(0),
>   92                     ]
> 
> that's just:
> 
>    comb += addend1.eq(a ^ b)
>    comb += addend2.eq(a | b)

I explicitly have it more complex so I can explain all the cases and to match 1:1 with the reference algorithm.
Comment 14 Jacob Lifshay 2022-04-05 19:20:31 BST
Luke, you broke the code:
https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438

you keep complaining I never run the tests...apparently I'm not the only one who forgets.
Comment 15 Luke Kenneth Casson Leighton 2022-04-05 20:28:55 BST
(In reply to Jacob Lifshay from comment #14)
> Luke, you broke the code:
> https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438
> 
> you keep complaining I never run the tests...apparently I'm not the only one
> who forgets.

i didn't forget: running anything (at all) fails due to the
local imports.

$ python3 gf_reference/test_cl_gfb_gfp.py 
Traceback (most recent call last):
  File "gf_reference/test_cl_gfb_gfp.py", line 1, in <module>
    from .state import ST
ModuleNotFoundError: No module named '__main__.state'; '__main__' is not a package
Comment 16 Jacob Lifshay 2022-04-05 20:37:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Jacob Lifshay from comment #14)
> > Luke, you broke the code:
> > https://salsa.debian.org/Kazan-team/mirrors/nmigen-gf/-/jobs/2642438
> > 
> > you keep complaining I never run the tests...apparently I'm not the only one
> > who forgets.
> 
> i didn't forget: running anything (at all) fails due to the
> local imports.
> 
> $ python3 gf_reference/test_cl_gfb_gfp.py 
> Traceback (most recent call last):
>   File "gf_reference/test_cl_gfb_gfp.py", line 1, in <module>
>     from .state import ST
> ModuleNotFoundError: No module named '__main__.state'; '__main__' is not a
> package

run it using:
pytest -n auto src

just like in .gitlab-ci.yml

or, if you don't want to use pytest:
python3 -m unittest src/nmigen_gf/reference/test_cl_gfb_gfp.py
Comment 17 Luke Kenneth Casson Leighton 2022-04-05 20:44:16 BST
(In reply to Jacob Lifshay from comment #10)
> (In reply to Luke Kenneth Casson Leighton from comment #6)
> > my intuition tells me that this may prove useful, to replace the use
> > of an adder.
> > 
> > https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/ripple.py;hb=HEAD
> 
> actually that ripple logic is exactly what I was trying to avoid...it would
> generate a 64-deep chain of gates. binary addition is optimized by yosys to
> use carry look-ahead (they can be depended on not to break that since it
> would make all binary arithmetic have such a huge latency that basically all
> cpus using yosys would no longer meet timing).

that's sound logical reasoning, i like it.

then that makes Ripple() borked (i checked: it's borked), which is quite
serious as it's intended for use extensively in PartitionedSIMD operators
and for the Vector opcodes "set-before-first" (etc).
https://libre-soc.org/irclog/%23libre-soc.2022-04-05.log.html#t2022-04-05T20:34:16

can you fix Ripple() then use it here?

> > relying on yosys to spot that certain gates are redundant is not sensible.
> 
> it's perfectly sensible since the xor gates have unused outputs therefore
> they are trivially removed by `opt` (after the addition is converted to
> individual gates).

yep. looks sound logic to me, there, too.
Comment 18 Luke Kenneth Casson Leighton 2022-04-05 21:15:19 BST
(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> >    comb += addend1.eq(a ^ b)
> >    comb += addend2.eq(a | b)
> 
> I explicitly have it more complex so I can explain all the cases and to
> match 1:1 with the reference algorithm.

that could be achieved with a comment (not too big! the code's really
elegant and small without the case statement.  something like:

"""
for i in range(self.width):
   for each bit a[i] b[i] set addend1[i], addend2[i]
      case 1,1     both have no leading zeros so far, so set carry 1,1
      case 0/1 1/0  different number of leading zeros, so clear carry 0,0
      case 0,0      propagate results from lower bits, so set 1,0

this is basically equivalent to: addend1 = a^b and addend2 = a|b
"""

does that look reasonably obvious?
Comment 19 Jacob Lifshay 2022-04-05 21:18:45 BST
(In reply to Luke Kenneth Casson Leighton from comment #18)
> 
> that could be achieved with a comment (not too big! the code's really
> elegant and small without the case statement.  something like:
> 
> """
> for i in range(self.width):
>    for each bit a[i] b[i] set addend1[i], addend2[i]
>       case 1,1     both have no leading zeros so far, so set carry 1,1
>       case 0/1 1/0  different number of leading zeros, so clear carry 0,0
>       case 0,0      propagate results from lower bits, so set 1,0
> 
> this is basically equivalent to: addend1 = a^b and addend2 = a|b
> """
> 
> does that look reasonably obvious?

yes, except that the correct ops are XNOR and AND, not XOR and OR
Comment 20 Jacob Lifshay 2022-04-07 05:25:02 BST
(In reply to Jacob Lifshay from comment #19)
> yes, except that the correct ops are XNOR and AND, not XOR and OR

Turns out the correct ops are XOR and AND...take the average of both our goofed expressions and you get the right answer :)

https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=4ce94ce7618b826ff0641c73bee187a4f2f1aa95;hb=fbb65119f40b7c1c10c05e7fa023972f924fa99e#l76
Comment 21 Luke Kenneth Casson Leighton 2022-04-07 14:41:33 BST
much better, really short/concise. missing couple of comments though


  66         part_prods = []
  67         for shift in range(self.factor_width):
  68             part_prod = Signal(self.output.width, name=f"part_prod_{shift}")
                 # what's going on, perhaps an ASCII diagram? small one!
  69             mask = Repl(self.factor2[shift], self.factor_width)
  70             m.d.comb += part_prod.eq((self.factor1 & mask) << shift)
  71             part_prods.append(part_prod)
  72 
             # looks like an actual addition. explain with "calculate pp
             # [0] ^ pp1 ....
  73         output = treereduce(part_prods + self.terms, operator.xor)
  74         m.d.comb += self.output.eq(output)
Comment 22 Luke Kenneth Casson Leighton 2022-04-07 15:32:39 BST
(In reply to Jacob Lifshay from comment #20)

> Turns out the correct ops are XOR and AND...take the average of both our
> goofed expressions and you get the right answer :)

doh :) welcome to my world, where i write a unit test
which checks for correctness then keep guessing until
it passes...
Comment 23 Jacob Lifshay 2022-04-08 08:02:33 BST
Luke, in the commit:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commitdiff;h=5e94f55a42807721e4dfe56e0c65d9a1583c2794

the changes you made totally mess up the logic...bits in the new `different` variable are now set when the `a` and `b` inputs are the *same*. please either revert the commit (preferred) or both rename the variable to something like `same` and fix the comments.

imho reverting is probably best since that way it better matches the reference algorithm.
Comment 24 Luke Kenneth Casson Leighton 2022-04-08 09:11:14 BST
not quite, but you are right about the name, let's strip back
the diff to the bare minimum so you can see the logic is good.
you need to look at each individual commit to see it rather
than a combined diff because a combined diff throws too many
changes together at you:

-        different = self.a ^ self.b
-        m.d.comb += addend2.eq(~different)
+        m.d.comb += different.eq(~(self.a ^ self.b))

so the two operations have been preserved (XOR and NOT)
but the name "different" kept, which i didn't spot.

what i'll do is, move the NOT into the add-carry, that way the
name "different" doesn't change and it keeps to the ref code?
Comment 25 Jacob Lifshay 2022-04-08 09:15:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> not quite, but you are right about the name, let's strip back
> the diff to the bare minimum so you can see the logic is good.
> you need to look at each individual commit

i did...

> to see it rather
> than a combined diff because a combined diff throws too many
> changes together at you:

> what i'll do is, move the NOT into the add-carry, that way the
> name "different" doesn't change and it keeps to the ref code?

sounds good. the comments may still need to be rephrased to be easier to understand after all the changes.
Comment 26 Jacob Lifshay 2022-04-08 09:19:00 BST
(In reply to Jacob Lifshay from comment #25)
> (In reply to Luke Kenneth Casson Leighton from comment #24)
> > not quite, but you are right about the name, let's strip back
> > the diff to the bare minimum so you can see the logic is good.

the logic is fine if you ignore comments and naming (like the python interpreter mostly does)...hence why the tests passed.

> > you need to look at each individual commit
> 
> i did...

that's how i posted that link to the one particular commit, not to just a general diff.
Comment 27 Luke Kenneth Casson Leighton 2022-04-08 09:31:29 BST
(In reply to Jacob Lifshay from comment #26)

> that's how i posted that link to the one particular commit, not to just a
> general diff.

doh. clearly i didn't read it for minimising, sorry.
~ moved into the carrysum.
Comment 28 Jacob Lifshay 2022-04-08 09:42:42 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> ~ moved into the carrysum.

lgtm except iirc ~different doesn't need parenthesis -- though that doesn't matter much.

just occurred to me that:
> carry_in = 1
> csum.eq(both_ones + (~different) + carry_in)
> self.out.eq(csum[self.width])

is equivalent to:
> csum.eq(both_ones - different)
> self.out.eq(csum[self.width])

is equivalent to:
> self.out.eq(both_ones < different) # unsigned compare

that said, imho we should leave it as the addition version
Comment 29 Luke Kenneth Casson Leighton 2022-04-08 10:12:40 BST
(In reply to Jacob Lifshay from comment #28)
 
> just occurred to me that:
 
> is equivalent to:
> > self.out.eq(both_ones < different) # unsigned compare

that's hilarious. i Grok that from the Power ISA ops,
from Paul and Anton's microwatt work.
 
> that said, imho we should leave it as the addition version

hm hm as a "<" it intuitively makes sense to me where carry
wasn't. and yosys (etc) may have an easier time. leave it
with you to decide, i'm happy either way.
Comment 30 Jacob Lifshay 2022-04-08 10:35:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)
> that's hilarious. i Grok that from the Power ISA ops,
> from Paul and Anton's microwatt work.

:)

> > that said, imho we should leave it as the addition version
> 
> hm hm as a "<" it intuitively makes sense to me where carry
> wasn't.

guess an explanation of how the addition version works is in order...

i was thinking in terms of a carry-look-ahead circuit where it generates G (Generate a carry) and P (Propagate a carry) signals from the input addends, and then uses a tree to compute the carry propagation -- G is exactly `both_ones`, and P is exactly `~different`

(In reply to Jacob Lifshay from comment #28)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > ~ moved into the carrysum.
> 
> lgtm except iirc ~different doesn't need parenthesis -- though that doesn't
> matter much.

I totally forgot to check if you fixed the comment, which you didn't...merging the variables totally made the comment incomprehensible, please change all mentions of `both_ones[i]` and `different[i]` (but not the un-indexed variants) in the comment in elaborate() back to talking about how the two inputs to the add operation are set, rather than those variables.
Comment 31 Luke Kenneth Casson Leighton 2022-04-08 10:40:42 BST
(In reply to Jacob Lifshay from comment #30)

> I totally forgot to check if you fixed the comment, which you
> didn't...merging the variables totally made the comment incomprehensible,
> please change all mentions of `both_ones[i]` and `different[i]` (but not the
> un-indexed variants) in the comment in elaborate() back to talking about how
> the two inputs to the add operation are set, rather than those variables.

honestly i'm likely to mess it up, i can handle code-morph and simplification
but don't know enough about the algorithm to comment it correctly.
Comment 32 Jacob Lifshay 2022-04-08 10:43:11 BST
(In reply to Luke Kenneth Casson Leighton from comment #31)
> honestly i'm likely to mess it up, i can handle code-morph and simplification
> but don't know enough about the algorithm to comment it correctly.

ok, i'll fix it in the morning.
Comment 33 Jacob Lifshay 2022-04-08 23:59:22 BST
(In reply to Jacob Lifshay from comment #32)
> (In reply to Luke Kenneth Casson Leighton from comment #31)
> > honestly i'm likely to mess it up, i can handle code-morph and simplification
> > but don't know enough about the algorithm to comment it correctly.
> 
> ok, i'll fix it in the morning.

fixed:
https://git.libre-soc.org/?p=nmigen-gf.git;a=blob;f=src/nmigen_gf/hdl/cldivrem.py;h=f6ca4e83fe85887aae425ad4b523f9801c41500b;hb=ea368079151a9844c93691df3a7256c55cc22e8b#l64
Comment 34 Jacob Lifshay 2022-05-02 10:14:07 BST
I found a fast algorithm for division of polynomials with finite-field coefficients assuming we have fast multiplication:
https://people.csail.mit.edu/madhu/ST12/scribe/lect06.pdf

Found it here:
https://math.stackexchange.com/q/1475206/204761

this could let cldiv/clrem be expressed with a few clmul ops -- much faster than iterating over each bit of the quotient serially.
Comment 35 Jacob Lifshay 2022-05-02 10:18:19 BST
*** Bug 134 has been marked as a duplicate of this bug. ***
Comment 36 Jacob Lifshay 2022-05-04 03:18:19 BST
(In reply to Jacob Lifshay from comment #34)
> I found a fast algorithm for division of polynomials with finite-field
> coefficients assuming we have fast multiplication:
> https://people.csail.mit.edu/madhu/ST12/scribe/lect06.pdf

turns out it probably won't work for cldiv/clrem because it gains its speed through hensel lifting -- an algorithm for increasing the modulus. cldiv/clmul requires the modulus to be fixed at exactly 2 and not to vary.
Comment 37 Jacob Lifshay 2022-05-04 07:37:11 BST
I implemented a FSM for cldivrem:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commitdiff;h=49f1c51b676e692f1aa6964fa6e97f6af3464932

it has a parameter so you can select how many stages of the division algorithm it does per clock cycle...allowing you to do something like set that to 4 and have a 64-bit division take 16 clock cycles instead of 64.
Comment 38 Jacob Lifshay 2022-05-05 09:06:26 BST
I implemented a more efficient algorithm that has a dynamic shift at the beginning and end rather than a EqualLeadingZeroCount every step. This reduces latency to 1 layer of xor gates per step (assuming the step counter is optimized by yosys to just do 1 add per clock), rather than a full 64-bit adder per step -- allowing 8 or maybe 16 steps per clock cycle to be reasonable. I split the shifts out into their own clock cycles at the beginning and end for the FSM (they overlap with reading inputs and writing outputs), so the xor gate layers have a full clock cycle to propagate, allowing increasing the number of xor gates.

I ran the comparison by running:
python src/nmigen_gf/hdl/test/test_cldivrem.py -k test_64_step_8
yosys <<'EOF'                                                   
read_rtlil sim_test_out/__main__.TestCLDivRemFSM.test_64_step_8/0.il
flatten
synth
;;;
stat
EOF


Old algorithm:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=2b87659b26e3063103274eb21a149a92b664a51a

modified to add 64-bit 8 steps per clock cycle test to TestCLDivRemFSM:
    def test_64_step_8(self):
        self.tst(CLDivRemShape(width=64, n_width=64),
                 full=False,
                 steps_per_clock=8)

   Number of cells:              10743
     $_ANDNOT_                    1960
     $_AND_                        187
     $_MUX_                       3185
     $_NAND_                       457
     $_NOR_                        903
     $_NOT_                        458
     $_ORNOT_                      416
     $_OR_                        1824
     $_SDFF_PP0_                   320
     $_SDFF_PP1_                     1
     $_XNOR_                       469
     $_XOR_                        563

New algorithm:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=e59b7ccf6c066fc0ecac6410e9b6447d5af77533

   Number of cells:               5046
     $_ANDNOT_                     302
     $_AND_                         67
     $_MUX_                       3283
     $_NAND_                        12
     $_NOR_                         77
     $_NOT_                        323
     $_ORNOT_                       61
     $_OR_                          58
     $_SDFFE_PP0P_                  72
     $_SDFF_PP0_                   197
     $_SDFF_PP1_                     1
     $_XNOR_                       149
     $_XOR_                        444

I don't know of a fast and easy way to get yosys to output latency numbers, so I'm not testing that.
Comment 39 Luke Kenneth Casson Leighton 2022-05-05 11:34:44 BST
(In reply to Jacob Lifshay from comment #38)

>    Number of cells:               5046

woo, halved.

> I don't know of a fast and easy way to get yosys to output latency numbers,
> so I'm not testing that.

there's a command ltp (longest path?)
http://yosyshq.net/yosys/cmd_ltp.html
Comment 40 Jacob Lifshay 2022-05-05 18:03:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #39)
> (In reply to Jacob Lifshay from comment #38)
> 
> >    Number of cells:               5046
> 
> woo, halved.
> 
> > I don't know of a fast and easy way to get yosys to output latency numbers,
> > so I'm not testing that.
> 
> there's a command ltp (longest path?)
> http://yosyshq.net/yosys/cmd_ltp.html

ah, missed that.

apparently yosys isn't optimizing the adder chain to just 1 add per clock cycle (maybe because it's in a feedback loop so it's not easy to deduce that the low bits of the step counter will always be zeros). I have an idea of how to fix that...split the step counter into two parts: 1 that counts number of clock cycles ahd 1 that counts steps within a clock cycle (not necessarily a power of 2). the within-a-clock-cycle step counter will not be stored in the saved_state register since it's known to always be zero (formal proof will assert that). the clock cycle counter should be trivially optimizable by yosys to 1 adder.
Comment 41 Jacob Lifshay 2022-05-06 04:24:42 BST
(In reply to Jacob Lifshay from comment #40)
> 
> ah, missed that.
> 
> apparently yosys isn't optimizing the adder chain to just 1 add per clock
> cycle

fixed that:
commit 4ee201e6cae8475621647f7b3b9839292ed0b46f (HEAD -> master, origin/master)
Author: Jacob Lifshay <programmerjake@gmail.com>
Date:   Thu May 5 20:10:32 2022 -0700

    split step counter into clock and substep
    
    this allows substep to be completely optimized away by yosys for CLDivRemFSMStage

I ran the comparison by running:
python src/nmigen_gf/hdl/test/test_cldivrem.py -k test_64_step_8
yosys <<'EOF'                                                   
read_rtlil sim_test_out/__main__.TestCLDivRemFSM.test_64_step_8/0.il
flatten
synth
;;;
stat
ltp -noff
EOF

old unsplit step algorithm:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=e59b7ccf6c066fc0ecac6410e9b6447d5af77533

   Number of cells:               5046
     $_ANDNOT_                     302
     $_AND_                         67
     $_MUX_                       3283
     $_NAND_                        12
     $_NOR_                         77
     $_NOT_                        323
     $_ORNOT_                       61
     $_OR_                          58
     $_SDFFE_PP0P_                  72
     $_SDFF_PP0_                   197
     $_SDFF_PP1_                     1
     $_XNOR_                       149
     $_XOR_                        444

Longest topological path in top (length=36):

New algorithm where step is split into clock and substep:
https://git.libre-soc.org/?p=nmigen-gf.git;a=commit;h=4ee201e6cae8475621647f7b3b9839292ed0b46f

   Number of cells:               4818
     $_ANDNOT_                     290
     $_AND_                         45
     $_MUX_                       3297
     $_NAND_                         6
     $_NOR_                         68
     $_NOT_                        196
     $_ORNOT_                       41
     $_OR_                          62
     $_SDFFE_PP0N_                   6
     $_SDFFE_PP0P_                  70
     $_SDFF_PP0_                   190
     $_SDFF_PP1_                     1
     $_XNOR_                       136
     $_XOR_                        410

Longest topological path in top (length=31):
Comment 42 Jacob Lifshay 2023-09-08 03:53:20 BST
forgot to mention I added budget estimates to comment #0
Comment 43 Luke Kenneth Casson Leighton 2023-09-23 21:27:54 BST
i'm going to declare this one done, as adding the instructions
themselves is quite a bit more work. intel's equivalent
is PCLMULQDQ. apparently power isa has "polynomial multiply".

there is a 64-bit universal hash function very fast that uses it,
one for a future grant to investigate.
https://arxiv.org/pdf/1503.03465