Bug 208 - implement CORDIC in a general way sufficient to do transcendentals
Summary: implement CORDIC in a general way sufficient to do transcendentals
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU)
Version: unspecified
Hardware: Other
OS: Linux
Importance: --- enhancement
Assignee: Michael Nolan
URL:
Depends on:
Blocks: 53 127
Reported: 2020-03-04 18:35 GMT by Luke Kenneth Casson Leighton
Modified: 2020-09-21 17:27 BST (History)
CC: 3 users

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 750
budget (EUR) for this task, excluding subtasks' budget: 750
parent task for budget allocation: 53
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments
found random source code (python) for cordic (1.68 KB, text/x-python)
2020-03-18 13:10 GMT, Luke Kenneth Casson Leighton

Description Luke Kenneth Casson Leighton 2020-03-04 18:35:42 GMT
see https://libre-soc.org/resources/ page for more links to CORDIC

see bug #127 for research: CORDIC is a binary-search-style shift-and-add algorithm that can compute a huge number of functions: SIN, COS, LOG and more.

it will need to be pipelined, split out in a similar fashion to the FPDIV pipeline.

it would be good to have a multi-radix version which, instead of a single bit per combinatorial block, will do N bits by generating a cascade of adds and comparators; hopefully this will be more efficient than a cascade of combinatorial single-bit blocks.
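a minimal fixed-point rotation-mode CORDIC sketch in python (the names, the 32-bit width and the iteration count here are illustrative assumptions, not the eventual nmigen implementation) shows the single-bit-per-iteration structure that each combinatorial block would implement:

```python
import math

def cordic_sincos(theta, bits=32, iters=32):
    """Fixed-point rotation-mode CORDIC returning (sin, cos).

    theta is in radians, |theta| < ~1.74 (the CORDIC convergence range).
    """
    scale = 1 << (bits - 2)           # fixed-point scaling factor
    # atan(2**-i) table in the same fixed-point format (the "ATAN tables")
    atans = [int(round(math.atan(2.0 ** -i) * scale)) for i in range(iters)]
    # CORDIC gain: each rotation grows the vector by sqrt(1 + 2**-2i),
    # so start x at K = prod(1/sqrt(1 + 2**-2i)) to cancel the growth
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = int(round(K * scale)), 0, int(round(theta * scale))
    for i in range(iters):            # one combinatorial stage per iteration
        if z >= 0:                    # drive the residual angle z to zero
            x, y, z = x - (y >> i), y + (x >> i), z - atans[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + atans[i]
    return y / scale, x / scale       # (sin(theta), cos(theta))
```

each loop iteration corresponds to one single-bit combinatorial stage; a multi-radix version would unroll N of these per block.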

the ATAN2 tables are going to need to be big enough to do 64 bit sin/cos etc.  they could get huge. they will be ROM but still huge.

although it would be nice to consider having the tables in exponent-mantissa form, thus maintaining extra accuracy without wasting bits by needing really large adders, it may be a good idea to add that as a secondary phase.

also nice to have: complex numbers.  these make certain other algorithms also possible to do.  again, a tertiary phase.

the input and output should be done as integers only, "scaled"; the first phase is to demonstrate (by converting the inputs to integers in the unit tests) that sin and cos work.

*after* that, the next phase will be to make an IEEE754 FP cos and sin, plus unit tests.  this as a sub-bug of this one.

then, other IEEE754 functions can be added as well.

this is a simple CORDIC example which does sin and cos in myhdl; the nmigen code should be near-identical in the inner loop:
http://www.myhdl.org/docs/examples/sinecomp/
Comment 1 Luke Kenneth Casson Leighton 2020-03-08 00:28:09 GMT
btw michael, we did FPDIV in separate passes as well (took a couple of months).  jacob researched algorithms that would cover div, sqrt and rsqrt, and did them as a simple integer only pipeline.  then i hooked that up to the FP infrastructure and it worked well.

i can recommend an incremental approach here as well: start from something small such as converting the myhdl code to nmigen in order to avoid bogging down with the FP infrastructure *and* with the many variations of CORDIC.

it is only around 70 lines of code so very basic: don't even pipeline it at first, make it an FSM with a counter.  then once that is achieved, move on to the next small incremental improvement.
Comment 2 Luke Kenneth Casson Leighton 2020-03-18 13:10:44 GMT
Created attachment 45 [details]
found random source code (python) for cordic

last year i did a lot of exploration of CORDIC, and found this in a working tree, not checked in.  i added a "diff" to the original, so it can be seen to be accurate compared to python math.

i think it might be a good idea to try it against python bigfloat, to see if higher accuracy on the atan table creates better accuracy.

the number of iterations required is quite enormous: *45* iterations to get 14 decimal places of accuracy.  however, doing more iterations doesn't really help (100 iterations only gets 16-17 decimal places), so i'm wondering how much the accuracy of the atan tables affects the output, and that means using bigfloat.
Comment 3 Michael Nolan 2020-04-13 15:35:08 BST
I was able to translate the linked myhdl code to nmigen and implement both a state machine based and pipelined cordic here: https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/cordic;h=3f06900baa42651410d661739fb616e4b3020c52;hb=77f150cff440bed025e2487b6f0fcda9f529290b

You can configure the width, as well as the number of cordic rounds per pipeline stage. There are some accuracy issues with it though - it exactly matches the behavior of the myhdl one, but the results it gives can be off by 1% or so from that obtained using sin() or cos() on 16 bit numbers.
Comment 4 Luke Kenneth Casson Leighton 2020-04-13 18:23:10 BST
(In reply to Michael Nolan from comment #3)
> I was able to translate the linked myhdl code to nmigen and implement both a
> state machine based and pipelined cordic here:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/cordic;
> h=3f06900baa42651410d661739fb616e4b3020c52;
> hb=77f150cff440bed025e2487b6f0fcda9f529290b

cool!  well that was easy :)

> You can configure the width, as well as the number of cordic rounds per
> pipeline stage. There are some accuracy issues with it though - it exactly
> matches the behavior of the myhdl one, but the results it gives can be off
> by 1% or so from that obtained using sin() or cos() on 16 bit numbers.

yes.  i believe it's well-known that to get 16-bit accuracy you need
32-bit registers.

i made a couple of corrections btw - frickin well done for working out
the pipeline API with zero assistance - that's deeply impressive given
that, other than the docstring comments, there's no "documentation" on
it.

CordicInitialData needed to "yield self.z0" (not self.z)

"yield from" is reserved for yielding *from* something that is itself
iterable (such as another generator).  self.z0 is in fact iterable,
so what would be returned by "yield from" would be first self.z0[0],
then on the next iteration self.z0[1] and so on, which would be very
painful.

"yield" just returns "that object".
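a tiny self-contained illustration of the difference (FakeSignal is a hypothetical stand-in for an nmigen Signal, which is itself iterable; the class names here are made up):

```python
class FakeSignal:
    """Hypothetical stand-in for an nmigen Signal: iterable over its bits."""
    def __init__(self, name):
        self.name = name
    def __iter__(self):
        # pretend the Signal has two bits
        return iter([self.name + "[0]", self.name + "[1]"])

class RightInitialData:
    """The fix: yield the Signal object itself."""
    def __init__(self):
        self.z0 = FakeSignal("z0")
    def __iter__(self):
        yield self.z0               # "yield" returns *that object*

class WrongInitialData(RightInitialData):
    """The bug: "yield from" iterates *into* the Signal, element by element."""
    def __iter__(self):
        yield from self.z0          # yields z0[0], then z0[1]...
```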

however... we have a bit of a problem back in nmutil.iocontrol NextControl
and PrevControl iteration, in that it can't actually properly work out
if the *object* that you are iterating should be yielded or yielded *from*.
so i added a hack "check if data has a ports function".

however i've just right now worked out that we could detect if the object
is iterable with isinstance(self.data_o, Iterable).
so that's just been committed.

i also added an rtlil test because i was fascinated to see what the
graphviz looks like.


--- a/src/ieee754/cordic/pipe_data.py
+++ b/src/ieee754/cordic/pipe_data.py
@@ -10,11 +10,14 @@ class CordicInitialData:
         self.z0 = Signal(range(-ZMAX, ZMAX), name="z")     # denormed result
 
     def __iter__(self):
-        yield from self.z
+        yield self.z0
 
     def eq(self, i):
         return [self.z0.eq(i.z0)]
 
 class CordicData:
 
@@ -27,9 +30,12 @@ class CordicData:
         self.z = Signal(range(-ZMAX, ZMAX), name="z")     # denormed result
 
     def __iter__(self):
-        yield from self.x
-        yield from self.y
-        yield from self.z
+        yield self.x
+        yield self.y
+        yield self.z
 
     def eq(self, i):
         ret = [self.z.eq(i.z), self.x.eq(i.x), self.y.eq(i.y)]
diff --git a/src/ieee754/cordic/test/test_pipe.py b/src/ieee754/cordic/test/test_pipe.py
index 880351a..b7eaf8a 100644
--- a/src/ieee754/cordic/test/test_pipe.py
+++ b/src/ieee754/cordic/test/test_pipe.py
@@ -1,6 +1,7 @@
 from nmigen import Module, Signal
 from nmigen.back.pysim import Simulator, Passive
 from nmigen.test.utils import FHDLTestCase
+from nmigen.cli import rtlil
 
 from ieee754.cordic.sin_cos_pipeline import CordicBasePipe
 from ieee754.cordic.pipe_data import CordicPipeSpec
@@ -16,6 +17,13 @@ class SinCosTestCase(FHDLTestCase):
         pspec = CordicPipeSpec(fracbits=fracbits, rounds_per_stage=4)
         m.submodules.dut = dut = CordicBasePipe(pspec)
 
+        for port in dut.ports():
+            print ("port", port)
+
+        vl = rtlil.convert(dut, ports=dut.ports())
+        with open("test_cordic_pipe_sin_cos.il", "w") as f:
+            f.write(vl)
+
         z = Signal(dut.p.data_i.z0.shape())
         z_valid = Signal()
         ready = Signal()
Comment 5 Luke Kenneth Casson Leighton 2020-04-13 18:34:18 BST
ok so would you like to do the next step? which is, yep you guessed it:
IEEE754 FP sin and cos :)

i think... from doing some reading (last year) i have a vague recollection
that you have to be careful with the smaller ranges at one end, thus
only using a 45 degree range (+/-) and subtracting and adding accordingly
in order to compute an intermediary result, followed by adjustment
(add/subtract to/from fractions of pi).

if you naively try to do from 0-90 then at one end (i forget which), because
sin/cos is flat there (something to do with the atan2 tables getting
very small), you end up with inaccuracies creeping in and stacking up
rapidly.
Comment 6 Michael Nolan 2020-04-15 18:54:43 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> ok so would you like to do the next step? which is, yep you guessed it:
> IEEE754 FP sin and cos :)

I'm not really sure how to do this though. I know you did a similar thing on the divider, did you handle it as a float the entire way through or convert it to integer... somehow?

Doing it as a float means that the adders/subtractors need to be full floating point adders, like those in fpadd/, right? 

For converting the inputs to integers, it seems like there would be issues as the input angle approached 0 (because floats give more resolution there). Cole emailed me a paper (which I linked to in the resources page) that contained a hybrid cordic that switched to a taylor series approximation when the input became small enough. This seems like it might work reasonably well for handling the small input case. 

Thoughts?
Comment 7 Luke Kenneth Casson Leighton 2020-04-15 20:14:07 BST
(In reply to Michael Nolan from comment #6)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > ok so would you like to do the next step? which is, yep you guessed it:
> > IEEE754 FP sin and cos :)
> 
> I'm not really sure how to do this though. I know you did a similar thing on
> the divider, did you handle it as a float the entire way through or convert
> it to integer... somehow?

integer, and kept things at the same exponent.  i.e. we:

* adjusted the relative range of both numbers a and b so that they
  had the same exponent (-1 from exp, shift mantissa until exp(a)==exp(b)
* keeping enough bits (3x the IEEE754 length in some places) at all times
  so that there will be no loss of precision
* from that point on treated the mantissa as an integer, *completely ignoring*
  the exponent until
* finally at the end "re-normalising" the result.


> Doing it as a float means that the adders/subtractors need to be full
> floating point adders, like those in fpadd/, right? 

actually fpadd simply does the same trick above as well.  here's the original
jon dawson verilog code that we started from:

  https://github.com/dawsonjon/fpu/blob/master/adder/adder.v#L174

you can see there's no "actual" FPadds, it's literally, "line up the
exponents, do an integer add, then adjust/normalise the result".
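a toy python sketch of that trick, on made-up (exponent, mantissa) pairs with an explicit 24-bit mantissa (an illustration of the principle, not the fpadd code; no rounding, no special values, positive inputs only):

```python
def encode(x):
    """Encode a positive float as (exponent, 24-bit mantissa with the
    leading 1 explicit); toy format where value = m / 2**23 * 2**e."""
    e = 0
    while x >= 2.0:
        x /= 2.0
        e += 1
    while x < 1.0:
        x *= 2.0
        e -= 1
    return e, int(round(x * (1 << 23)))

def value(e, m):
    """Decode the toy format back to a float."""
    return m / (1 << 23) * 2.0 ** e

def fp_add(a, b):
    """'FP' add with no actual FP adder: align exponents, integer add,
    renormalise -- a sketch of jon dawson's approach only."""
    (ea, ma), (eb, mb) = a, b
    if ea < eb:                      # make a the larger-exponent operand
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= ea - eb                   # 1. line up the exponents
    m, e = ma + mb, ea               # 2. plain integer add
    while m >= (1 << 24):            # 3. renormalise the result
        m >>= 1
        e += 1
    return e, m
```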


> For converting the inputs to integers, it seems like there would be issues
> as the input angle approached 0 (because floats give more resolution there).

it's ok to have double (or even triple) the number of bits being used for the
fixed-integer computation.  so where the mantissa for FP32 is 23 bits, it's
ok to use 23*3 = 69 bits (plus a few more for rounding).


> Cole emailed me a paper (which I linked to in the resources page) that
> contained a hybrid cordic that switched to a taylor series approximation
> when the input became small enough. This seems like it might work reasonably
> well for handling the small input case. 

unfortunately, as we will (later) be using the CORDIC block for other purposes,
not just SIN/COS but as an actual opcode, if it doesn't do "actual CORDIC"
we're stuffed, there.

i suspect that it might be possible, as the size of the ATAN2 constants
gets smaller, to "shift" the result up slightly.

this we can do - test out - right now long before trying to move to FP.
each CordicStage can take an extra argument which specifies that dx, dy and
dz should be shifted by N extra bits.

the transfer of input to output needs to take into consideration that the
next stage is going to be shifted by N.

in this way, smaller numbers don't end up losing quite so many bits.
the only thing we have to make sure is that, throughout the whole pipeline,
there's enough extra bits such that the shifts do *not* end hitting the
ceiling of the maximum range of x (or y).
Comment 8 Luke Kenneth Casson Leighton 2020-04-15 20:16:26 BST
ah.

on other thing.

i don't really trust the standard c-library atan to be accurate enough.
i think we should use python bigfloat, here.  it can produce arbitrary
precision floating-point.

that way we can ensure that for FP64 CORDIC (64-bit sin/cos), by using
128-bit (or greater) atan2 tables, we get the full accuracy needed to
get the right answers for IEEE754 sin/cos FP64.

if we try to use the "standard" math atan2 library i guarantee we will
fail to get sufficient accuracy at the limits of the range.
Comment 9 Luke Kenneth Casson Leighton 2020-04-15 20:17:53 BST
cole wrote (on the list):
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/005993.html

Here's the link to the paper, which you've linked to on the resources
page. I just thought I'd include it in this bug/mailing list for ease of
access: "Low Latency and Low Error Floating-Point Sine/Cosine Function
Based TCORDIC Algorithm" (https://ieeexplore.ieee.org/document/7784797).
For project members that do not have IEEE explore access, please email
Michael or myself off-list.

I also think that this paper may be relevant, however, I am not entirely sure.
"Pipelined Range Reduction Based Truncated Multiplier" (https://ieeexplore.ieee.org/document/9062697).
Comment 10 Luke Kenneth Casson Leighton 2020-04-15 20:19:10 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
> cole wrote (on the list):

> "Pipelined Range Reduction Based Truncated Multiplier"
> (https://ieeexplore.ieee.org/document/9062697).

appreciated.  things like this are important to look at because we
can always learn something, even if it's "nice, however we're not
going to do it that way".
Comment 11 Michael Nolan 2020-04-15 20:34:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> integer, and kept things at the same exponent.  i.e. we:
> 
> * adjusted the relative range of both numbers a and b so that they
>   had the same exponent (-1 from exp, shift mantissa until exp(a)==exp(b)
> * keeping enough bits (3x the IEEE754 length in some places) at all times
>   so that there will be no loss of precision
> * from that point on treated the mantissa as an integer, *completely
> ignoring*
>   the exponent until
> * finally at the end "re-normalising" the result.

Ah ok, this makes more sense. 


> unfortunately, as we will (later) be using the CORDIC block for other
> purposes,
> not just SIN/COS but as an actual opcode, if it doesn't do "actual CORDIC"
> we're stuffed, there.

Ah I see. So at some point we'll be wanting to use this for general rotations, hyperbolic sin/cos, and such? 

> 
> i suspect that it might be possible, as the size of the ATAN2 constants
> gets smaller, to "shift" the result up slightly.
> 
> this we can do - test out - right now long before trying to move to FP.
> each CordicStage can take an extra argument which specifies that dx, dy and
> dz should be shifted by N extra bits.

I think I sort of understand - you want to scale up dx, dy (and presumably x and y) when they're small so that they have more bits of precision. 

One thing that occurred to me - since power (and I assume riscv) don't include any sin/cos opcodes but instead use a function in libc to calculate them, the cordic would only be used by the GPU, right? If that's the case, do sin/cos necessarily need to be as accurate as the result from libc?
Comment 12 Cole Poirier 2020-04-15 20:54:43 BST
(In reply to Michael Nolan from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #7)
> > unfortunately, as we will (later) be using the CORDIC block for other
> > purposes,
> > not just SIN/COS but as an actual opcode, if it doesn't do "actual CORDIC"
> > we're stuffed, there.
> 
> Ah I see. So at some point we'll be wanting to use this for general
> rotations, hyperbolic sin/cos, and such? 
> 
> > 
> > i suspect that it might be possible, as the size of the ATAN2 constants
> > gets smaller, to "shift" the result up slightly.
> > 
> > this we can do - test out - right now long before trying to move to FP.
> > each CordicStage can take an extra argument which specifies that dx, dy and
> > dz should be shifted by N extra bits.
> 
> I think I sort of understand - you want to scale up dx, dy (and presumably x
> and y) when they're small so that they have more bits of precision. 
> 
> One thing that occurred to me - since power (and I assume riscv) don't
> include any sin/cos opcodes but instead use a function in libc to calculate
> them, the cordic would only be used by the GPU, right? If that's the case,
> do sin/cos necessarily need to be as accurate as the result from libc?

Michael, here are some possibly relevant links to FPU accuracy... or rather inaccuracy. I've also added them to the resources page under "5.4 Past FPU Mistakes to learn from".

Intel Underestimates Error Bounds by 1.3 quintillion
Posted on October 9, 2014 to Random ASCII – tech blog of Bruce Dawson 

https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/

Intel overstates FPU accuracy 06/01/2013

http://notabs.org/fpuaccuracy/


Cole
Comment 13 Luke Kenneth Casson Leighton 2020-04-15 22:30:33 BST
(In reply to Michael Nolan from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #7)

> > * from that point on treated the mantissa as an integer, *completely
> > ignoring*
> >   the exponent until
> > * finally at the end "re-normalising" the result.
> 
> Ah ok, this makes more sense. 

FP is odd: internally it can only be done in fixed-point.  however it is actually relatively straightforward.  the only real tricky bits are that the rounding rules and their interactions can be seriously obtuse.

> 
> > unfortunately, as we will (later) be using the CORDIC block for other
> > purposes,
> > not just SIN/COS but as an actual opcode, if it doesn't do "actual CORDIC"
> > we're stuffed, there.
> 
> Ah I see. So at some point we'll be wanting to use this for general
> rotations, hyperbolic sin/cos, and such? 

yes.  a general purpose CORDIC engine with all possible modes (i think there are at least 3: circular/rotation, linear and hyperbolic) and to make an instruction that can access aaall those modes...

...yet for sin, cos, log etc use that exact same engine by setting some of those parameters manually.

> > 
> > i suspect that it might be possible, as the size of the ATAN2 constants
> > gets smaller, to "shift" the result up slightly.
> > 
> > this we can do - test out - right now long before trying to move to FP.
> > each CordicStage can take an extra argument which specifies that dx, dy and
> > dz should be shifted by N extra bits.
> 
> I think I sort of understand - you want to scale up dx, dy (and presumably x
> and y) when they're small so that they have more bits of precision. 

correct.

however we cannot lose sight of how *many* shifts occurred (in each stage and cumulatively) because each shift would be a doubling (halving?) of the ultimate answer.

i honestly do not know if it is a good idea to have dynamic shifting or not. it would mean testing both x and y, but the problem is that x might get small whilst y gets large.

keeping track of how many shifts occur, that count is actually an exponent! you double the mantissa (shift x/y up) however you need to *subtract* one from the exponent when doing so.

and because x can get large while y gets small and vice versa now you would need *two* exponents. worse than that, you need 2 copies of z because one z needs likewise shifting down and the other up.  too complicated.

hm.

i'm therefore tempted to suggest dropping one of x or y and setting a mode to decide whether to compute a sin or cos result (but not both).

see how it goes.


> One thing that occurred to me - since power (and I assume riscv) don't
> include any sin/cos opcodes but instead use a function in libc to calculate
> them, the cordic would only be used by the GPU, right?

well if we propose the opcodes for fpsin and fpcos and they are accepted by the OpenPOWER Foundation for a future version of POWER then yes some hardware would call the opcodes whilst others would emulate it using libc.  or libm.

> If that's the case,
> do sin/cos necessarily need to be as accurate as the result from libc?

we need to be Vulkan compliant however i think as long as it does not use massive amounts of power we can do IEEE754 compliance.

if you mean testing *against* libc (libm), i honestly do not know if libm is fully IEEE754 compliant and that's why i suggested testing against python bigfloat for now.

later we will definitely need to use jacob's algorithmic library, to make sure we have full compliance at the bit level.
Comment 14 Luke Kenneth Casson Leighton 2020-04-16 08:53:53 BST
btw that "close to zero" is i think what was meant about altering the range and adding fractions of pi to the result.

sin(x + 90°) equals cos(x)

therefore in the ranges where either gets close to zero you use the OTHER version.
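as a sketch of that folding (math.sin and math.cos here are merely stand-ins for the CORDIC core, and the function name is made up):

```python
import math

def sin_folded(x):
    """Evaluate sin on [0, pi/2] by folding the flat region onto cos:
    past 45 degrees use sin(x) = cos(pi/2 - x), so the core
    computation always sees an argument in [0, pi/4]."""
    if x > math.pi / 4:
        return math.cos(math.pi / 2 - x)   # the OTHER version
    return math.sin(x)
```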
Comment 15 Jacob Lifshay 2020-04-16 09:53:18 BST
normally, for implementing sin, you specifically want to try to switch to inputs close to zero rather than away, since that allows retaining much higher accuracy, since for small x: sin(x) ≈ x (≈ is the approximately equal operator).
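this is easy to check numerically; the absolute error of the approximation is the next Taylor term, bounded by x**3/6:

```python
import math

# for small x, sin(x) ~ x: the error is bounded by x**3/6 (next
# Taylor term), so the approximation improves quadratically in x
for x in (1e-3, 1e-5, 1e-7):
    assert abs(math.sin(x) - x) < x ** 3
```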
Comment 16 Luke Kenneth Casson Leighton 2020-04-16 10:48:19 BST
(In reply to Jacob Lifshay from comment #15)
> normally, for implementing sin, you specifically want to try to switch to
> inputs close to zero rather than away, since that allows retaining much
> higher accuracy, since for small x: sin(x) ≈ x (≈ is the approximately equal
> operator).

interesting.  i wonder how we will be able to match that up with CORDIC.
Comment 17 Michael Nolan 2020-04-16 16:36:03 BST
I'm working on converting the floating point input to the fixed point value needed by the cordic. I've got my module decoding the float into sign, exponent, and mantissa with FPNumDecode, and the mantissa is missing the (implied) leading 1 bit. I'm also looking at the modules of fpdiv (div0.py and nmigen_div_experiment.py), and they don't seem to add that bit back in. Is there a way to have FPNumDecode generate a mantissa with the leading 1 bit?
Comment 18 Jacob Lifshay 2020-04-16 17:22:59 BST
(In reply to Michael Nolan from comment #17)
> I'm working on converting the floating point input to the fixed point value
> needed by the cordic. I've got my module decoding the float into sign,
> exponent, and mantissa with FPNumDecode, and the mantissa is missing the
> (implied) leading 1 bit. I'm also looking at the modules of fpdiv (div0.py
> and nmigen_div_experiment.py), and they don't seem to add that bit back in.
> Is there a way to have FPNumDecode generate a mantissa with the leading 1
> bit?

I think that's done in ieee754.fpcommon.denorm:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD
Comment 19 Luke Kenneth Casson Leighton 2020-04-16 17:46:58 BST
(In reply to Jacob Lifshay from comment #18)
> (In reply to Michael Nolan from comment #17)
> > I'm working on converting the floating point input to the fixed point value
> > needed by the cordic. I've got my module decoding the float into sign,
> > exponent, and mantissa with FPNumDecode, and the mantissa is missing the
> > (implied) leading 1 bit. I'm also looking at the modules of fpdiv (div0.py
> > and nmigen_div_experiment.py), and they don't seem to add that bit back in.
> > Is there a way to have FPNumDecode generate a mantissa with the leading 1
> > bit?
> 
> I think that's done in ieee754.fpcommon.denorm:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/
> denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD

line 41, yes.

ok.

i only sort-of "understood" IEEE754 FP retrospectively, through long exposure,
having started initially from jon dawson's "working" verilog code and making
absolutely damn sure, throughout the entire six months of morphing it, that
it passed the unit tests *at all times*.  this made for a bit of a mess at
times.

i therefore substituted "unit tests pass 100%" for "understanding what the
hell is really going on".

line 41.

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD#l41

here you can see that the top bit of a (and later b) is, yes, set to 1
if and only if the exponent is not equal to -126 [for FP32]

back-track a little: the data structure FPSCData has had one *extra* bit
already allocated at the top (m[-1]) which is *specifically* allocated
for the purpose of allowing that extra implicit mantissa "1" to be
inserted *if it is needed*.

the reason why it is sometimes not needed is this:

when the exponent is "-126" (on FP32), this is "ridiculously small number"
but it is not the limit of the range of IEEE754.

what they allowed was: when exp=-126, you can express smaller and smaller
numbers by simply having fewer bits in the *mantissa*.  obviously, those
numbers become less and less accurate however this is an acceptable tradeoff.

however when you do that, you have to go from "implicit 1 in the top digit"
to "explicit 1 in the mantissa itself".

thus, we have that Mux detecting the transition condition between these
"implicit 1" and "really small numbers with explicit 1 *ALREADY* in
the mantissa".

the whole process is called "de-normalisation".

so basically, you can just "trust" that after the number has gone through
FPAddDeNorm, it *will* have the mantissa and exponent correctly set, such
that you can treat that "expanded mantissa" as just "A Fixed Integer".
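a python sketch of that decode step (illustrative only, not the FPNumDecode API), showing the implicit-1 Mux condition for FP32:

```python
import struct

def decode_fp32(f):
    """Recover (sign, exponent, mantissa) from an FP32 value, making the
    implicit leading 1 explicit -- an illustration of the denorm step."""
    bits = struct.unpack("<I", struct.pack("<f", f))[0]
    sign, exp, frac = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if exp == 0:
        # subnormal: exponent pinned at -126, and *no* implicit 1 --
        # any leading 1 is already explicit in the mantissa bits
        return sign, -126, frac
    # normal: insert the implicit 1 into the extra top bit (m[-1])
    return sign, exp - 127, (1 << 23) | frac
```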

btw i don't recommend trying multiplying or dividing either the input
or the output by pi inside the FP CORDIC code. we can use micro-coding for
that (and in the unit tests do the divide and/or multiply using import math
or import bigfloat).

so let us assume that we're implementing cospi and sinpi, here, *NOT*
actual cos or actual sin, ok?
Comment 20 Jacob Lifshay 2020-04-16 17:59:07 BST
(In reply to Luke Kenneth Casson Leighton from comment #19)
> btw i don't recommend trying multiplying or dividing either the input
> or the output by pi inside the FP CORDIC code. we can use micro-coding for
> that (and in the unit tests do the divide and/or multiply using import math
> or import bigfloat).
> 
> so let us assume that we're implementing cospi and sinpi, here, *NOT*
> actual cos or actual sin, ok?

That will allow us to have a much more accurate cospi and sinpi because we don't need multiple conversions. Also, since cospi and sinpi have a period of 2 which is a power of 2, it is super easy to do range reduction -- just some bit masking.
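in fixed point that reduction really is just a mask; a sketch (the function name and fixed-point format are made up for illustration):

```python
def reduce_sinpi_arg(z, fracbits):
    """Range-reduce the argument of sinpi/cospi in fixed point.
    z is the angle in units of pi with `fracbits` fractional bits;
    since sinpi has period 2, reduction to [0, 2) is a bit mask."""
    return z & ((2 << fracbits) - 1)
```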
Comment 21 Michael Nolan 2020-04-16 18:16:30 BST
(In reply to Jacob Lifshay from comment #18)

> 
> I think that's done in ieee754.fpcommon.denorm:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/
> denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD

Ah, thank you!
Comment 22 Luke Kenneth Casson Leighton 2020-04-17 13:29:40 BST
michael you may find this extremely informative to experiment with:

http://weitz.de/ieee/
Comment 23 Luke Kenneth Casson Leighton 2020-04-17 18:19:36 BST
argh, michael, python3-gmpy2 is not properly packaged on debian.

i can't even *install* python3-gmpy2 for python3.7!

 apt-get install python3-gmpy2=2.1.0~a4-1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 python3-gmpy2 : Depends: python3 (< 3.8) but 3.8.2-3 is to be installed
E: Unable to correct problems, you have held broken packages.
Comment 24 Luke Kenneth Casson Leighton 2020-04-17 18:28:44 BST
https://pypi.org/project/bigfloat/

i used python bigfloat about... 8 years ago, and just updated it:
it works well.  it also happens to work in a similar way to
python-gmpy2 by cython-wrapping mpfr... except that the debian
package maintainer for python-gmpy2 is not experienced enough
to do multi-python-version compiles.

python-bigfloat i specifically recommended because it has been
"wrapped" with additional functionality that makes it near-identical
to both the standard python "math" module *and* to the standard
python "float".

    from bigfloat import cos, sin, tan

therefore works
Comment 25 Jacob Lifshay 2020-04-17 18:31:57 BST
I can add sin, cos, tan, etc. support to simple-soft-float.
Comment 26 Michael Nolan 2020-04-17 19:02:42 BST
(In reply to Luke Kenneth Casson Leighton from comment #24)
> https://pypi.org/project/bigfloat/
> 
> i used python bigfloat about... 8 years ago, and just updated it:
> it works well.  it also happens to work in a similar way to
> python-gmpy2 by cython-wrapping mpfr... except that the debian
> package maintainer for python-gmpy2 is not experienced enough
> to do multi-python-version compiles.

Ok, I reverted the commit containing the gmpy stuff and added a similar one calculating the atan tables with bigfloat
Comment 27 Luke Kenneth Casson Leighton 2020-04-17 19:37:55 BST
(In reply to Michael Nolan from comment #26)
> (In reply to Luke Kenneth Casson Leighton from comment #24)
> > https://pypi.org/project/bigfloat/
> > 
> > i used python bigfloat about... 8 years ago, and just updated it:
> > it works well.  it also happens to work in a similar way to
> > python-gmpy2 by cython-wrapping mpfr... except that the debian
> > package maintainer for python-gmpy2 is not experienced enough
> > to do multi-python-version compiles.
> 
> Ok, I reverted the commit containing the gmpy stuff and added a similar one
> calculating the atan tables with bigfloat

with the debian package being a mess that's probably the sensible option
for now.

there's a trick btw, i think it's:

with bf.precision(300): # use 300 bits of precision
    x = bf.atan(BigFloat(2) ....)

which is why i recommended bigfloat: you can specify extremely large
FP precision.
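as a stdlib-only alternative sketch (not the project's actual table generator), the atan(2**-i) entries can even be computed exactly with Fractions, using Machin's formula for the i=0 entry where the plain Taylor series converges too slowly:

```python
from fractions import Fraction

def atan_frac(x, terms=60):
    """Taylor series atan(x) = x - x**3/3 + x**5/5 - ..., evaluated
    exactly in Fractions (converges fast for |x| <= 1/2)."""
    acc, term = Fraction(0), Fraction(x)
    for k in range(terms):
        acc += term * (-1) ** k / (2 * k + 1)
        term *= x * x
    return acc

def atan_table(nbits, iters):
    """atan(2**-i) entries rounded to nbits fixed-point fractional bits."""
    # i = 0: atan(1) = pi/4 via Machin: pi/4 = 4*atan(1/5) - atan(1/239)
    entries = [4 * atan_frac(Fraction(1, 5)) - atan_frac(Fraction(1, 239))]
    entries += [atan_frac(Fraction(1, 2 ** i)) for i in range(1, iters)]
    return [round(e * (1 << nbits)) for e in entries]
```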
Comment 28 Jacob Lifshay 2020-04-17 19:38:34 BST
(In reply to Michael Nolan from comment #26)
> Ok, I reverted the commit containing the gmpy stuff and added a similar one
> calculating the atan tables with bigfloat

Please also add the libgmp-dev package (or whatever the correct name is) to .gitlab-ci.yml
Comment 29 Michael Nolan 2020-04-17 19:42:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> with bf.precision(300): # use 300 bits of precision
>     x = bf.atan(BigFloat(2) ....)
> 
> which is why i recommended bigfloat: you can specify extremely large
> FP precision.

Yup. I went with quad precision for now.


(In reply to Jacob Lifshay from comment #28)
> Please also add the libgmp-dev package (or whatever the correct name is) to
> .gitlab-ci.yml

Done.
Comment 30 Luke Kenneth Casson Leighton 2020-04-17 20:36:31 BST
it's sorted now.  however i just found this: python3-mpmath.
it's a pure-python arbitrary-precision FP math library.  it doesn't
say whether it's IEEE754-compliant, however.
Comment 31 Jacob Lifshay 2020-04-17 20:56:05 BST
(In reply to Luke Kenneth Casson Leighton from comment #30)
> it's sorted now however i just found this:  python3-mpmath
> it's a pure-python-only arbitrary-precision FP math library.  it doesn't
> say whether it's IEEE754 compliant however.

ieee754-compliant basically means the results are rounded properly to an ieee754 binary/decimal format.  mpmath is instead designed to give as many digits of accuracy as you want: it emulates mathematical real/complex numbers rather than ieee754 arithmetic.
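the distinction can be illustrated with the stdlib `decimal` module, which follows the same model as mpmath (dial up the working precision, rather than round to a fixed IEEE754 format) -- this is an analogy, not mpmath itself:

```python
from decimal import Decimal, getcontext

# working precision is something you choose, like mpmath's mp.dps --
# it is not tied to any IEEE754 binary format
getcontext().prec = 50
x = Decimal(1) / Decimal(7)   # 50 significant digits of 1/7
```

an ieee754 binary64 division would instead round the same result to exactly 53 bits of significand, no more, no less.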
Comment 32 Jacob Lifshay 2020-04-17 20:59:16 BST
(In reply to Michael Nolan from comment #29)
> (In reply to Jacob Lifshay from comment #28)
> > Please also add the libgmp-dev package (or whatever the correct name is) to
> > .gitlab-ci.yml
> 
> Done.

You used tabs instead of spaces, which broke the yaml parser, please fix.
Comment 33 Luke Kenneth Casson Leighton 2020-04-18 22:52:58 BST
rats.  i just realised, the dynamic shifting idea isn't going to work.  the problem is that the atan table value may be small whilst x/y are large, *or vice versa*.

so this needs large shifters and complex logic; i don't think it's worth it.
Comment 34 Luke Kenneth Casson Leighton 2020-04-20 22:51:39 BST
this may also help to understand IEEE754 FP
https://imgs.xkcd.com/comics/garbage_math.png
Comment 35 Michael Nolan 2020-05-05 14:17:18 BST
So I think I got the floating point cordic all the way working yesterday. It converts the floating point input to a fixed point 50 bit number, calculates the sine and cosine, then converts them back to floating point.

However, for the conversion back to floating point I had to write my own class to do so, as none of the existing ones (used in fpadd for instance) support *two* outputs. I don't imagine we want to continue using my implementation, so what would be the best way of going about this?

I also needed to write a module for counting the number of leading zeros in the fixed point number. I don't imagine this would only be useful to the cordic; where would be the best place to put it? fpcommon?
Comment 36 Luke Kenneth Casson Leighton 2020-05-05 15:04:06 BST
(In reply to Michael Nolan from comment #35)
> So I think I got the floating point cordic all the way working yesterday. It
> converts the floating point input to a fixed point 50 bit number, calculates
> the sine and cosine, then converts them back to floating point.

hurrah!

(just running the tests, you forgot to commit cordic/renormalize.py.
 for future reference, use "git status" and you should spot missing
 files, fairly easily.  that's if we're properly maintaining ".gitignore")


> However, for the conversion back to floating point I had to write my own
> class to do so, as none of the existing ones (used in fpadd for instance)
> support *two* outputs. 

yeah, keenly aware of that.

> I don't imagine we want to continue using my
> implementation, so what would be the best way of going about this?

hmmm hmmm well we will actually need double-output functions at some
point (and more): NORMALISE, DOTPRODUCT, they're all multi-output.

however for the plain "SIN" and "COS" operations - and others that
will use "General CORDIC" - we simply *drop* the output on the floor
if it's unused.

if however we do macro-op fusion, spotting two (commonly-used)
operations together, the two results *will* be needed.


> I also needed to write a module for counting the number of leading zeros in
> the fixed point number. 

ah goood.  actually i think... somewhere... very ineptly-named, there
is a class that does this.  probably "MultiShiftRMerge".

don't worry about it :)

> I don't imagine this would only be useful to the
> cordic; where would be the best place to put it? fpcommon?

on one hand i'd be inclined to suggest nmutil (because CLZ is pretty
useful, generally)

on the other, it's a basic "ALU" operation.

hmm hmm...  the decision would be based then on whether CLZ (count leading
zeros) has uses *outside* of an ALU, and my feeling is, it probably does.
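as a reference point, the behaviour being discussed is tiny -- a pure-python sketch of CLZ (not the nmigen module itself):

```python
def clz(value, width):
    # count leading zeros of a width-bit unsigned value,
    # scanning from the most significant bit downwards
    for i in range(width):
        if value & (1 << (width - 1 - i)):
            return i
    return width  # value == 0: every bit is a leading zero
```

in the CORDIC it is what drives renormalisation: the leading-zero count becomes the exponent adjustment when converting fixed-point back to floating point.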

actually, just looking at it, i really like it!  it reminds me of the
reference that Jacob put... what was it... "Common Task something"?
Comment 37 Luke Kenneth Casson Leighton 2020-05-05 15:06:42 BST
ah excellent!  you did a FSM-based version as well, that's superb.
if the pipelined version turns out to be insanely gate-hungry, or
just not that important, we can use it.

also as a "general-purpose library" it's extremely useful to have
because other developers might not want a full pipeline version.
Comment 38 Michael Nolan 2020-05-05 15:29:33 BST
(In reply to Luke Kenneth Casson Leighton from comment #36)
> 
> (just running the tests, you forgot to commit cordic/renormalize.py.
>  for future reference, use "git status" and you should spot missing
>  files, fairly easily.  that's if we're properly maintaining ".gitignore")

Oops, fixed!
> 
> 
> > However, for the conversion back to floating point I had to write my own
> > class to do so, as none of the existing ones (used in fpadd for instance)
> > support *two* outputs. 
> 
> yeah, keenly aware of that.
> 
> > I don't imagine we want to continue using my
> > implementation, so what would be the best way of going about this?
> 
> hmmm hmmm well we will actually need double-output functions at some
> point (and more): NORMALISE, DOTPRODUCT, they're all multi-output.
> 
> however for the plain "SIN" and "COS" operations - and others that
> will use "General CORDIC" - we simply *drop* the output on the floor
> if it's unused.
> 
> if however we do macro-op fusion, spotting two (commonly-used)
> operations together, the two results *will* be needed.

Ah ok, so maybe I should add an input that selects one or the other? (kinda sucks that we need to calculate both for the cordic to work, oh well)


(In reply to Luke Kenneth Casson Leighton from comment #37)
> ah excellent!  you did a FSM-based version as well, that's superb.
> if the pipelined version turns out to be insanely gate-hungry, or
> just not that important, we can use it.

Unfortunately, that one isn't as complete as the pipelined one. It doesn't handle denormals or 0 well because I had to roll my own denormalization stage instead of using the existing one from the multiplier. And it doesn't (yet) support conversion back into floating point at the end. 

> 
> also as a "general-purpose library" it's extremely useful to have
> because other developers might not want a full pipeline version.

Makes sense.
Comment 39 Michael Nolan 2020-05-05 15:42:02 BST
(In reply to Luke Kenneth Casson Leighton from comment #37)
> ah excellent!  you did a FSM-based version as well, that's superb.
> if the pipelined version turns out to be insanely gate-hungry, or
> just not that important, we can use it.

Ooof, yeah. Running synthesis on the 32 bit pipelined version (4 cordic rounds per pipeline stage) gives somewhere around 77k gates.
Comment 40 Michael Nolan 2020-05-05 15:52:35 BST
Luke, your latest commit (74af221) breaks the tests for the pipelined fpsin:

======================================================================
ERROR: test_rand (__main__.SinCosTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fp_pipe.py", line 80, in test_rand
    self.run_test(iter(inputs), outputs=iter(outputs))
  File "test_fp_pipe.py", line 21, in run_test
    vl = rtlil.convert(dut, ports=dut.ports())
  File "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/back/rtlil.py", line 1017, in convert
    fragment = ir.Fragment.get(elaboratable, platform).prepare(**kwargs)
  File "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/hdl/ir.py", line 540, in prepare
    raise TypeError("Only signals may be added as ports, not {!r}"
TypeError: Only signals may be added as ports, not (slice (sig x) 0:1)

It looks like dut.ports() is returning a bunch of slices for the output signals and I'm not sure why:

[(sig p_valid_i), (sig p_ready_o), (sig a), (sig muxid), (sig op), (sig n_ready_i), (sig n_valid_o), (slice (sig x) 0:1), (slice (sig x) 1:2), (slice (sig x) 2:3), (slice (sig x) 3:4), (slice (sig x) 4:5), (slice (sig x) 5:6), (slice (sig x) 6:7), (slice (sig x) 7:8), (slice (sig x) 8:9), (slice (sig x) 9:10), (slice (sig x) 10:11), (slice (sig x) 11:12), (slice (sig x) 12:13), (slice (sig x) 13:14), (slice (sig x) 14:15), (slice (sig x) 15:16), (slice (sig x) 16:17), (slice (sig x) 17:18), (slice (sig x) 18:19), (slice (sig x) 19:20), (slice (sig x) 20:21), (slice (sig x) 21:22), (slice (sig x) 22:23), (slice (sig x) 23:24), (slice (sig x) 24:25), (slice (sig x) 25:26), (slice (sig x) 26:27), (slice (sig x) 27:28), (slice (sig x) 28:29), (slice (sig x) 29:30), (slice (sig x) 30:31), (slice (sig x) 31:32), (slice (sig y) 0:1), (slice (sig y) 1:2), (slice (sig y) 2:3), (slice (sig y) 3:4), (slice (sig y) 4:5), (slice (sig y) 5:6), (slice (sig y) 6:7), (slice (sig y) 7:8), (slice (sig y) 8:9), (slice (sig y) 9:10), (slice (sig y) 10:11), (slice (sig y) 11:12), (slice (sig y) 12:13), (slice (sig y) 13:14), (slice (sig y) 14:15), (slice (sig y) 15:16), (slice (sig y) 16:17), (slice (sig y) 17:18), (slice (sig y) 18:19), (slice (sig y) 19:20), (slice (sig y) 20:21), (slice (sig y) 21:22), (slice (sig y) 22:23), (slice (sig y) 23:24), (slice (sig y) 24:25), (slice (sig y) 25:26), (slice (sig y) 26:27), (slice (sig y) 27:28), (slice (sig y) 28:29), (slice (sig y) 29:30), (slice (sig y) 30:31), (slice (sig y) 31:32), (sig muxid), (sig op)]
Comment 41 Luke Kenneth Casson Leighton 2020-05-05 15:53:32 BST
(In reply to Michael Nolan from comment #38)
> (In reply to Luke Kenneth Casson Leighton from comment #36)
> > 
> > (just running the tests, you forgot to commit cordic/renormalize.py.
> >  for future reference, use "git status" and you should spot missing
> >  files, fairly easily.  that's if we're properly maintaining ".gitignore")
> 
> Oops, fixed!

been there :)

> > if however we do macro-op fusion, spotting two (commonly-used)
> > operations together, the two results *will* be needed.
> 
> Ah ok, so maybe I should add an input that selects one or the other?

yes.  an "Op".  this is what the "Op" field is for... see... ahh.. 1sec...
DivPipeCoreConfig and DivPipeCoreOperation (etc.)

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/core.py;h=fea59dce0c4f73d652985faa8181d5af97d5877c;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l54

which is set up as the "context" (ctx) here:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/div_pipe.py;h=d7b81729499b5d763eb1d150b031dceeefecbda9;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l42

and it is of type "FPPipeContext", declared here:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/getop.py;h=117d29223d23ab8176382e76b2dd129382735721;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l65

and if you look at lines 63 and 65 you can see that you can EITHER:

* specify just a bitwidth (in which case you get a signal and it's
  your responsibility to manage conversion from the "op" into that Signal)

OR

* specify an *actual class* as a parameter to FPPipeContext, and an instance
  *of* that class will be instantiated on your behalf.

it's actually really important to conform to this API because it's where
the "cancellation", identifier information etc. critically necessary for
the use of the pipeline in the scoreboard system is all placed.



> (kinda
> sucks that we need to calculate both for the cordic to work, oh well)

:)

divide is the same.  it produces two results: DIV and REM.

fascinatingly CORDIC and DIV (and SQRT and RSQRT) all use the same
core concept / algorithm, so it is no surprise that there are two results.

 
> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > ah excellent!  you did a FSM-based version as well, that's superb.
> > if the pipelined version turns out to be insanely gate-hungry, or
> > just not that important, we can use it.
> 
> Unfortunately, that one isn't as complete as the pipelined one. It doesn't
> handle denormals or 0 well because I had to roll my own denormalization
> stage instead of using the existing one from the multiplier. And it doesn't
> (yet) support conversion back into floating point at the end. 

ok not a problem.  it got you bootstrapped to where you needed to go.
let's leave it for now, but do document that (briefly).

> > 
> > also as a "general-purpose library" it's extremely useful to have
> > because other developers might not want a full pipeline version.
> 
> Makes sense.

can you do that move?  the unit test as well.
Comment 42 Luke Kenneth Casson Leighton 2020-05-05 15:55:26 BST
(In reply to Michael Nolan from comment #40)
> Luke, your latest commit (74af221) breaks the tests for the pipelined fpsin:

yep i commented it out for now (and explained why!)

> ======================================================================
> ERROR: test_rand (__main__.SinCosTestCase)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_fp_pipe.py", line 80, in test_rand
>     self.run_test(iter(inputs), outputs=iter(outputs))
>   File "test_fp_pipe.py", line 21, in run_test
>     vl = rtlil.convert(dut, ports=dut.ports())
>   File
> "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/back/rtlil.py", line
> 1017, in convert
>     fragment = ir.Fragment.get(elaboratable, platform).prepare(**kwargs)
>   File "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/hdl/ir.py",
> line 540, in prepare
>     raise TypeError("Only signals may be added as ports, not {!r}"
> TypeError: Only signals may be added as ports, not (slice (sig x) 0:1)
> 
> It looks like dut.ports() is returning a bunch of slices for the output
> signals and I'm not sure why:

yep it means that in the data_i or data_o you've missed a "yield from"
and used "yield" instead.  or vice-versa.

"yield from x" would iterate *through x* and that tells nmigen "please
return slices of single-bits of x".
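a plain-python illustration of the difference (no nmigen involved; the list stands in for a multi-bit Signal):

```python
def as_ports_yield_from(sig):
    # "yield from" iterates *through* sig: one item per element
    # (for an nmigen Signal, one slice per bit)
    yield from sig

def as_ports_yield(sig):
    # "yield" returns sig itself, once, as a single object
    yield sig

sig = ["bit0", "bit1", "bit2"]
```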

i'll take a look
Comment 43 Michael Nolan 2020-05-05 15:58:40 BST
(In reply to Luke Kenneth Casson Leighton from comment #41)
> 
> yes.  an "Op".  this is what the "Op" field is for... see... ahh.. 1sec...
> DivPipeCoreConfig and DivPipeCoreOperation (etc.)
> 
Ok 
> 
> it's actually really important to conform to this API because it's where
> the "cancellation", identifier information etc. critically necessary for
> the use of the pipeline in the scoreboard system is all placed.

Oh hold on, I thought that cancellation bit was for the pipeline to produce a result early. Since the cordic doesn't do that, I left it out. It sounds like I need to add that back in huh?

> 
> can you do that move?  the unit test as well.

Sorry, you mentioned nmutil and never said which place you thought it should go. nmutil or fpcommon?
Comment 44 Luke Kenneth Casson Leighton 2020-05-05 16:00:32 BST
found it.  the code was doing exactly as you'd asked: walking through
self.x and self.y by iteration.

"yield from" will treat the object *as* an iterator, returning calls
to object.__iter__() and returning those objects in sequence, as
*MULTIPLE* results.

"yield" will return the OBJECT - once and only once as *THE* (one and
only) result.

if the Signal did not have an __iter__ function, you would have got an
error.


diff --git a/src/ieee754/cordic/pipe_data.py b/src/ieee754/cordic/pipe_data.py
index c3f600a..d83dbe7 100644
--- a/src/ieee754/cordic/pipe_data.py
+++ b/src/ieee754/cordic/pipe_data.py
@@ -28,8 +28,8 @@ class CordicOutputData:
         self.muxid = self.ctx.muxid
 
     def __iter__(self):
-        yield from self.x
-        yield from self.y
+        yield self.x
+        yield self.y
         yield from self.ctx
 
     def eq(self, i):
Comment 45 Michael Nolan 2020-05-05 16:00:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #42)
> 
> yep it means that in the data_i or data_o you've missed a "yield from"
> and used "yield" instead.  or vice-versa.
> 
> "yield from x" would iterate *through x* and that tells nmigen "please
> return slices of single-bits of x".
> 
> i'll take a look

Aha! pipe_data.py has the following:

    def __iter__(self):
        yield from self.x
        yield from self.y
        yield from self.ctx

I'll try changing those to yields and see if that works
Comment 46 Luke Kenneth Casson Leighton 2020-05-05 16:10:21 BST
(In reply to Michael Nolan from comment #43)
> (In reply to Luke Kenneth Casson Leighton from comment #41)
> > 
> > yes.  an "Op".  this is what the "Op" field is for... see... ahh.. 1sec...
> > DivPipeCoreConfig and DivPipeCoreOperation (etc.)
> > 
> Ok 
> > 
> > it's actually really important to conform to this API because it's where
> > the "cancellation", identifier information etc. critically necessary for
> > the use of the pipeline in the scoreboard system is all placed.
> 
> Oh hold on, I thought that cancellation bit was for the pipeline to produce
> a result early. Since the cordic doesn't do that, I left it out. It sounds
> like I need to add that back in huh?

yyup :)

all operations need to be "cancellable" by way of a globally-broadcast
mask, equal in width to the number of Concurrent Computation Unit "front-ends".
see https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

"Concurrent Computation Unit" diagram (which is from Mitch's Scoreboard
Mechanics, 2nd chapter).

so there will be... how many FP stages? 8? eek!  12! :) so we will need
*TWELVE* Function Unit front-ends, one to manage and track every single
result, and "re-match" that result *BACK* to the FU when it pops out.

now, we _could_ not have cancellation.  we could instead do this:

* let the result come out and when it arrives, throw it away.

  this means we have a Function Unit completely wasted, blocked and
  unusable for up to 12 cycles

* create a complex "ID" system (at least double the number of FU frontends)

  not a big fan of this method.  a counter wraps around, its ID is
  associated with "new" requests, and is cleared to zero if the result
  is not to be put into the result FU buffer.


the simpler way is: have a unary "broadcast" mask that goes right the
way through the whole pipeline (one bit per "FU") and we just... simply...
pull that bit.

result dies.

FunctionUnit can rest.

job done.
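a rough sketch of that unary mask idea (the names here are hypothetical, not the actual MaskCancellable API):

```python
NUM_FUS = 8  # one mask bit per Function Unit front-end

def cancel(mask, fu):
    # pull the FU's bit out of the broadcast mask: any in-flight
    # result tagged with that bit simply dies in the pipeline
    return mask & ~(1 << fu)

live = (1 << NUM_FUS) - 1   # all FUs live
live = cancel(live, 3)      # FU 3's result is now dead
```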


so yes, you'll need to put that back in.  by conforming to the API,
you should *NOT* have to worry about whether the data is cancelled or
not, because the "MaskCancellable" Pipeline Mix-In class takes care
of that *automatically*.

but.

if FPPipeContext is *not used*... that is *not possible to do*.


> 
> > 
> > can you do that move?  the unit test as well.
> 
> Sorry, you mentioned nmutil and never said which place you thought it should
> go. nmutil or fpcommon?

oh sorry.  nmutil.
Comment 47 Luke Kenneth Casson Leighton 2020-05-05 16:10:57 BST
(In reply to Michael Nolan from comment #45)

> Aha! pipe_data.py has the following:
> 
>     def __iter__(self):
>         yield from self.x
>         yield from self.y
>         yield from self.ctx
> 
> I'll try changing those to yields and see if that works

done already.  git pull.  sorry! :)
Comment 48 Michael Nolan 2020-05-05 16:13:16 BST
(In reply to Luke Kenneth Casson Leighton from comment #46)

> 
> oh sorry.  nmutil.

Ok!

(In reply to Luke Kenneth Casson Leighton from comment #47)
> (In reply to Michael Nolan from comment #45)
> 
> > Aha! pipe_data.py has the following:
> > 
> >     def __iter__(self):
> >         yield from self.x
> >         yield from self.y
> >         yield from self.ctx
> > 
> > I'll try changing those to yields and see if that works
> 
> done already.  git pull.  sorry! :)

Yep, saw that after I posted my comment :P
Comment 49 Luke Kenneth Casson Leighton 2020-05-05 17:03:05 BST
(In reply to Michael Nolan from comment #39)
> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > ah excellent!  you did a FSM-based version as well, that's superb.
> > if the pipelined version turns out to be insanely gate-hungry, or
> > just not that important, we can use it.
> 
> Ooof, yeah. Running synthesis on the 32 bit pipelined version (4 cordic
> rounds per pipeline stage) gives somewhere around 77k gates.

uh-huhn :)

so you see why i thought it might be a teeeny tiny bit of a good idea
to make it as general-purpose as possible?  used for a batch of
operations not just SIN/COS.

and why most GPUs have a massive concerted effort to shrink transcendentals
as much as possible.

Mitch's patents (bought by Samsung) actually use... what is it...
they use cubic equations, which is extremely efficient and, more than
that, accurate to within 1.0 ULP (unit in last place) for the majority
of FP32 operations.

he gets about a quarter of the gate count of other SIN/COS hard macros.

if the gate count is too massive for FP64, we may *have* to use the
FSM-based version for FP64.  the priority as a GPU is on FP32.
Comment 50 Luke Kenneth Casson Leighton 2020-05-05 17:12:25 BST
btw 12 stages is too many, because we need a minimum of one FunctionUnit
"managing" (matching up) source operands and results.

each extra Function Unit added creates an Order N^2 increase in the overall
size of the Dependency Matrices.

so - and i don't mean right now - if this can be either:

* cut into two separate pipelines (partial data fed back as a micro-op)
* number of combinatorial blocks increased to 5 or 6

we stand a chance of keeping the FU count down (8 is ok).

have to think that through carefully, how to do micro-ops.
Comment 51 Luke Kenneth Casson Leighton 2020-05-05 17:16:00 BST
the other option:

keep the long pipeline lengths: we *know* that there's more stages than
there are FUs, and that's just "tough luck".  we know that with only
8 FUs and 12 pipeline stages, 4 of those at any one time will run
empty, and we just... live with that.
Comment 52 Michael Nolan 2020-05-05 17:28:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #50)
> btw 12 stages is too many, because we need a minimum of one FunctionUnit
> "managing" (matching up) source operands and results.
> 
> each extra Function Unit added creates an Order N^2 increase in the overall
> size of the Dependency Matrices.
> 
> so - and i don't mean right now - if this can be either:
> 
> * cut into two separate pipelines (partial data fed back as a micro-op)
> * number of combinatorial blocks increased to 5 or 6
> 
> we stand a chance of keeping the FU count down (8 is ok).
> 
> have to think that through carefully, how to do micro-ops.

Sure. I don't know about splitting it up into multiple uops, but reducing the number of stages is pretty easy. I added a parameter to the module called rounds_per_stage which governs how many cordic rounds go in each pipeline stage. Increasing that number will decrease the number of pipeline stages. 
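the stage-count arithmetic is simple enough to spell out (an illustrative helper with made-up names, mirroring the rounds_per_stage parameter):

```python
import math

def num_stages(total_rounds, rounds_per_stage):
    # a pipeline register is inserted after every `rounds_per_stage`
    # CORDIC rounds, so the stage count is the rounded-up quotient
    return math.ceil(total_rounds / rounds_per_stage)
```

so e.g. with 48 rounds, 4 rounds per stage gives 12 stages, and bumping to 6 rounds per stage brings it down to 8.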

(In reply to Luke Kenneth Casson Leighton from comment #51)
> the other option:
> 
> keep the long pipeline lengths: we *know* that there's more stages than
> there are FUs, and that's just "tough luck".  we know that with only
> 8 FUs and 12 pipeline stages, 4 of those at any one time will run
> empty, and we just... live with that.

I was going to ask if we could do something like this. Since I don't think the cordic will be used *that* often, this seems like a reasonable thing to do.
Comment 53 Luke Kenneth Casson Leighton 2020-05-05 18:02:11 BST
(In reply to Michael Nolan from comment #52)
> (In reply to Luke Kenneth Casson Leighton from comment #50)
> > btw 12 stages is too many, because we need a minimum of one FunctionUnit
> > "managing" (matching up) source operands and results.
> > 
> > each extra Function Unit added creates an Order N^2 increase in the overall
> > size of the Dependency Matrices.
> > 
> > so - and i don't mean right now - if this can be either:
> > 
> > * cut into two separate pipelines (partial data fed back as a micro-op)
> > * number of combinatorial blocks increased to 5 or 6
> > 
> > we stand a chance of keeping the FU count down (8 is ok).
> > 
> > have to think that through carefully, how to do micro-ops.
> 
> Sure. I don't know about splitting it up into multiple uops,

neither do i at the moment!  i think i might have got confused about
sharing the MUL unit with FPMUL.

>  but reducing
> the number of stages is pretty easy. I added a parameter to the module
> called rounds_per_stage which governs how many cordic rounds go in each
> pipeline stage. Increasing that number will decrease the number of pipeline
> stages.

fantastic, that's exactly the kind of parameterisation _we_ need,
however as a general-purpose IEEE754 FP library, it's exactly
the kind of parameterisation that other _users_ need :)

> 
> (In reply to Luke Kenneth Casson Leighton from comment #51)
> > the other option:
> > 
> > keep the long pipeline lengths: we *know* that there's more stages than
> > there are FUs, and that's just "tough luck".  we know that with only
> > 8 FUs and 12 pipeline stages, 4 of those at any one time will run
> > empty, and we just... live with that.
> 
> I was going to ask if we could do something like this. Since I don't think
> the cordic will be used *that* often, this seems like a reasonable thing to
> do.

yehyeh.

ok what else can we throw in here?  LOG1P is probably a good thing to try

this is a great explanation:
http://www.mclenegan.com/public/thesis.pdf

section 4.4.1 explains "rotation mode"

however we also need *vector* mode (4.4.2)

i know - somewhere - i've seen a python implementation of vector mode.
it also specified, as you can see, the three other modes: circular,
linear, hyperbolic.

ah!  i know why we want to keep the two answers: see p39 of that
thesis: you can use them (with some post-processing) to do
tan.

also, see circular-vectoring mode, it ends up computing the
normal distance and... and the angle?  ohh, that's for... darn
what was it... it's for arcs.  there's a name for this... sorry
i forget what it's called right now.

https://www.ijert.org/vhdl-implementation-of-cordic-algorithm-for-wireless-lan

this version, p14, explains much clearer that "rotate" mode goes
by angle and computes x,y

"vector" mode the vector is rotated to be flat against the x-axis,
recording the angle (Z) needed to do so.

this one is a survey of CORDIC algorithms:
http://www.ee.ic.ac.uk/pcheung/teaching/ee3_DSD/crdcsrvy.pdf

it shows - *briefly* - how to use them to do different things, but
it also, by being a survey, has some really useful insights into
various optimisations.

so can i suggest, first adding the infrastructure back in to allow
cancellation and operations (ctx), then see if you can add one extra
"mode" (the m=1, m=0, m=-1 thing seems trivial, and gives the
circular, linear and hyperbolic modes, respectively).

"circular" mode is covered with the output being SIN/COS.
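for reference, circular rotation mode fits in a dozen lines of plain python (floats only, converges for |theta| below roughly 1.74 rad, pre-scaled by 1/K up front):

```python
import math

def cordic_sincos(theta, rounds=32):
    # gain K = prod(1/sqrt(1 + 2**-2i)); start the vector at x = K
    # so the final result comes out unscaled
    K = 1.0
    for i in range(rounds):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(rounds):
        d = 1.0 if z >= 0 else -1.0   # rotate so as to drive z to zero
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return y, x   # (sin(theta), cos(theta))
```

note the simultaneous tuple assignment: next x and y must both be computed from the *old* x and y, which in hardware is exactly the pair of cross-coupled add/shift paths per round.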

ah ha!  here's a useful suite of implementations (annoyingly in a
Jupyter notebook)
https://github.com/suyashmahar/cordic-algorithm-python/blob/master/cordic_implementation.ipynb

extracting the two modes (with `import math` added so they run standalone):


import math

# coordinate-system selector: m = 1 circular, 0 linear, -1 hyperbolic
circular = 1
linear = 0
hyperbolic = -1


def ROM_lookup(iteration, coordinate):
    # per-iteration micro-rotation constant for each coordinate system
    # (note the circular case works in degrees)
    if coordinate == circular:
        return math.degrees(math.atan(2**(-1*iteration)))
    elif coordinate == linear:
        return 2**(-1*iteration)
    elif coordinate == hyperbolic:
        return math.atanh(2**(-1*iteration))


def rotation_mode(x, y, z, coordinate, iterations):
    a = 0.607252935   # = 1/K (unused here; scaling is left to the caller)

    x_val_list = []
    y_val_list = []
    z_val_list = []
    iterations_list = []

    current_x = x         # value of X on ith iteration
    current_y = y         # value of Y on ith iteration
    current_z = z         # value of Z on ith iteration

    # the result for i=0 doesn't exist in the hyperbolic
    # coordinate system, so start at i=1 there
    if coordinate == hyperbolic:
        i = 1
    else:
        i = 0

    flag = 0

    if iterations > 0:
        while i < iterations:
            # direction of rotation: drive Z towards zero
            if current_z < 0:
                di = -1
            else:
                di = +1
            next_z = current_z - di * ROM_lookup(i, coordinate)
            next_x = current_x - coordinate * di * current_y * (2**(-1*i))
            next_y = current_y + di * current_x * 2**(-1*i)

            current_x = next_x
            current_y = next_y
            current_z = next_z

            x_val_list.append(current_x)
            y_val_list.append(current_y)
            z_val_list.append(current_z)

            iterations_list.append(i)

            # iterations 4, 13 and 40 must each be repeated once
            # for hyperbolic convergence
            if coordinate == hyperbolic:
                if (i != 4) & (i != 13) & (i != 40):
                    i = i+1
                elif flag == 0:
                    flag = 1
                elif flag == 1:
                    flag = 0
                    i = i+1
            else:
                i = i+1
    return {'x': x_val_list, 'y': y_val_list, 'z': z_val_list,
            'iteration': iterations_list}


ahhh, i recognise this code :)  i've seen it before.  
def vector_mode(x, y, z, coordinate, iterations):
    a = 1.2075   # = 1/K (unused here; scaling is left to the caller)

    x_val_list = []
    y_val_list = []
    z_val_list = []
    iterations_list = []

    current_x = x         # value of X on ith iteration
    current_y = y         # value of Y on ith iteration
    current_z = z         # value of Z on ith iteration

    # this is necessary since the result for i=0 doesn't exist in the
    # hyperbolic coordinate system
    if coordinate == hyperbolic:
        i = 1
    else:
        i = 0

    flag = 0

    if iterations > 0:
        while i < iterations:
            # direction of rotation: drive Y towards zero
            di = -1 * math.copysign(1, current_y)
            next_x = current_x - coordinate * di * current_y * (2**(-1*i))
            next_y = current_y + di * current_x * 2**(-1*i)
            next_z = current_z - di * ROM_lookup(i, coordinate)

            current_x = next_x
            current_y = next_y
            current_z = next_z

            x_val_list.append(current_x)
            y_val_list.append(current_y)
            z_val_list.append(current_z)

            iterations_list.append(i)

            # iterations 4, 13 and 40 must each be repeated once
            # for hyperbolic convergence
            if coordinate == hyperbolic:
                if (i != 4) & (i != 13) & (i != 40):
                    i = i+1
                elif flag == 0:
                    flag = 1
                elif flag == 1:
                    flag = 0
                    i = i+1
            else:
                i = i+1
    return {'x': x_val_list, 'y': y_val_list, 'z': z_val_list,
            'iteration': iterations_list}

if you can drop that into a repository, get it working, write a
few simple experiments and see how it operates in each of the 6
modes, we're on a roll :)
Comment 54 Michael Nolan 2020-05-05 18:30:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #53)
> ok what else can we throw in here?  LOG1P is probably a good thing to try
> 
> i know - somewhere - i've seen a python implementation of vector mode.
> it also specified, as you can see, the three other modes: circular,
> linear, hyperbolic.
> 
> ah!  i know why we want to keep the two answers: see p39 of that
> thesis: you can use them (with some post-processing) to do
> tan.

Yep, that was my thought. 

> "vector" mode the vector is rotated to be flat against the x-axis,
> recording the angle (Z) needed to do so.
> 
> 
> if you can drop that into a repository, get it working, write a
> few simple experiments and see how it operates in each of the 6
> modes, we're on a roll :)

I can give it a whirl I suppose
Comment 55 Luke Kenneth Casson Leighton 2020-05-06 10:53:22 BST
(In reply to Michael Nolan from comment #54)

> > if you can drop that into a repository, get it working, write a
> > few simple experiments and see how it operates in each of the 6
> > modes, we're on a roll :)
> 
> I can give it a whirl I suppose

when you have a moment, i noticed that the two functions are actually
incredibly similar.  line-by-line it is a multiply or add, here or there.


btw i saw the commit putting ctx back in.  to test that, cookie-cut
FPADDMuxInOut in fpadd/pipeline.py and fpadd/test/test_fpadd_pipe.py

the ReservationStations class takes two parameters: the bitwidth
and the *number of Function Units*.  it uses ctx to drop the FU number
in along the pipeline, and on the output it uses that FU number to
*re-associate* the result with the correct Function Unit.

runfp knows exactly how ReservationStations works, so there is no need to
write any actual unit test infrastructure.

there is a single-argument version of runfp (i think!) so... oh wait,
you call runfp with "singleop=True".

then instead of passing in "add" as a function to compute the unit
test result, you pass in "math.cos".

that will throw a massive suite of corner-cases at the pipeline, so
be prepared for it to fail as we still have to do INF, NAN, etc. :)