see the https://libre-soc.org/resources/ page for more links to CORDIC, and bug #127 for research.

CORDIC is a binary-search algorithm that can compute a huge number of other functions: SIN, COS, LOG and more. it will need to be pipelined, split out in a similar fashion to the FPDIV pipeline. it would be good to have a multi-radix version which, instead of a single bit per combinatorial block, will do N bits by generating a cascade of adds and comparators. hopefully this will be more efficient than a cascade of combinatorial single-bit blocks.

the ATAN2 tables are going to need to be big enough to do 64-bit sin/cos etc. they could get huge. they will be ROM, but still huge. although it would be nice to consider having the tables in exponent-mantissa form, thus maintaining extra accuracy without wasting bits by needing really large adders, it may be a good idea to add that as a secondary phase.

also nice to have: complex numbers. these make certain other algorithms possible as well. again, a tertiary phase.

the input and output should be done as integers only, "scaled", and then, after demonstrating (by converting the input in unit tests to integers) that sin and cos work, *after* that the next phase will be to make an IEEE754 FP cos and sin, plus unit tests, as a sub-bug of this one. then other IEEE754 functions can be added as well.

this is a simple CORDIC example which does sin and cos in myhdl; the nmigen code should be near-identical in the inner loop: http://www.myhdl.org/docs/examples/sinecomp/
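as a sketch of that inner loop (an illustrative pure-Python reference model, not the pipeline code: FRACBITS, ITERATIONS and all names here are assumptions), the rotation-mode sin/cos CORDIC looks like this:

```python
import math

FRACBITS = 20               # fixed-point fraction bits (illustrative choice)
SCALE = 1 << FRACBITS
ITERATIONS = 24             # rounds beyond ~FRACBITS stop helping, matching
                            # the observation about iteration counts below

# the ATAN table: atan(2^-i), scaled to fixed point (ROM in hardware)
ATAN = [int(round(math.atan(2.0 ** -i) * SCALE)) for i in range(ITERATIONS)]

# the CORDIC gain K = prod(cos(atan(2^-i))): pre-load x with K so it cancels
K = 1.0
for i in range(ITERATIONS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_sincos(angle):
    """Return (sin, cos) of angle in radians, |angle| <= pi/2."""
    x, y = int(round(K * SCALE)), 0
    z = int(round(angle * SCALE))
    for i in range(ITERATIONS):
        # binary search on the residual angle z: rotate by +/- atan(2^-i)
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN[i]
    return y / SCALE, x / SCALE
```

each loop iteration corresponds to one combinatorial stage in a pipelined version; the multi-radix idea above would fuse N of these per stage.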

btw michael, we did FPDIV in separate passes as well (took a couple of months). jacob researched algorithms that would cover div, sqrt and rsqrt, and did them as a simple integer only pipeline. then i hooked that up to the FP infrastructure and it worked well. i can recommend an incremental approach here as well: start from something small such as converting the myhdl code to nmigen in order to avoid bogging down with the FP infrastructure *and* with the many variations of CORDIC. it is only around 70 lines of code so very basic: don't even pipeline it at first, make it a FSM with a counter. then once that is achieved move on to the next small incremental improvement.

Created attachment 45 [details]
found random source code (python) for cordic

last year i did a lot of exploration of CORDIC, and found this in a working tree, not checked in. i added a "diff" to the original, so it can be seen to be accurate compared to python math. i think it might be a good idea to try it against python bigfloat, to see whether higher accuracy in the atan table creates better accuracy in the output.

the number of iterations required is quite enormous: *45* iterations to get 14 D.P. accuracy. however doing more iterations doesn't really help (100 iterations only gets 16-17 D.P.), so i'm wondering how much the accuracy of the atan tables affects the output, and that means using bigfloat.

I was able to translate the linked myhdl code to nmigen and implement both a state machine based and pipelined cordic here: https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/cordic;h=3f06900baa42651410d661739fb616e4b3020c52;hb=77f150cff440bed025e2487b6f0fcda9f529290b You can configure the width, as well as the number of cordic rounds per pipeline stage. There are some accuracy issues with it though - it exactly matches the behavior of the myhdl one, but the results it gives can be off by 1% or so from that obtained using sin() or cos() on 16 bit numbers.

(In reply to Michael Nolan from comment #3)
> I was able to translate the linked myhdl code to nmigen and implement both a
> state machine based and pipelined cordic here:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/cordic;h=3f06900baa42651410d661739fb616e4b3020c52;hb=77f150cff440bed025e2487b6f0fcda9f529290b

cool! well that was easy :)

> You can configure the width, as well as the number of cordic rounds per
> pipeline stage. There are some accuracy issues with it though - it exactly
> matches the behavior of the myhdl one, but the results it gives can be off
> by 1% or so from that obtained using sin() or cos() on 16 bit numbers.

yes. i believe it's well-known that to get 16-bit accuracy you need 32-bit registers.

i made a couple of corrections btw - frickin well done for working out the pipeline API with zero assistance - that's deeply impressive given that, other than the docstring comments, there's no "documentation" on it.

CordicInitialData needed to "yield self.z0" (not self.z). "yield from" is reserved for yielding *from* something that is itself iterable (such as another function). self.z0 is in fact iterable, so what would be returned by "yield from" would be first self.z0[0], then on the next iteration self.z0[1] and so on, which would be very painful. "yield" just returns "that object".

however... we have a bit of a problem back in nmutil.iocontrol NextControl and PrevControl iteration, in that it can't actually properly work out if the *object* that you are iterating should be yielded or yielded *from*. so i added a hack: "check if data has a ports function". however i've just right now worked out that we could detect if the object is iterable with isinstance(self.data_o, Iterable). so that's just been committed.

i also added an rtlil test because i was fascinated to see what the graphviz looks like.
--- a/src/ieee754/cordic/pipe_data.py
+++ b/src/ieee754/cordic/pipe_data.py
@@ -10,11 +10,14 @@ class CordicInitialData:
         self.z0 = Signal(range(-ZMAX, ZMAX), name="z")  # denormed result

     def __iter__(self):
-        yield from self.z
+        yield self.z0

     def eq(self, i):
         return [self.z0.eq(i.z0)]

 class CordicData:
@@ -27,9 +30,12 @@ class CordicData:
         self.z = Signal(range(-ZMAX, ZMAX), name="z")  # denormed result

     def __iter__(self):
-        yield from self.x
-        yield from self.y
-        yield from self.z
+        yield self.x
+        yield self.y
+        yield self.z

     def eq(self, i):
         ret = [self.z.eq(i.z), self.x.eq(i.x), self.y.eq(i.y)]

diff --git a/src/ieee754/cordic/test/test_pipe.py b/src/ieee754/cordic/test/test_pipe.py
index 880351a..b7eaf8a 100644
--- a/src/ieee754/cordic/test/test_pipe.py
+++ b/src/ieee754/cordic/test/test_pipe.py
@@ -1,6 +1,7 @@
 from nmigen import Module, Signal
 from nmigen.back.pysim import Simulator, Passive
 from nmigen.test.utils import FHDLTestCase
+from nmigen.cli import rtlil
 from ieee754.cordic.sin_cos_pipeline import CordicBasePipe
 from ieee754.cordic.pipe_data import CordicPipeSpec
@@ -16,6 +17,13 @@ class SinCosTestCase(FHDLTestCase):
         pspec = CordicPipeSpec(fracbits=fracbits, rounds_per_stage=4)
         m.submodules.dut = dut = CordicBasePipe(pspec)
+        for port in dut.ports():
+            print("port", port)
+
+        vl = rtlil.convert(dut, ports=dut.ports())
+        with open("test_cordic_pipe_sin_cos.il", "w") as f:
+            f.write(vl)
+
         z = Signal(dut.p.data_i.z0.shape())
         z_valid = Signal()
         ready = Signal()

ok so would you like to do the next step? which is, yep you guessed it: IEEE754 FP sin and cos :)

i think... from doing some reading (last year) i have a vague recollection that you have to be careful with the smaller ranges at one end, thus only using a 45 degree range (+/-) and subtracting and adding accordingly in order to compute an intermediary result, followed by adjustment (add/subtract to/from fractions of pi). if you naively try to do 0-90 then at one end (i forget which), because sin/cos is flat there - something to do with the atan2 tables getting very small - you end up with inaccuracies creeping in and stacking up rapidly.

(In reply to Luke Kenneth Casson Leighton from comment #5)
> ok so would you like to do the next step? which is, yep you guessed it:
> IEEE754 FP sin and cos :)

I'm not really sure how to do this though. I know you did a similar thing on the divider - did you handle it as a float the entire way through or convert it to integer... somehow? Doing it as a float means that the adders/subtractors need to be full floating point adders, like those in fpadd/, right?

For converting the inputs to integers, it seems like there would be issues as the input angle approached 0 (because floats give more resolution there). Cole emailed me a paper (which I linked to on the resources page) that contained a hybrid cordic that switched to a taylor series approximation when the input became small enough. This seems like it might work reasonably well for handling the small input case. Thoughts?

(In reply to Michael Nolan from comment #6)
> I'm not really sure how to do this though. I know you did a similar thing on
> the divider, did you handle it as a float the entire way through or convert
> it to integer... somehow?

integer, and kept things at the same exponent. i.e. we:

* adjusted the relative range of both numbers a and b so that they had the
  same exponent (-1 from exp, shift mantissa until exp(a)==exp(b))
* kept enough bits (3x the IEEE754 length in some places) at all times so
  that there would be no loss of precision
* from that point on treated the mantissa as an integer, *completely
  ignoring* the exponent, until
* finally at the end "re-normalising" the result.

> Doing it as a float means that the adders/subtractors need to be full
> floating point adders, like those in fpadd/, right?

actually fpadd simply does the same trick above as well. here's the original jon dawson verilog code that we started from:
https://github.com/dawsonjon/fpu/blob/master/adder/adder.v#L174

you can see there's no "actual" FPadd: it's literally "line up the exponents, do an integer add, then adjust/normalise the result".

> For converting the inputs to integers, it seems like there would be issues
> as the input angle approached 0 (because floats give more resolution there).

it's ok to have double (or even triple) the number of bits being used for the fixed-integer computation. so where the mantissa for FP32 is 23 bits, it's ok to use 23*3 = 69 bits (plus a few more for rounding).

> Cole emailed me a paper (which I linked to in the resources page) that
> contained a hybrid cordic that switched to a taylor series approximation
> when the input became small enough. This seems like it might work reasonably
> well for handling the small input case.
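the four steps above can be sketched as a toy reference model (hypothetical code, not the actual fpadd classes; an FP32-style 24-bit mantissa with positive values only is assumed):

```python
def align_and_add(exp_a, mant_a, exp_b, mant_b):
    """Add two positive FP numbers given as (exponent, 24-bit mantissa)."""
    # step 1: line up the exponents by shifting the smaller number's mantissa
    if exp_a > exp_b:
        mant_b >>= (exp_a - exp_b)
        exp = exp_a
    else:
        mant_a >>= (exp_b - exp_a)
        exp = exp_b
    # step 2: from here the mantissas are plain integers: just add them
    mant = mant_a + mant_b
    # step 3: re-normalise: shift until the leading 1 is back at bit 23
    while mant >= (1 << 24):
        mant >>= 1
        exp += 1
    return exp, mant

# 12.0 (1.5 * 2^3) + 2.0 (1.0 * 2^1), mantissas as 1.xx in 24 bits
exp, mant = align_and_add(3, 0xC00000, 1, 0x800000)
```

a real implementation keeps guard bits before the right-shift so that nothing is lost, which is exactly the "3x the IEEE754 length" point above.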
unfortunately, as we will (later) be using the CORDIC block for other purposes - not just SIN/COS but as an actual opcode - if it doesn't do "actual CORDIC" we're stuffed, there.

i suspect that it might be possible, as the size of the ATAN2 constants gets smaller, to "shift" the result up slightly. this we can do - test out - right now, long before trying to move to FP. each CordicStage can take an extra argument which specifies that dx, dy and dz should be shifted by N extra bits. the transfer of input to output needs to take into consideration that the next stage is going to be shifted by N. in this way, smaller numbers don't end up losing quite so many bits.

the only thing we have to make sure is that, throughout the whole pipeline, there are enough extra bits such that the shifts do *not* end up hitting the ceiling of the maximum range of x (or y).

ah. one other thing. i don't really trust the standard c-library atan to be accurate enough. i think we should use python bigfloat, here. it can produce arbitrary-precision floating-point. that way we can ensure that for FP64 CORDIC (64-bit sin/cos), by using 128-bit (or greater) atan2 tables, we get the full accuracy needed to get the right answers for IEEE754 sin/cos FP64. if we try to use the "standard" math atan2 library i guarantee we will fail to get sufficient accuracy at the limits of the range.
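bigfloat (wrapping mpfr) is the recommended route. purely to illustrate that the table really can be generated to arbitrary precision, here is a stdlib-only sketch using exact rational arithmetic (the function name and term count are illustrative assumptions, not project code):

```python
from fractions import Fraction
import math

def atan_frac(x, terms=60):
    """arctan(x) for |x| <= 1/2, via the Taylor series, in exact rationals.
    Converges fast for the CORDIC table arguments atan(2^-i), i >= 1."""
    x = Fraction(x)
    x2 = x * x
    term, total = x, Fraction(0)
    for n in range(terms):
        # Taylor series: x - x^3/3 + x^5/5 - ...
        total += term * (-1) ** n / (2 * n + 1)
        term *= x2
    return total

# table entries exact to far more bits than a C-library double can hold
table = [atan_frac(Fraction(1, 2 ** i)) for i in range(1, 8)]
```

the exact rationals can then be rounded once, at the very end, to whatever fixed-point table width the hardware needs, instead of inheriting the c-library's double-precision rounding.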

cole wrote (on the list): http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/005993.html

Here's the link to the paper, which you've linked to on the resources page. I just thought I'd include it in this bug/mailing list for ease of access: "Low Latency and Low Error Floating-Point Sine/Cosine Function Based TCORDIC Algorithm" (https://ieeexplore.ieee.org/document/7784797). For project members that do not have IEEE Xplore access, please email Michael or myself off-list.

I also think that this paper may be relevant, however I am not entirely sure: "Pipelined Range Reduction Based Truncated Multiplier" (https://ieeexplore.ieee.org/document/9062697).

(In reply to Luke Kenneth Casson Leighton from comment #9)
> "Pipelined Range Reduction Based Truncated Multiplier"
> (https://ieeexplore.ieee.org/document/9062697).

appreciated. things like this are important to look at because we can always learn something, even if it's "nice, however we're not going to do it that way".

(In reply to Luke Kenneth Casson Leighton from comment #7)
> integer, and kept things at the same exponent. i.e. we:
>
> * adjusted the relative range of both numbers a and b so that they
>   had the same exponent (-1 from exp, shift mantissa until exp(a)==exp(b)
> * keeping enough bits (3x the IEEE754 length in some places) at all times
>   so that there will be no loss of precision
> * from that point on treated the mantissa as an integer, *completely
>   ignoring* the exponent until
> * finally at the end "re-normalising" the result.

Ah ok, this makes more sense.

> unfortunately, as we will (later) be using the CORDIC block for other
> purposes, not just SIN/COS but as an actual opcode, if it doesn't do
> "actual CORDIC" we're stuffed, there.

Ah I see. So at some point we'll be wanting to use this for general rotations, hyperbolic sin/cos, and such?

> i suspect that it might be possible, as the size of the ATAN2 constants
> gets smaller, to "shift" the result up slightly.
>
> this we can do - test out - right now long before trying to move to FP.
> each CordicStage can take an extra argument which specifies that dx, dy and
> dz should be shifted by N extra bits.

I think I sort of understand - you want to scale up dx, dy (and presumably x and y) when they're small so that they have more bits of precision.

One thing that occurred to me - since power (and I assume riscv) don't include any sin/cos opcodes but instead use a function in libc to calculate them, the cordic would only be used by the GPU, right? If that's the case, do sin/cos necessarily need to be as accurate as the result from libc?

(In reply to Michael Nolan from comment #11)
> One thing that occurred to me - since power (and I assume riscv) don't
> include any sin/cos opcodes but instead use a function in libc to calculate
> them, the cordic would only be used by the GPU, right? If that's the case,
> do sin/cos necessarily need to be as accurate as the result from libc?

Michael, here are some possibly relevant links to FPU accuracy... or rather inaccuracy. I've also added them to the resources page under "5.4 Past FPU Mistakes to learn from".

Intel Underestimates Error Bounds by 1.3 quintillion
Posted on October 9, 2014 to Random ASCII - tech blog of Bruce Dawson
https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/

Intel overstates FPU accuracy
06/01/2013
http://notabs.org/fpuaccuracy/

Cole

(In reply to Michael Nolan from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #7)
> > * from that point on treated the mantissa as an integer, *completely
> > ignoring* the exponent until
> > * finally at the end "re-normalising" the result.
>
> Ah ok, this makes more sense.

FP is odd: it can only be done in fixed-point. however it is actually relatively straightforward. the only real tricky bits are that the rounding rules and interactions can be seriously obtuse.

> Ah I see. So at some point we'll be wanting to use this for general
> rotations, hyperbolic sin/cos, and such?

yes. a general-purpose CORDIC engine with all possible modes (i think there are at least 3: linear, rotate and hyperbolic), and an instruction that can access aaall those modes...

...yet for sin, cos, log etc use that exact same engine by setting some of those parameters manually.

> I think I sort of understand - you want to scale up dx, dy (and presumably x
> and y) when they're small so that they have more bits of precision.

correct. however we cannot lose sight of how *many* shifts occurred (in each stage and cumulatively) because each shift would be a doubling (halving?) of the ultimate answer.

i honestly do not know if it is a good idea to have dynamic shifting or not. it would mean testing both x and y, but the problem is that x might get small whilst y gets large. keeping track of whether shifting occurs - it is actually an exponent!
you double the mantissa (shift x/y up), however you need to *subtract* one from the exponent when doing so. and because x can get large while y gets small, and vice versa, you would now need *two* exponents. worse than that, you need 2 copies of z, because one z needs shifting down and the other up. too complicated.

hm. i'm therefore tempted to suggest dropping one of x or y, and setting a mode to decide whether to compute a sin or a cos result (but not both). see how it goes.

> One thing that occurred to me - since power (and I assume riscv) don't
> include any sin/cos opcodes but instead use a function in libc to calculate
> them, the cordic would only be used by the GPU, right?

well if we propose the opcodes for fpsin and fpcos and they are accepted by the OpenPOWER Foundation for a future version of POWER, then yes: some hardware would call the opcodes whilst other hardware would emulate them using libc. or libm.

> If that's the case, do sin/cos necessarily need to be as accurate as the
> result from libc?

we need to be Vulkan compliant, however i think as long as it does not use massive amounts of power we can do IEEE754 compliance. if you mean testing *against* libc (libm), i honestly do not know if libm is fully IEEE754 compliant, and that's why i suggested testing against python bigfloat for now. later we will definitely need to use jacob's algorithmic library, to make sure we have full compliance at the bit level.

btw that "close to zero" is i think what was meant about altering the range and adding fractions of pi to the result. sin(x + 90) equals cos(x), therefore in the ranges where either gets close to zero you use the OTHER version.

normally, for implementing sin, you specifically want to switch to inputs close to zero rather than away from it, since that retains much higher accuracy: for small x, sin(x) ≈ x (≈ is the approximately-equal operator).
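a quick numerical check of this point (plain Python, illustrative only): the relative error of the approximation shrinks as roughly x**2/6, so inputs near zero lose almost nothing.

```python
import math

# relative error of sin(x) ~= x for progressively smaller inputs
for x in (1e-1, 1e-3, 1e-5):
    rel_err = abs(math.sin(x) - x) / math.sin(x)
    print(f"x={x:g}  relative error: {rel_err:.3g}")
```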

(In reply to Jacob Lifshay from comment #15)
> normally, for implementing sin, you specifically want to try to switch to
> inputs close to zero rather than away, since that allows retaining much
> higher accuracy, since for small x: sin(x) ≈ x

interesting. the question is how we will be able to match that up with CORDIC.

I'm working on converting the floating point input to the fixed point value needed by the cordic. I've got my module decoding the float into sign, exponent, and mantissa with FPNumDecode, and the mantissa is missing the (implied) leading 1 bit. I'm also looking at the modules of fpdiv (div0.py and nmigen_div_experiment.py), and they don't seem to add that bit back in. Is there a way to have FPNumDecode generate a mantissa with the leading 1 bit?

(In reply to Michael Nolan from comment #17)
> Is there a way to have FPNumDecode generate a mantissa with the leading 1
> bit?

I think that's done in ieee754.fpcommon.denorm:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD

(In reply to Jacob Lifshay from comment #18)
> (In reply to Michael Nolan from comment #17)
> > Is there a way to have FPNumDecode generate a mantissa with the leading 1
> > bit?
>
> I think that's done in ieee754.fpcommon.denorm:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD

line 41, yes.

ok. i only sort-of "understood" IEEE754 FP retrospectively, through long exposure, having started initially from jon dawson's "working" verilog code and making absolutely damn sure, throughout the entire six months of morphing it, that it passed the unit tests *at all times*. this made for a bit of a mess at times. i therefore substituted "unit tests pass 100%" for "understanding what the hell is really going on".

line 41:
https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD#l41

here you can see that the top bit of a (and later b) is, yes, set to 1 if and only if the exponent is not equal to -126 [for FP32].

back-track a little: the data structure FPSCData has one *extra* bit already allocated at the top (m[-1]) which is *specifically* there to allow that extra implicit mantissa "1" to be inserted *if it is needed*.

the reason why it is sometimes not needed is this: when the exponent is "-126" (on FP32), this is a "ridiculously small number" but it is not the limit of the range of IEEE754. what they allowed was: when exp=-126, you can express smaller and smaller numbers by simply having fewer bits in the *mantissa*. obviously, those numbers become less and less accurate, however this is an acceptable tradeoff. but when you do that, you have to go from "implicit 1 in the top digit" to "explicit 1 in the mantissa itself".

thus, we have that Mux detecting the transition condition between "implicit 1" and "really small numbers with the explicit 1 *ALREADY* in the mantissa". the whole process is called "de-normalisation".

so basically, you can just "trust" that after the number has gone through FPAddDeNorm, it *will* have the mantissa and exponent correctly set, such that you can treat that "expanded mantissa" as just "A Fixed Integer".

btw i don't recommend trying to multiply or divide either the input or the output by pi inside the FP CORDIC code. we can use micro-coding for that (and in the unit tests do the divide and/or multiply using import math or import bigfloat).

so let us assume that we're implementing cospi and sinpi, here, *NOT* actual cos or actual sin, ok?
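what that Mux is doing can be written out as a toy reference model (illustrative Python, not the actual nmigen code in denorm.py):

```python
import struct

def denormalise_fp32(bits):
    """Split an FP32 bit-pattern into (sign, exponent, explicit-1 mantissa)."""
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp_field == 0:
        # exp is -126 and the leading 1 (if any) is *already* in the
        # mantissa: this is the "denormal" case, no implicit 1 to insert
        return sign, -126, frac
    # normal number: set the extra top bit (m[-1]) to the implicit 1
    return sign, exp_field - 127, (1 << 23) | frac

# 1.5 is 1.1 in binary: the mantissa comes back with the explicit leading 1
sign, exp, mant = denormalise_fp32(struct.unpack("<I", struct.pack("<f", 1.5))[0])
```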

(In reply to Luke Kenneth Casson Leighton from comment #19)
> so let us assume that we're implementing cospi and sinpi, here, *NOT*
> actual cos or actual sin, ok?

That will allow us to have a much more accurate cospi and sinpi, because we don't need multiple conversions. Also, since cospi and sinpi have a period of 2, which is a power of 2, it is super easy to do range reduction -- just some bit masking.
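to spell out the bit-masking point (an illustrative fixed-point sketch; FRACBITS and the helper name are assumptions, not project code): with the input expressed in units of pi, the period mask and the quadrant are just bit fields.

```python
FRACBITS = 16   # illustrative fixed-point format: x is in units of pi

def reduce_sinpi(x_fixed):
    """Range-reduce sinpi: returns (negate, use_cos, offset), where offset
    is a fixed-point value in [0, 0.5) half-periods of pi."""
    x_fixed &= (2 << FRACBITS) - 1        # sinpi(x + 2) == sinpi(x): just mask
    quadrant = x_fixed >> (FRACBITS - 1)  # top two bits: quarter-period 0..3
    offset = x_fixed & ((1 << (FRACBITS - 1)) - 1)
    negate = quadrant >= 2                # quadrants 2 and 3: sign flips
    use_cos = (quadrant & 1) == 1         # odd quadrants: sinpi -> cospi
    return negate, use_cos, offset
```

e.g. sinpi(1.25) reduces to -sinpi(0.25), with no multiplies or divides anywhere.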

(In reply to Jacob Lifshay from comment #18)
> I think that's done in ieee754.fpcommon.denorm:
> https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/denorm.py;h=350f413ede9858ae490059f6709b3454ea8a7f1e;hb=HEAD

Ah, thank you!

michael you may find this extremely informative to experiment with: http://weitz.de/ieee/

argh, michael, python3-gmpy2 is not properly packaged on debian. i can't even *install* python3-gmpy2 for python3.7!

    apt-get install python3-gmpy2=2.1.0~a4-1
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    Some packages could not be installed. This may mean that you have
    requested an impossible situation or if you are using the unstable
    distribution that some required packages have not yet been created
    or been moved out of Incoming. The following information may help to
    resolve the situation:

    The following packages have unmet dependencies:
     python3-gmpy2 : Depends: python3 (< 3.8) but 3.8.2-3 is to be installed
    E: Unable to correct problems, you have held broken packages.

https://pypi.org/project/bigfloat/

i used python bigfloat about... 8 years ago, and just updated it: it works well. it also happens to work in a similar way to python-gmpy2, by cython-wrapping mpfr... except that the debian package maintainer for python-gmpy2 is not experienced enough to do multi-python-version compiles.

python-bigfloat i specifically recommended because it has been "wrapped" with additional functionality that makes it near-identical to both the standard python "math" module *and* the standard python "float".

    from bigfloat import cos, sin, tan

therefore works.

I can add sin, cos, tan, etc. support to simple-soft-float.

(In reply to Luke Kenneth Casson Leighton from comment #24)
> https://pypi.org/project/bigfloat/
>
> i used python bigfloat about... 8 years ago, and just updated it:
> it works well.

Ok, I reverted the commit containing the gmpy stuff and added a similar one calculating the atan tables with bigfloat.

(In reply to Michael Nolan from comment #26)
> Ok, I reverted the commit containing the gmpy stuff and added a similar one
> calculating the atan tables with bigfloat

with the debian package being a mess, that's probably the sensible option for now.

there's a trick btw, i think it's:

    with bf.precision(300):  # use 300 bits of precision
        x = bf.atan(BigFloat(2) ....)

which is why i recommended bigfloat: you can specify extremely large FP precision.

(In reply to Michael Nolan from comment #26) > Ok, I reverted the commit containing the gmpy stuff and added a similar one > calculating the atan tables with bigfloat Please also add the libgmp-dev package (or whatever the correct name is) to .gitlab-ci.yml

(In reply to Luke Kenneth Casson Leighton from comment #27)
> which is why i recommended bigfloat: you can specify extremely large
> FP precision.

Yup. I went with quad precision for now.

(In reply to Jacob Lifshay from comment #28)
> Please also add the libgmp-dev package (or whatever the correct name is) to
> .gitlab-ci.yml

Done.

it's sorted now. however i just found this: python3-mpmath. it's a pure-python arbitrary-precision FP math library. it doesn't say whether it's IEEE754 compliant, however.

(In reply to Luke Kenneth Casson Leighton from comment #30)
> it's sorted now however i just found this: python3-mpmath
> it's a pure-python-only arbitrary-precision FP math library. it doesn't
> say whether it's IEEE754 compliant however.

ieee754 compliant basically means the results are rounded properly to an ieee754 binary/decimal format. mpmath is instead designed to give as many digits of accuracy as you want: it emulates mathematical real/complex numbers rather than ieee754 arithmetic.

(In reply to Michael Nolan from comment #29) > (In reply to Jacob Lifshay from comment #28) > > Please also add the libgmp-dev package (or whatever the correct name is) to > > .gitlab-ci.yml > > Done. You used tabs instead of spaces, which broke the yaml parser, please fix.

rats. i just realised the dynamic shifting idea isn't going to work. the problem is that the atan table value may be small whilst x/y are large, *or vice versa*. so this needs large shifters and complex logic; i don't think it's worth it.

this may also help to understand IEEE754 FP https://imgs.xkcd.com/comics/garbage_math.png

So I think I got the floating point cordic working all the way through yesterday. It converts the floating point input to a fixed point 50 bit number, calculates the sine and cosine, then converts them back to floating point.

However, for the conversion back to floating point I had to write my own class, as none of the existing ones (used in fpadd for instance) support *two* outputs. I don't imagine we want to continue using my implementation, so what would be the best way of going about this?

I also needed to write a module for counting the number of leading zeros in the fixed point number. I don't imagine this would only be useful to the cordic; where would be the best place to put it? fpcommon?
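a plain-Python reference model of that leading-zero count (illustrative, not the actual module) is handy for unit-testing whichever home it ends up in:

```python
def clz(value, width):
    """Count leading zeros of an unsigned width-bit value (clz(0) == width)."""
    assert 0 <= value < (1 << width)
    count = 0
    for bit in reversed(range(width)):   # scan down from the MSB
        if value & (1 << bit):
            break
        count += 1
    return count
```

in renormalisation, clz of the fixed-point result tells you how far to shift the mantissa up and, equally, how much to subtract from the exponent.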

(In reply to Michael Nolan from comment #35)
> So I think I got the floating point cordic all the way working yesterday. It
> converts the floating point input to a fixed point 50 bit number, calculates
> the sine and cosine, then converts them back to floating point.

hurrah!

(just running the tests, you forgot to commit cordic/renormalize.py. for future reference, use "git status" and you should spot missing files fairly easily. that's if we're properly maintaining ".gitignore")

> However, for the conversion back to floating point I had to write my own
> class to do so, as none of the existing ones (used in fpadd for instance)
> support *two* outputs.

yeah, keenly aware of that.

> I don't imagine we want to continue using my
> implementation, so what would be the best way of going about this?

hmmm hmmm. well, we will actually need double-output functions at some point (and more): NORMALISE, DOTPRODUCT - they're all multi-output.

however for the plain "SIN" and "COS" operations - and others that will use "General CORDIC" - we simply *drop* the unused output on the floor.

if however we do macro-op fusion, spotting two (commonly-used) operations together, the two results *will* be needed.

> I also needed to write a module for counting the number of leading zeros in
> the fixed point number.

ah goood. actually i think... somewhere... very ineptly-named, there is a class that does this. probably "MultiShiftRMerge". don't worry about it :)

> I don't imagine this would only be useful to the
> cordic; where would be the best place to put it? fpcommon?

on one hand i'd be inclined to suggest nmutil (because CLZ is pretty useful, generally); on the other, it's a basic "ALU" operation. hmm hmm... the decision would be based, then, on whether CLZ (count leading zeros) has uses *outside* of an ALU, and my feeling is it probably does.

actually, just looking at it, i really like it! it reminds me of the reference that Jacob put... what was it... "Common Task something"?

ah excellent! you did a FSM-based version as well, that's superb. if the pipelined version turns out to be insanely gate-hungry, or just not that important, we can use it. also as a "general-purpose library" it's extremely useful to have because other developers might not want a full pipeline version.

(In reply to Luke Kenneth Casson Leighton from comment #36)

> (just running the tests, you forgot to commit cordic/renormalize.py.
> for future reference, use "git status" and you should spot missing
> files, fairly easily. that's if we're properly maintaining ".gitignore")

Oops, fixed!

> > However, for the conversion back to floating point I had to write my own
> > class to do so, as none of the existing ones (used in fpadd for instance)
> > support *two* outputs.
>
> yeah, keenly aware of that.
>
> > I don't imagine we want to continue using my
> > implementation, so what would be the best way of going about this?
>
> hmmm hmmm well we will actually need double-output functions at some
> point (and more): NORMALISE, DOTPRODUCT, they're all multi-output.
>
> however for the plain "SIN" and "COS" operations - and others that
> will use "General CORDIC" - we simply *drop* the output on the floor
> if it's unused.
>
> if however we do macro-op fusion, spotting two (commonly-used)
> operations together, the two results *will* be needed.

Ah ok, so maybe I should add an input that selects one or the other? (kinda sucks that we need to calculate both for the cordic to work, oh well)

(In reply to Luke Kenneth Casson Leighton from comment #37)

> ah excellent! you did a FSM-based version as well, that's superb.
> if the pipelined version turns out to be insanely gate-hungry, or
> just not that important, we can use it.

Unfortunately, that one isn't as complete as the pipelined one. It doesn't handle denormals or 0 well because I had to roll my own denormalization stage instead of using the existing one from the multiplier. And it doesn't (yet) support conversion back into floating point at the end.

> also as a "general-purpose library" it's extremely useful to have
> because other developers might not want a full pipeline version.

Makes sense.

(In reply to Luke Kenneth Casson Leighton from comment #37)

> ah excellent! you did a FSM-based version as well, that's superb.
> if the pipelined version turns out to be insanely gate-hungry, or
> just not that important, we can use it.

Ooof, yeah. Running synthesis on the 32-bit pipelined version (4 cordic rounds per pipeline stage) gives somewhere around 77k gates.

Luke, your latest commit (74af221) breaks the tests for the pipelined fpsin:

    ======================================================================
    ERROR: test_rand (__main__.SinCosTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_fp_pipe.py", line 80, in test_rand
        self.run_test(iter(inputs), outputs=iter(outputs))
      File "test_fp_pipe.py", line 21, in run_test
        vl = rtlil.convert(dut, ports=dut.ports())
      File "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/back/rtlil.py",
           line 1017, in convert
        fragment = ir.Fragment.get(elaboratable, platform).prepare(**kwargs)
      File "/home/mnolan/.local/lib/python3.8/site-packages/nmigen/hdl/ir.py",
           line 540, in prepare
        raise TypeError("Only signals may be added as ports, not {!r}"
    TypeError: Only signals may be added as ports, not (slice (sig x) 0:1)

It looks like dut.ports() is returning a bunch of slices for the output signals and I'm not sure why:

    [(sig p_valid_i), (sig p_ready_o), (sig a), (sig muxid), (sig op),
     (sig n_ready_i), (sig n_valid_o),
     (slice (sig x) 0:1), (slice (sig x) 1:2), (slice (sig x) 2:3),
     (slice (sig x) 3:4), (slice (sig x) 4:5), (slice (sig x) 5:6),
     (slice (sig x) 6:7), (slice (sig x) 7:8), (slice (sig x) 8:9),
     (slice (sig x) 9:10), (slice (sig x) 10:11), (slice (sig x) 11:12),
     (slice (sig x) 12:13), (slice (sig x) 13:14), (slice (sig x) 14:15),
     (slice (sig x) 15:16), (slice (sig x) 16:17), (slice (sig x) 17:18),
     (slice (sig x) 18:19), (slice (sig x) 19:20), (slice (sig x) 20:21),
     (slice (sig x) 21:22), (slice (sig x) 22:23), (slice (sig x) 23:24),
     (slice (sig x) 24:25), (slice (sig x) 25:26), (slice (sig x) 26:27),
     (slice (sig x) 27:28), (slice (sig x) 28:29), (slice (sig x) 29:30),
     (slice (sig x) 30:31), (slice (sig x) 31:32),
     (slice (sig y) 0:1), (slice (sig y) 1:2), (slice (sig y) 2:3),
     (slice (sig y) 3:4), (slice (sig y) 4:5), (slice (sig y) 5:6),
     (slice (sig y) 6:7), (slice (sig y) 7:8), (slice (sig y) 8:9),
     (slice (sig y) 9:10), (slice (sig y) 10:11), (slice (sig y) 11:12),
     (slice (sig y) 12:13), (slice (sig y) 13:14), (slice (sig y) 14:15),
     (slice (sig y) 15:16), (slice (sig y) 16:17), (slice (sig y) 17:18),
     (slice (sig y) 18:19), (slice (sig y) 19:20), (slice (sig y) 20:21),
     (slice (sig y) 21:22), (slice (sig y) 22:23), (slice (sig y) 23:24),
     (slice (sig y) 24:25), (slice (sig y) 25:26), (slice (sig y) 26:27),
     (slice (sig y) 27:28), (slice (sig y) 28:29), (slice (sig y) 29:30),
     (slice (sig y) 30:31), (slice (sig y) 31:32),
     (sig muxid), (sig op)]

(In reply to Michael Nolan from comment #38)

> (In reply to Luke Kenneth Casson Leighton from comment #36)
> > (just running the tests, you forgot to commit cordic/renormalize.py.
> > for future reference, use "git status" and you should spot missing
> > files, fairly easily. that's if we're properly maintaining ".gitignore")
>
> Oops, fixed!

been there :)

> > if however we do macro-op fusion, spotting two (commonly-used)
> > operations together, the two results *will* be needed.
>
> Ah ok, so maybe I should add an input that selects one or the other?

yes. an "Op". this is what the "Op" field is for... see... ahh.. 1sec... DivPipeCoreConfig and DivPipeCoreOperation (etc.)

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/core.py;h=fea59dce0c4f73d652985faa8181d5af97d5877c;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l54

which is set up as the "context" (ctx) here:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/div_pipe.py;h=d7b81729499b5d763eb1d150b031dceeefecbda9;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l42

and it is of type "FPPipeContext", declared here:

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpcommon/getop.py;h=117d29223d23ab8176382e76b2dd129382735721;hb=dd3919a0df8e1190f2fe985b0448701f164b392a#l65

and if you look at lines 63 and 65 you can see that you can EITHER:

* specify just a bitwidth (in which case you get a Signal and it's your responsibility to manage conversion from the "op" into that Signal), OR
* specify an *actual class* as a parameter to FPPipeContext, and an instance *of* that class will be instantiated on your behalf.

it's actually really important to conform to this API because it's where the "cancellation", identifier information etc., critically necessary for the use of the pipeline in the scoreboard system, is all placed.

> (kinda sucks that we need to calculate both for the cordic to work, oh well)

:) divide is the same. it produces two results: DIV and REM. fascinatingly, CORDIC and DIV (and SQRT and RSQRT) all use the same core concept / algorithm, so it is no surprise that there are two results.

> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > ah excellent! you did a FSM-based version as well, that's superb.
> > if the pipelined version turns out to be insanely gate-hungry, or
> > just not that important, we can use it.
>
> Unfortunately, that one isn't as complete as the pipelined one. It doesn't
> handle denormals or 0 well because I had to roll my own denormalization
> stage instead of using the existing one from the multiplier. And it doesn't
> (yet) support conversion back into floating point at the end.

ok not a problem. it got you bootstrapped to where you needed to go. let's leave it for now, but do document that (briefly).

> > also as a "general-purpose library" it's extremely useful to have
> > because other developers might not want a full pipeline version.
>
> Makes sense.

can you do that move? the unit test as well.

(In reply to Michael Nolan from comment #40)

> Luke, your latest commit (74af221) breaks the tests for the pipelined fpsin:

yep i commented it out for now (and explained why!)

> TypeError: Only signals may be added as ports, not (slice (sig x) 0:1)
>
> It looks like dut.ports() is returning a bunch of slices for the output
> signals and I'm not sure why:

yep it means that in the data_i or data_o you've missed a "yield from" and used "yield" instead. or vice-versa.

"yield from x" would iterate *through x* and that tells nmigen "please return slices of single-bits of x".

i'll take a look

(In reply to Luke Kenneth Casson Leighton from comment #41)

> yes. an "Op". this is what the "Op" field is for... see... ahh.. 1sec...
> DivPipeCoreConfig and DivPipeCoreOperation (etc.)

Ok

> it's actually really important to conform to this API because it's where
> the "cancellation", identifier information etc. critically necessary for
> the use of the pipeline in the scoreboard system is all placed.

Oh hold on, I thought that cancellation bit was for the pipeline to produce a result early. Since the cordic doesn't do that, I left it out. It sounds like I need to add that back in huh?

> can you do that move? the unit test as well.

Sorry, you mentioned nmutil and never said which place you thought it should go. nmutil or fpcommon?

found it. the code was doing exactly as you'd asked: walking through self.x and self.y by iteration.

"yield from" will treat the object *as* an iterator, calling object.__iter__() and returning the objects it produces in sequence, as *MULTIPLE* results. "yield" will return the OBJECT - once and only once, as *THE* (one and only) result. if the Signal did not have an __iter__ function, you would have got an error.

    diff --git a/src/ieee754/cordic/pipe_data.py b/src/ieee754/cordic/pipe_data.py
    index c3f600a..d83dbe7 100644
    --- a/src/ieee754/cordic/pipe_data.py
    +++ b/src/ieee754/cordic/pipe_data.py
    @@ -28,8 +28,8 @@ class CordicOutputData:
             self.muxid = self.ctx.muxid

         def __iter__(self):
    -        yield from self.x
    -        yield from self.y
    +        yield self.x
    +        yield self.y
             yield from self.ctx

         def eq(self, i):
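the yield / yield-from distinction can be demonstrated without nmigen at all. here's a toy stand-in for Signal (all names hypothetical) showing why one form produces a single port and the other produces a pile of one-bit slices:

```python
class FakeSignal:
    """Toy stand-in for nmigen's Signal: iterating over a Signal
    yields one-bit slices, which is what dut.ports() was returning."""
    def __init__(self, name, width):
        self.name, self.width = name, width

    def __iter__(self):
        for i in range(self.width):
            yield "(slice (sig %s) %d:%d)" % (self.name, i, i + 1)

def iter_with_yield_from(sig):
    yield from sig   # treats sig *as* an iterator: MANY results

def iter_with_yield(sig):
    yield sig        # returns the object itself: ONE result

x = FakeSignal("x", 32)
slices = list(iter_with_yield_from(x))   # 32 one-bit slices
whole = list(iter_with_yield(x))         # the Signal itself, once
```

so in CordicOutputData.__iter__, "yield from self.x" handed rtlil.convert() 32 slices per output instead of the two whole Signals.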

(In reply to Luke Kenneth Casson Leighton from comment #42)

> yep it means that in the data_i or data_o you've missed a "yield from"
> and used "yield" instead. or vice-versa.
>
> "yield from x" would iterate *through x* and that tells nmigen "please
> return slices of single-bits of x".
>
> i'll take a look

Aha! pipe_data.py has the following:

    def __iter__(self):
        yield from self.x
        yield from self.y
        yield from self.ctx

I'll try changing those to yields and see if that works

(In reply to Michael Nolan from comment #43)

> Oh hold on, I thought that cancellation bit was for the pipeline to produce
> a result early. Since the cordic doesn't do that, I left it out. It sounds
> like I need to add that back in huh?

yyup :) all operations need to be "cancellable" by way of a globally-broadcast mask, equal in width to the number of Concurrent Computation Unit "front-ends". see https://libre-soc.org/3d_gpu/architecture/6600scoreboard/ - the "Concurrent Computation Unit" diagram (which is from Mitch's Scoreboard Mechanics, 2nd chapter).

so there will be... how many FP stages? 8? eek! 12! :) so we will need *TWELVE* Function Unit front-ends, one to manage and track every single result, and "re-match" that result *BACK* to the FU when it pops out.

now, we _could_ not have cancellation. we could instead do this:

* let the result come out and, when it arrives, throw it away. this means we have a Function Unit completely wasted, blocked and unusable for up to 12 cycles.
* create a complex "ID" system (at least double the number of FU front-ends): a counter wraps around, the ID is associated with "new" requests, and is cleared to zero if the result is not to be put into the result FU buffer. not a big fan of this method.

the simpler way is: have a unary "broadcast" mask that goes right the way through the whole pipeline (one bit per "FU") and we just... simply... pull that bit. result dies. FunctionUnit can rest. job done.

so yes, you'll need to put that back in. by conforming to the API, you should *NOT* have to worry about whether the data is cancelled or not, because the "MaskCancellable" Pipeline Mix-In class takes care of that *automatically*. but if FPPipeContext is *not used*... that is *not possible to do*.

> > can you do that move? the unit test as well.
>
> Sorry, you mentioned nmutil and never said which place you thought it should
> go. nmutil or fpcommon?

oh sorry. nmutil.
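as a conceptual sketch only (plain Python, names made up, nothing like the real MaskCancellable machinery), the unary broadcast-mask idea boils down to: one bit per Function Unit rides alongside each in-flight result, and pulling that bit kills the result before it is re-matched to its FU:

```python
def deliver_results(results, cancel_mask):
    """results: list of (fu_number, value) pairs emerging from the pipeline.
    cancel_mask: unary mask, one bit per Function Unit. A set bit means
    that FU's operation was cancelled mid-flight: its result is dropped
    on the floor instead of being handed back to the FU."""
    return [(fu, val) for fu, val in results
            if not (cancel_mask >> fu) & 1]

# FU 1's result gets cancelled mid-flight; FUs 0 and 2 complete normally
out = deliver_results([(0, "a"), (1, "b"), (2, "c")], cancel_mask=0b010)
```

the key property: cancellation costs one wire per FU through the pipeline, versus tying up a whole FU for the full pipeline depth.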

(In reply to Michael Nolan from comment #45)

> Aha! pipe_data.py has the following:
>
>     def __iter__(self):
>         yield from self.x
>         yield from self.y
>         yield from self.ctx
>
> I'll try changing those to yields and see if that works

done already. git pull. sorry! :)

(In reply to Luke Kenneth Casson Leighton from comment #46)

> oh sorry. nmutil.

Ok!

(In reply to Luke Kenneth Casson Leighton from comment #47)

> done already. git pull. sorry! :)

Yep, saw that after I posted my comment :P

(In reply to Michael Nolan from comment #39)

> (In reply to Luke Kenneth Casson Leighton from comment #37)
> > ah excellent! you did a FSM-based version as well, that's superb.
> > if the pipelined version turns out to be insanely gate-hungry, or
> > just not that important, we can use it.
>
> Ooof, yeah. Running synthesis on the 32 bit pipelined version (4 cordic
> rounds per pipeline stage) gives somewhere around 77k gates.

uh-huhn :) so you see why i thought it might be a teeeny tiny bit of a good idea to make it as general-purpose as possible, used for a batch of operations, not just SIN/COS? and why most GPUs have a massive concerted effort to shrink transcendentals as much as possible.

Mitch's patents (bought by Samsung) actually use... what is it... cubic equations, which is extremely efficient and, more than that, accurate to better than 1.0 ULP (unit in last place) for the majority of FP32 operations. he gets about a quarter of the gate count of other SIN/COS hard macros.

if the gate count is too massive for FP64, we may *have* to use the FSM-based version for FP64. the priority as a GPU is on FP32.

btw 12 stages is too many, because we need a minimum of one FunctionUnit "managing" (matching up) source operands and results, and each extra Function Unit added creates an Order N^2 increase in the overall size of the Dependency Matrices.

so - and i don't mean right now - if this can be either:

* cut into two separate pipelines (partial data fed back as a micro-op), or
* the number of combinatorial blocks per stage increased to 5 or 6,

we stand a chance of keeping the FU count down (8 is ok). have to think that through carefully, how to do micro-ops.

the other option: keep the long pipeline lengths: we *know* that there's more stages than there are FUs, and that's just "tough luck". we know that with only 8 FUs and 12 pipeline stages, 4 of those at any one time will run empty, and we just... live with that.

(In reply to Luke Kenneth Casson Leighton from comment #50)

> btw 12 stages is too many, because we need a minimum of one FunctionUnit
> "managing" (matching up) source operands and results.
>
> each extra Function Unit added creates an Order N^2 increase in the overall
> size of the Dependency Matrices.
>
> so - and i don't mean right now - if this can be either:
>
> * cut into two separate pipelines (partial data fed back as a micro-op)
> * number of combinatorial blocks increased to 5 or 6
>
> we stand a chance of keeping the FU count down (8 is ok).
>
> have to think that through carefully, how to do micro-ops.

Sure. I don't know about splitting it up into multiple uops, but reducing the number of stages is pretty easy. I added a parameter to the module called rounds_per_stage which governs how many cordic rounds go in each pipeline stage. Increasing that number will decrease the number of pipeline stages.

(In reply to Luke Kenneth Casson Leighton from comment #51)

> the other option:
>
> keep the long pipeline lengths: we *know* that there's more stages than
> there are FUs, and that's just "tough luck". we know that with only
> 8 FUs and 12 pipeline stages, 4 of those at any one time will run
> empty, and we just... live with that.

I was going to ask if we could do something like this. Since I don't think the cordic will be used *that* often, this seems like a reasonable thing to do.
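The rounds_per_stage trade-off is just ceiling division; a quick sanity check (the iteration counts below are illustrative, not the real pipeline's exact numbers):

```python
def n_stages(cordic_rounds, rounds_per_stage):
    """Number of pipeline stages needed to fit all CORDIC rounds:
    more combinatorial rounds per stage means fewer (but longer) stages."""
    return -(-cordic_rounds // rounds_per_stage)  # ceiling division

# e.g. with 48 rounds: 4 per stage gives the problematic 12 stages,
# while 6 per stage brings the count back down to the 8-FU sweet spot
# (at the cost of a longer combinatorial path, i.e. a lower clock)
```

This is exactly the knob being discussed: stage count trades against combinatorial depth, and the FU count (with its O(N^2) Dependency Matrix cost) is what makes fewer stages attractive.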

(In reply to Michael Nolan from comment #52)

> Sure. I don't know about splitting it up into multiple uops,

neither do i at the moment! i think i might have got confused about sharing the MUL unit with FPMUL.

> but reducing
> the number of stages is pretty easy. I added a parameter to the module
> called rounds_per_stage which governs how many cordic rounds go in each
> pipeline stage. Increasing that number will decrease the number of pipeline
> stages.

fantastic, that's exactly the kind of parameterisation _we_ need; moreover, as a general-purpose IEEE754 FP library, it's exactly the kind of parameterisation that other _users_ need :)

> I was going to ask if we could do something like this. Since I don't think
> the cordic will be used *that* often, this seems like a reasonable thing to
> do.

yehyeh.

ok what else can we throw in here? LOG1P is probably a good thing to try.

this is a great explanation: http://www.mclenegan.com/public/thesis.pdf

section 4.4.1 explains "rotation mode", however we also need *vector* mode (4.4.2). i know - somewhere - i've seen a python implementation of vector mode. it also specified, as you can see, the three other modes: circular, linear, hyperbolic.

ah! i know why we want to keep the two answers: see p39 of that thesis: you can use them (with some post-processing) to do tan. also, see circular-vectoring mode: it ends up computing the normal distance and... the angle? ohh, that's for... darn, what was it... it's for arcs. there's a name for this... sorry, i forget what it's called right now.

https://www.ijert.org/vhdl-implementation-of-cordic-algorithm-for-wireless-lan

this version, p14, explains much more clearly that in "rotate" mode it goes by angle and computes x,y, while in "vector" mode the vector is rotated to be flat against the x-axis, recording the angle (Z) needed to do so.

this one is a survey of CORDIC algorithms: http://www.ee.ic.ac.uk/pcheung/teaching/ee3_DSD/crdcsrvy.pdf

it shows - *briefly* - how to use them to do different things, but, by being a survey, it also has some really useful insights into various optimisations.

so can i suggest first adding the infrastructure back in to allow cancellation and operations (ctx), then seeing if you can add one extra "mode" (the m=-1, m=0, m=1 thing seems trivial), which will give the circular, linear and hyperbolic modes respectively. "circular" mode is covered with the output being SIN/COS.

ah ha!
here's a useful suite of implementations (annoyingly, in a Jupyter notebook):

https://github.com/suyashmahar/cordic-algorithm-python/blob/master/cordic_implementation.ipynb

extracting the two modes:

    circular = 1
    linear = 0
    hyperbolic = -1

    def ROM_lookup(iteration, coordinate):
        if (coordinate == circular):
            return math.degrees(math.atan(2**(-1*iteration)))
        elif (coordinate == linear):
            return 2**(-1*iteration)
        elif (coordinate == hyperbolic):
            return (math.atanh(2**(-1*iteration)))

    def rotation_mode(x, y, z, coordinate, iterations):
        a = 0.607252935   # = 1/K
        x_val_list = []
        y_val_list = []
        z_val_list = []
        iterations_list = []
        i = 0             # Keeps count on number of iterations
        current_x = x     # Value of X on ith iteration
        current_y = y     # Value of Y on ith iteration
        current_z = z     # Value of Z on ith iteration
        di = 0
        if (coordinate == hyperbolic):
            i = 1
        else:
            i = 0
        flag = 0
        if (iterations > 0):
            while (i < iterations):
                if (current_z < 0):
                    di = -1
                else:
                    di = +1
                next_z = current_z - di * ROM_lookup(i, coordinate)
                next_x = current_x - coordinate * di * current_y * (2**(-1*i))
                next_y = current_y + di * current_x * 2**(-1*i)

                current_x = next_x
                current_y = next_y
                current_z = next_z

                x_val_list.append(current_x)
                y_val_list.append(current_y)
                z_val_list.append(current_z)
                iterations_list.append(i)

                if (coordinate == hyperbolic):
                    if ((i != 4) & (i != 13) & (i != 40)):
                        i = i+1
                    elif (flag == 0):
                        flag = 1
                    elif (flag == 1):
                        flag = 0
                        i = i+1
                else:
                    i = i+1
        return {'x': x_val_list, 'y': y_val_list,
                'z': z_val_list, 'iteration': iterations_list}

ahhh, i recognise this code :) i've seen it before.

    def vector_mode(x, y, z, coordinate, iterations):
        a = 1.2075        # = 1/K
        x_val_list = []
        y_val_list = []
        z_val_list = []
        iterations_list = []
        i = 0             # Keeps count on number of iterations
        current_x = x     # Value of X on ith iteration
        current_y = y     # Value of Y on ith iteration
        current_z = z     # Value of Z on ith iteration
        di = 0

        # This is necessary since the result for i=0 doesn't exist for
        # the hyperbolic co-ordinate system.
        if (coordinate == hyperbolic):
            i = 1
        else:
            i = 0
        flag = 0
        if (iterations > 0):
            while (i < iterations):
                di = -1*math.copysign(1, current_y)  # *current_x)
                next_x = current_x - coordinate * di * current_y * (2**(-1*i))
                next_y = current_y + di * current_x * 2**(-1*i)
                next_z = current_z - di * ROM_lookup(i, coordinate)

                current_x = next_x
                current_y = next_y
                current_z = next_z

                x_val_list.append(current_x)
                y_val_list.append(current_y)
                z_val_list.append(current_z)
                iterations_list.append(i)

                if (coordinate == hyperbolic):
                    if ((i != 4) & (i != 13) & (i != 40)):
                        i = i+1
                    elif (flag == 0):
                        flag = 1
                    elif (flag == 1):
                        flag = 0
                        i = i+1
                else:
                    i = i+1
        return {'x': x_val_list, 'y': y_val_list,
                'z': z_val_list, 'iteration': iterations_list}

if you can drop that into a repository, get it working, write a few simple experiments and see how it operates in each of the 6 modes, we're on a roll :)
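as a sanity check on what circular-vectoring mode actually computes (the rotate-flat-against-the-x-axis behaviour described above), here's a minimal self-contained version working in radians. this is an independent sketch, not the notebook code: it folds the gain correction in and returns just the final magnitude and angle.

```python
import math

def cordic_vectoring(x, y, iterations=32):
    """Circular vectoring mode: rotate the vector (x, y) onto the x-axis,
    accumulating the rotation angle in z. Returns (magnitude, angle),
    i.e. approximately (hypot(x, y), atan2(y, x)), valid for vectors
    within roughly +/-99 degrees of the positive x-axis."""
    # CORDIC gain K: divide it out at the end to get the true magnitude
    k = 1.0
    for i in range(iterations):
        k *= math.sqrt(1 + 2.0 ** (-2 * i))
    z = 0.0
    for i in range(iterations):
        d = -1.0 if y >= 0 else 1.0   # always rotate towards y == 0
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return x / k, z
```

this is the "two answers" point again: the magnitude and the angle both fall out of the same iteration, which (with post-processing) is what gives arctan, vector length, and friends.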

(In reply to Luke Kenneth Casson Leighton from comment #53)

> ok what else can we throw in here? LOG1P is probably a good thing to try
>
> i know - somewhere - i've seen a python implementation of vector mode.
> it also specified, as you can see, the three other modes: circular,
> linear, hyperbolic.
>
> ah! i know why we want to keep the two answers: see p39 of that
> thesis: you can use them (with some post-processing) to do
> tan.

Yep, that was my thought.

> "vector" mode the vector is rotated to be flat against the x-axis,
> recording the angle (Z) needed to do so.
>
> if you can drop that into a repository, get it working, write a
> few simple experiments and see how it operates in each of the 6
> modes, we're on a roll :)

I can give it a whirl I suppose

(In reply to Michael Nolan from comment #54)

> > if you can drop that into a repository, get it working, write a
> > few simple experiments and see how it operates in each of the 6
> > modes, we're on a roll :)
>
> I can give it a whirl I suppose

when you have a moment: i noticed that the two functions are actually incredibly similar. line-by-line it's just a multiply or add, here or there.

btw i saw the commit putting ctx back in. to test that, cookie-cut FPADDMuxInOut in fpadd/pipeline.py and fpadd/test/test_fpadd_pipe.py.

the ReservationStations class takes two parameters: the bitwidth and the *number of Function Units*. it uses ctx to drop the FU number in along the pipeline, and on the output it uses that FU number to *re-associate* the result with the correct Function Unit.

runfp knows exactly how ReservationStations works, so there is no need to write any actual unit test infrastructure. there is a single-argument version of runfp (i think!) so... oh wait, you call runfp with "singleop=True". then, instead of passing in "add" as the function to compute the unit test result, you pass in "math.cos".

that will throw a massive suite of corner-cases at the pipeline, so be prepared for it to fail, as we still have to do INF, NAN, etc. :)