Bug 653 - investigate FFT, DCT, etc for REMAP in SVP64
Summary: investigate FFT, DCT, etc for REMAP in SVP64
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL: https://libre-soc.org/openpower/sv/re...
Depends on: 658
Blocks: 645
  Show dependency treegraph
 
Reported: 2021-06-20 12:31 BST by Luke Kenneth Casson Leighton
Modified: 2021-10-03 22:25 BST (History)
2 users (show)

See Also:
NLnet milestone: NLNet.2019.10.Standards
total budget (EUR) for completion of task and all subtasks: 1600
budget (EUR) for this task, excluding subtasks' budget: 1600
parent task for budget allocation: 242
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl=1600


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2021-06-20 12:31:47 BST
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
https://www.programmersought.com/article/38835154723/
http://pepijndevos.nl/2018/07/04/loefflers-discrete-cosine-transform-algorithm-in-futhark.html
https://github.com/gk7huki/pydct/blob/master/dct.py
https://www.kaggle.com/thesherpafromalabama/2d-dct2-3-implementation
https://cs.stanford.edu/people/eroberts/courses/soco/projects/data-compression/lossy/jpeg/dct.htm
http://www-personal.umich.edu/~mejn/computational-physics/dcst.py
https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
https://www.spiral.net/doc/papers/ffte-spiral.pdf
https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft-complex.c
https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft-real-pair.c
https://www.nayuki.io/page/fast-discrete-cosine-transform-algorithms
https://par.nsf.gov/servlets/purl/10137134
https://github.com/HazyResearch/learning-circuits
https://www.codeproject.com/Articles/151043/Iterative-Fast-1D-Forvard-DCT
https://fgiesen.wordpress.com/2013/11/04/bink-2-2-integer-dct-design-part-1/
https://www.researchgate.net/publication/309704590_Early_Determination_of_Zero-Quantized_8_8_DCT_Coefficients
https://www.ijaiem.org/volume3issue5/IJAIEM-2014-05-31-117.pdf
https://www.codeproject.com/KB/recipes/Fast_DCT/Lee_s_algorithm.gif

reed solomon uses DFT in GF
https://github.com/tomerfiliba/reedsolomon/blob/master/reedsolo.py
https://github.com/Bulat-Ziganshin/FastECC/blob/master/ntt.cpp
https://github.com/lrq3000/pyFileFixity/blob/master/TODO.md
https://tmo.jpl.nasa.gov/progress_report2/42-35/35K.PDF

fast large multiply
https://arxiv.org/pdf/1503.04955
Comment 1 Luke Kenneth Casson Leighton 2021-06-22 14:54:47 BST
discussion and notes
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003198.html
Comment 2 Luke Kenneth Casson Leighton 2021-06-22 17:57:58 BST
see v3.0C p47

current idea being explored, a new Form
(SVD-Form?)
byte-reversed mode SVP64 a new form would be used:

OP RT RA RC immediate
0  6  11 16 21     31

this gives *both* an immediate *and* a (dynamic) shift amount, to be able to perform the following Effective Address calculation:

EA = RA +
     (bitrev(srcstep) * imm)
     << RC

* the bitrev is at width log2(VL) 
* imm will typically be width of the data
Comment 3 Luke Kenneth Casson Leighton 2021-06-22 19:23:09 BST
on the svp64 ldst imm mode
this line needs changing:

0-1	2	3 4	description
00	els	dz sz	normal mode

one option: when sz=dz=1, this enables bitreverse mode.  algorithm is:

# Returns the integer whose value is the reverse of the lowest 'width' bits of the integer 'val'.
def reverse_bits(val, width):
	result = 0
	for _ in range(width):
		result = (result << 1) | (val & 1)
		val >>= 1
	return result

where width = log2(VL).
Comment 4 Luke Kenneth Casson Leighton 2021-06-23 13:38:28 BST
* updated fields to add SVD and SVDS form
* added svfixedload pseudocode
* partly updated sv ld/st page
Comment 5 Luke Kenneth Casson Leighton 2021-06-23 19:47:16 BST
argh, the decode for LDST is... not very clean.  for the +/- twin arithemtic there is no change to the Forms, it is a matter of "if SVP64 Mode and OE=1 equals +/- twin arithmetic".

LD/ST is waay more involved, because the detection of whether SVP64 RM *itself* is in LD/ST Mode involves first identifying if the *v3.0B suffix* is a LD/ST operation.

once that is done, *then* bitreverse mode can be detected (in SVP64 RM)

once that is done, *then* the decoder can be flipped over to the new SVP64 LDST instruction forms.

argh.

that's awful.

the only saving grace is that the detection only needs to be on MAJOR opcodes, because only the major opcodes are big enough to fit LD/ST with immediate.
Comment 6 Luke Kenneth Casson Leighton 2021-06-23 22:12:27 BST
* added CSV files for major ld/st
* added CSV minor 58

both use new SVD and SVDS Form respectively
Comment 7 Luke Kenneth Casson Leighton 2021-06-26 18:49:53 BST
finally. working unit test for LD bit-reverse
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=8f224196e18dcea037ec6c8ea1b2e38e251fe401
Comment 8 Luke Kenneth Casson Leighton 2021-07-04 01:14:17 BST
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap_fft_yield.py;hb=HEAD

FFT REMAP schedule sorted, started looking for butterfly variants of DCT.  these are not very common, most Lee implementations are recursive.  found one with a sloghtly dibious license, says "work shall not be used for immoral purposes"
Comment 10 Luke Kenneth Casson Leighton 2021-07-07 23:39:26 BST
butterfly schedule done and added to ISACaller, a "hack" instruction added which hardcodes a schedule, not ideal but it works.
Comment 11 Jacob Lifshay 2021-07-07 23:41:41 BST
(In reply to Luke Kenneth Casson Leighton from comment #9)
> https://en.wikipedia.org/wiki/Discrete_cosine_transform#/media/File:
> Single_butterfly_of_the_3-D_DCT-II_VR_DIF_algorithm.jpg

Better url:
https://en.wikipedia.org/wiki/File:Single_butterfly_of_the_3-D_DCT-II_VR_DIF_algorithm.jpg
Comment 12 Luke Kenneth Casson Leighton 2021-07-09 11:52:29 BST
progress made adding "Vertical-First" mode which hilariously is conceptually identical to Mitch Alsup's 66000 ISA "Vector Loop" system.
Comment 13 Luke Kenneth Casson Leighton 2021-07-16 23:03:58 BST
TODO, create partial recursive partial iterative FFT




import sys

from cmath import exp, pi

def fft(x):
    N = len(x)
    if N <= 1: 
        return x
    if N % 2 > 0:
        raise ValueError("size of x must be a power of 2")
    even = fft(x[::2])
    odd =  fft(x[1::2])
    r = range(N//2)
    T = [exp(-2j * pi * k / N) * odd[k] for k in r]
    [even[k] for k in r]
    res = ([even[k] + T[k] for k in r] +
           [even[k] - T[k] for k in r])
    return res

input_data = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(' '.join("%5.3f" % abs(f) for f in fft(input_data)))

clearer:

def fft(x):
    N = len(x)
    T = exp(-2*pi*1j/N)
    if N > 1:
        x = fft(x[::2]) + fft(x[1::2])
        for k in xrange(N/2):
            xk = x[k]
            x[k] = xk + T**k*x[k+N/2]
            x[k+N/2] = xk - T**k*x[k+N/2]
    return x
Comment 14 Luke Kenneth Casson Leighton 2021-07-17 21:35:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> TODO, create partial recursive partial iterative FFT

leaving this one alone for now, the simpler RADIX8 or RADIX16
is good enough.

moving on to DCT, the bink diagram and nayuki code
shows that DCT can be broken down into phases:

1) a butterfly with inversion on the top half
   indices, 0 1 2 3 7 6 5 4 where for the lower
   half the multiply constants can be set to zero

2) a bitreversed iterative sum on the top half
   of the output, and bitreversed copy on the lower
   half.

it is quite complex, but can be done in two halves.
it *might* be possible to do the 2nd part in-place,
if WaR hazards can be tolerated for the iterative
sum.
Comment 15 Luke Kenneth Casson Leighton 2021-07-17 22:51:15 BST
    result[0 : : 2] = alpha
    result[1 : : 2] = beta
    result[1 : n - 1 : 2] += beta[1 : ]

in theory this can be in-place if the bitreversing has
already been done.  this requires DOUBLE bitreversing:
the data to be LD-bitreversed AND the cos schedule
to be bitreversed, result is that the cos multiplies
end up in the correct indices.

for i in range(half):
    t1, t2 = vector[i], vector[n-i-1]
    k = (math.cos((i + 0.5) * math.pi / n) * 2.0)
    alpha[i] = t1 + t2
    beta[i] = (t1 - t2) * (1/k)
alpha = transform(alpha)
beta  = transform(beta )
result = [0] * n
for i in range(half):
    result[i*2] = alpha[i]
    result[i*2+1] = beta[i]
for i in range(half - 1):	
	result[i*2+1] += result[i*2+3]
Comment 16 Luke Kenneth Casson Leighton 2021-07-19 19:49:44 BST
have a demo DCT which is in-place, but as discussed on list has an annoying swap/copy which should be possible to get rid of but not straight away.

temporary workaround is to use VerticalFirst mode which needs detecting the inner loop end.
Comment 17 Luke Kenneth Casson Leighton 2021-07-23 12:04:17 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD

done, got rid of the in-place swap, can now be done with a first
outer butterfly and an inner one, each with their own SVP64 remap
and a single instruction.
Comment 18 Luke Kenneth Casson Leighton 2021-07-24 08:38:30 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/
> decoder/isa/remap_dct_yield.py;hb=HEAD
> 
> done, got rid of the in-place swap, can now be done with a first
> outer butterfly and an inner one, each with their own SVP64 remap
> and a single instruction.

combined them both into another unit test, show the inner and outer
butterfly back-to-back
Comment 19 Luke Kenneth Casson Leighton 2021-07-24 10:33:32 BST
the next stages here are:

* calculate the COS table to be stored in regs and used horizontally
* calculate the COS coefficient on-demand i.e. Vertical-First

the only thing is, the coefficients need to be computed from an integer
which is part of the inner loop counter.

to get at it, an "iota" instruction is needed.  or, more specifically,
a "get srcstep / dststep" instruction.  the Vectorised version of this
would then get internal REMAP schedule loop indices

from there, those integers can be converted to floats, and
the calculation "cos (ci + 0.5) * pi / size" can be done.

quite a lot of instructions are needed which are not yet
implemented, including the new FP into INT ones.
Comment 20 Luke Kenneth Casson Leighton 2021-07-28 15:53:19 BST
notes on progress:

* cos table precalculation schedule added, this is a
  power-2 pascal triangle 4 + 2 + 1 rather than
  4 + 2+2 + 1+1+1+1
* LD-BITREV needed fixing to work with REMAP. it is
  now the *memory offset* (the immediate) that is BOTH
  bitrev'd AND REMAPed where previously it could only
  be the RA register and of course that is a scalar only
  so both brev and REMAP are meaningless for it.

there are a LOT of modes needed here. it may be necessary
to reorganise the bits around for the instructions and submodes
Comment 21 Luke Kenneth Casson Leighton 2021-08-01 16:57:10 BST
LD-bitreverse does not work for iDCT due to needing to be bitreversed *and* half-swapped... those operations being applied in the *opposite* order from DCT.

therefore LD-bitreverse needs to be downgraded to LD-shift and it is REMAP that needs the bitrev-halfswap and halfswap-bitrev modes.

this makes LD-shift less of a priority
Comment 22 Luke Kenneth Casson Leighton 2021-08-02 22:20:07 BST
LDST downgraded, REMAP upgraded to compensate.
initial exploration almost complete, NTT will have
to wait until GF ops are available.
next is SVG autogeneration.
Comment 23 Luke Kenneth Casson Leighton 2021-08-10 18:58:19 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=413c083e4dd8452fae7c4e426851f1c19a7a9b8c

DCT SVG image generator