reed solomon uses DFT in GF
fast large multiply
discussion and notes
see v3.0C p47
current idea being explored, a new Form
byte-reversed mode SVP64 a new form would be used:
OP RT RA RC immediate
0 6 11 16 21 31
this gives *both* an immediate *and* a (dynamic) shift amount, to be able to perform the following Effective Address calculation:
EA = RA +
(bitrev(srcstep) * imm)
* the bitrev is at width log2(VL)
* imm will typically be width of the data
on the svp64 ldst imm mode
this line needs changing:
0-1 2 3 4 description
00 els dz sz normal mode
one option: when sz=dz=1, this enables bitreverse mode. algorithm is:
# Returns the integer whose value is the reverse of the lowest 'width' bits of the integer 'val'.
def reverse_bits(val, width):
result = 0
for _ in range(width):
result = (result << 1) | (val & 1)
val >>= 1
where width = log2(VL).
* updated fields to add SVD and SVDS form
* added svfixedload pseudocode
* partly updated sv ld/st page
argh, the decode for LDST is... not very clean. for the +/- twin arithemtic there is no change to the Forms, it is a matter of "if SVP64 Mode and OE=1 equals +/- twin arithmetic".
LD/ST is waay more involved, because the detection of whether SVP64 RM *itself* is in LD/ST Mode involves first identifying if the *v3.0B suffix* is a LD/ST operation.
once that is done, *then* bitreverse mode can be detected (in SVP64 RM)
once that is done, *then* the decoder can be flipped over to the new SVP64 LDST instruction forms.
the only saving grace is that the detection only needs to be on MAJOR opcodes, because only the major opcodes are big enough to fit LD/ST with immediate.
* added CSV files for major ld/st
* added CSV minor 58
both use new SVD and SVDS Form respectively
finally. working unit test for LD bit-reverse
FFT REMAP schedule sorted, started looking for butterfly variants of DCT. these are not very common, most Lee implementations are recursive. found one with a sloghtly dibious license, says "work shall not be used for immoral purposes"
butterfly schedule done and added to ISACaller, a "hack" instruction added which hardcodes a schedule, not ideal but it works.
(In reply to Luke Kenneth Casson Leighton from comment #9)
progress made adding "Vertical-First" mode which hilariously is conceptually identical to Mitch Alsup's 66000 ISA "Vector Loop" system.
TODO, create partial recursive partial iterative FFT
from cmath import exp, pi
N = len(x)
if N <= 1:
if N % 2 > 0:
raise ValueError("size of x must be a power of 2")
even = fft(x[::2])
odd = fft(x[1::2])
r = range(N//2)
T = [exp(-2j * pi * k / N) * odd[k] for k in r]
[even[k] for k in r]
res = ([even[k] + T[k] for k in r] +
[even[k] - T[k] for k in r])
input_data = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(' '.join("%5.3f" % abs(f) for f in fft(input_data)))
N = len(x)
T = exp(-2*pi*1j/N)
if N > 1:
x = fft(x[::2]) + fft(x[1::2])
for k in xrange(N/2):
xk = x[k]
x[k] = xk + T**k*x[k+N/2]
x[k+N/2] = xk - T**k*x[k+N/2]
(In reply to Luke Kenneth Casson Leighton from comment #13)
> TODO, create partial recursive partial iterative FFT
leaving this one alone for now, the simpler RADIX8 or RADIX16
is good enough.
moving on to DCT, the bink diagram and nayuki code
shows that DCT can be broken down into phases:
1) a butterfly with inversion on the top half
indices, 0 1 2 3 7 6 5 4 where for the lower
half the multiply constants can be set to zero
2) a bitreversed iterative sum on the top half
of the output, and bitreversed copy on the lower
it is quite complex, but can be done in two halves.
it *might* be possible to do the 2nd part in-place,
if WaR hazards can be tolerated for the iterative
result[0 : : 2] = alpha
result[1 : : 2] = beta
result[1 : n - 1 : 2] += beta[1 : ]
in theory this can be in-place if the bitreversing has
already been done. this requires DOUBLE bitreversing:
the data to be LD-bitreversed AND the cos schedule
to be bitreversed, result is that the cos multiplies
end up in the correct indices.
for i in range(half):
t1, t2 = vector[i], vector[n-i-1]
k = (math.cos((i + 0.5) * math.pi / n) * 2.0)
alpha[i] = t1 + t2
beta[i] = (t1 - t2) * (1/k)
alpha = transform(alpha)
beta = transform(beta )
result =  * n
for i in range(half):
result[i*2] = alpha[i]
result[i*2+1] = beta[i]
for i in range(half - 1):
result[i*2+1] += result[i*2+3]
have a demo DCT which is in-place, but as discussed on list has an annoying swap/copy which should be possible to get rid of but not straight away.
temporary workaround is to use VerticalFirst mode which needs detecting the inner loop end.
done, got rid of the in-place swap, can now be done with a first
outer butterfly and an inner one, each with their own SVP64 remap
and a single instruction.
(In reply to Luke Kenneth Casson Leighton from comment #17)
> done, got rid of the in-place swap, can now be done with a first
> outer butterfly and an inner one, each with their own SVP64 remap
> and a single instruction.
combined them both into another unit test, show the inner and outer
the next stages here are:
* calculate the COS table to be stored in regs and used horizontally
* calculate the COS coefficient on-demand i.e. Vertical-First
the only thing is, the coefficients need to be computed from an integer
which is part of the inner loop counter.
to get at it, an "iota" instruction is needed. or, more specifically,
a "get srcstep / dststep" instruction. the Vectorised version of this
would then get internal REMAP schedule loop indices
from there, those integers can be converted to floats, and
the calculation "cos (ci + 0.5) * pi / size" can be done.
quite a lot of instructions are needed which are not yet
implemented, including the new FP into INT ones.
notes on progress:
* cos table precalculation schedule added, this is a
power-2 pascal triangle 4 + 2 + 1 rather than
4 + 2+2 + 1+1+1+1
* LD-BITREV needed fixing to work with REMAP. it is
now the *memory offset* (the immediate) that is BOTH
bitrev'd AND REMAPed where previously it could only
be the RA register and of course that is a scalar only
so both brev and REMAP are meaningless for it.
there are a LOT of modes needed here. it may be necessary
to reorganise the bits around for the instructions and submodes
LD-bitreverse does not work for iDCT due to needing to be bitreversed *and* half-swapped... those operations being applied in the *opposite* order from DCT.
therefore LD-bitreverse needs to be downgraded to LD-shift and it is REMAP that needs the bitrev-halfswap and halfswap-bitrev modes.
this makes LD-shift less of a priority
LDST downgraded, REMAP upgraded to compensate.
initial exploration almost complete, NTT will have
to wait until GF ops are available.
next is SVG autogeneration.
DCT SVG image generator