Bug 173 - dynamic partitioned "shift"
Summary: dynamic partitioned "shift"
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 132
  Show dependency tree
 
Reported: 2020-02-07 17:06 GMT by Luke Kenneth Casson Leighton
Modified: 2020-02-19 14:35 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for completion of task (excludes budget allocated to subtasks): 700
parent task for budget allocation:
child tasks for budget allocation:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luke Kenneth Casson Leighton 2020-02-07 17:06:44 GMT
dynamic shift needed
very similar to mul
https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/
Comment 1 Luke Kenneth Casson Leighton 2020-02-08 16:39:04 GMT
i've started adding in some tables which make it clear what the output is
in each partition.  it turns out that we'll need *three* variants of
partitioned shift:

* full dynamic (a and b are both PartitionedSignal)
* vector-scalar (a is PartitionedSignal but b is a "global" (ordinary Signal)
* vector-const.

this last one is much more straightforward than the others.  next order
of difficulty is the vector-scalar one, and the full dynamic one is a pig.
Comment 2 Luke Kenneth Casson Leighton 2020-02-10 20:33:37 GMT
michael, would you like to have a go at this one next?
i've put some notes on the wiki page as to how i think it
*might* be done.
Comment 3 Michael Nolan 2020-02-12 16:49:22 GMT
I've got good news and bad news. The good news is I have a dynamic shifter working in part_shift/part_shift_dynamic.py. The bad news is it works out to 8500 wire bits to do a 64 bit, 8 partition shifter (ouch!).

I currently have a gigantic switch statement to select which row of the matrix is needed for each element (similar to the old part_cmp), so optimizing this would probably help greatly with the number of gates needed.
Comment 4 Michael Nolan 2020-02-12 17:00:32 GMT
(In reply to Michael Nolan from comment #3)
> I've got good news and bad news. The good news is I have a dynamic shifter
> working in part_shift/part_shift_dynamic.py. The bad news is it works out to
> 8500 wire bits to do a 64 bit, 8 partition shifter (ouch!).

*gates, not wire bits
Comment 5 Luke Kenneth Casson Leighton 2020-02-12 17:22:17 GMT
(In reply to Michael Nolan from comment #3)
> I've got good news and bad news. The good news is I have a dynamic shifter
> working in part_shift/part_shift_dynamic.py. 

superb.

> The bad news is it works out to
> 8500 gates bits to do a 64 bit, 8 partition shifter (ouch!).

w00t! :)

> I currently have a gigantic switch statement to select which row of the
> matrix is needed for each element (similar to the old part_cmp), so
> optimizing this would probably help greatly with the number of gates needed.

the yosys graph is wonderful: like a dog munched your spaghetti bolognaise
dinner then threw it up when you held the dog up at waist-level and did a
nice pirouette :)

what would reduce the number of gates significantly is getting the size
of b down, in the matrices.  see "question: b1" here:
https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/

if you look just at the last... err... column (where you are creating
the outputs that you know are going to go into the MSB-partition
(MSP? Most Significant Partition?)

in that case, you know for a fact that shifting by greater than the width
of that partition (let's say it's 8 bits wide), you *know* that anything
that shifts by greater than 8 is always, always, always going to result
in that "contribution" from the shift-matrix being zero.

therefore, what you can do is *truncate* the B-input to 3 bits!

but... butbutbut, that's not quite the full story, because actually
what you want is

    MIN(B[partitionbits], 8).

NOT, repeat, NOT:

    B[partitionbits] & (0b111) # (8-1 in binary == 0b111 mask)

even when you are dealing with the LSB-partition (LSP? Least
Significant Partition), you definitely do not want "A << B",
you want "A << B[log2(len(output))-1:0]"

"A<<B" will try to generate 64-64 shifters.

A<< B[trunc] will generate (correctly) 64-7 shifters, because anything greater than 0b1000000 is *equivalent* to 0b1000000.

some shifters take 6 bits (not 7) and assume that 0b000000 actually means "0b1000000", others will add 1 to the B operand...

... let's not worry about that for now, though, and focus on getting
the B inputs truncated somewhat?
Comment 6 Luke Kenneth Casson Leighton 2020-02-12 17:32:11 GMT
wait... no... yes, it looks like you've already done that (to some extent),
at least limited the range of the matrix partial-results.

however what's missing is the truncation of B to match the length of the
output:

                comb += matrix[i][j].eq(a << b)
                start = end

should be something like:

                outwid = matrix[i][j].shape()[0]
                bwid = math.ceil(math.log2(outwid + 1))
                comb += matrix[i][j].eq(a << b[:bwid])
                start = end

or something like that.  there's almost certainly at least two off-by-one
bugs in that :)
Comment 7 Jacob Lifshay 2020-02-12 17:52:12 GMT
one important point to remember is that we need to be able to cheaply implement the shift instructions, on RISC-V, at least, the shift amount's high bits are specifically ignored, so shifting by n bits actually gives the result of shifting by n % bit_len where bit_len is the number of bits in the field/register being shifted.
Comment 8 Michael Nolan 2020-02-12 18:58:15 GMT
(In reply to Luke Kenneth Casson Leighton from comment #5)

> the yosys graph is wonderful: like a dog munched your spaghetti bolognaise
> dinner then threw it up when you held the dog up at waist-level and did a
> nice pirouette :)

ROFL, that's far more creative than most of my insults usually are. 
I would have just said it looks like sh*t.


(In reply to Luke Kenneth Casson Leighton from comment #6)
> wait... no... yes, it looks like you've already done that (to some extent),
> at least limited the range of the matrix partial-results.
> 
> however what's missing is the truncation of B to match the length of the
> output:
> 
>                 comb += matrix[i][j].eq(a << b)
>                 start = end
> 
> should be something like:
> 
>                 outwid = matrix[i][j].shape()[0]
>                 bwid = math.ceil(math.log2(outwid + 1))
>                 comb += matrix[i][j].eq(a << b[:bwid])
>                 start = end
> 
> or something like that.  there's almost certainly at least two off-by-one
> bugs in that :)

I suspect yosys is doing some of this already, since the size of my results are limited, but I can give it a try.
Comment 9 Michael Nolan 2020-02-12 19:20:59 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> wait... no... yes, it looks like you've already done that (to some extent),
> at least limited the range of the matrix partial-results.
> 
> however what's missing is the truncation of B to match the length of the
> output:
> or something like that.  there's almost certainly at least two off-by-one
> bugs in that :)

This didn't seem to make too much of a difference. Before: 
   Number of cells:               8608
After:
   Number of cells:               8523

I also eliminated unused entries from the shift matrix which did have some effect:
   Number of cells:               8047

I suspect there's a better way of calculating what matrix entry is needed than my giant-ass switch statement may help, it seems to be selecting the row corresponding to the last set bit of the gates.
Comment 10 Luke Kenneth Casson Leighton 2020-02-12 21:31:54 GMT
yeh i ran "proc" then "opt" followed by "show top" and woooo did it take
a looong time, the mux-cascades are absolutely frickin enormous.  one
of them must be 64 bits in length, which we can't possibly go live with,
the latency will be insane.

> it seems to be selecting the row corresponding to the last set bit of
> the gates.

that sounds about right, intuitively.
Comment 11 Luke Kenneth Casson Leighton 2020-02-13 14:58:57 GMT
ok i've managed to limit the bitwidth of partial result shift matrix size
to 64,56,48,...8 because obviously we don't need to include anything
beyond that which will go into 64 bit:

   xxxx xxxx xxxx xxxx
        xxxx xxxx xxxx xxxx
             xxxx xxxx xxxx xxxx
                  xxxx xxxx xxxx xxxx

only needs to be:
   xxxx xxxx xxxx xxxx
        xxxx xxxx xxxx
             xxxx xxxx
                  xxxx

then, based on that, i'm trying to restrict the amount by which
the B input actually shifts by, because xxxx << 4-or-greater is
clearly *always* going to be 0000 therefore why bother trying,
it's just wasting gates.

however, that bit's not working yet.  see "tshift".
Comment 12 Luke Kenneth Casson Leighton 2020-02-13 15:27:57 GMT
> however, that bit's not working yet.  see "tshift".

ha, correction, actually it is :)
Comment 13 Michael Nolan 2020-02-13 18:00:38 GMT
Looks like either a0faa5a or 6527ad39 made the problem slightly worse:
   Number of wires:               8866

the previous commit, bdca76073 had ~1k less gates:
   Number of wires:               7893
Comment 14 Luke Kenneth Casson Leighton 2020-02-13 18:17:07 GMT
(In reply to Michael Nolan from comment #13)
> Looks like either a0faa5a or 6527ad39 made the problem slightly worse:
>    Number of wires:               8866
> 
> the previous commit, bdca76073 had ~1k less gates:
>    Number of wires:               7893

we need to get rid of the switch statement.  we can't possibly have
a Mux-cascade of 64 Muxes, the gate latency will kill any possibility
of the chip running over about... 200 mhz.

i've just done the other three tables in
https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/
and the pattern is beginning to emerge.
Comment 15 Luke Kenneth Casson Leighton 2020-02-13 18:18:54 GMT
(in the gt_combiner etc. the mux-cascades are only 7 or 8 long,
and that is "acceptable" gate latency.)
Comment 16 Luke Kenneth Casson Leighton 2020-02-13 18:26:04 GMT
so it looks like you go:

* start with ~p0&~p1&~p2&~p3...&~pN, and start with M=0
* select aNbM and OR that into the output, then increment M by 1
* take the front of the AND-chain, remove the invert on the FIRST item
  (~p0 becomes p0)
* select aNbM and OR that into the output, then increment M by 1 more
* DROP the front of the AND-chain (p0 on the first iteration)
* repeat until you get down to a list of only one item, this should
  be when M=max: drop the invert and select aNbM.

am i making any sense? :)
Comment 17 Michael Nolan 2020-02-13 18:53:23 GMT
> it does... however i think the Switch statement really has to go.  if
> you
> run "proc" "opt" then "show top" on a 64-bit shifter, it's awful.
> the MUX chain is absolutely dreadful: each "switch" statement gets
> turned
> into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the
> results *chained* between each!

oh. 

> * start with ~p0&~p1&~p2&~p3...&~pN, and start with M=0
> * select aNbM and OR that into the output, then increment M by 1
> * take the front of the AND-chain, remove the invert on the FIRST item
>   (~p0 becomes p0)
> * select aNbM and OR that into the output, then increment M by 1 more
> * DROP the front of the AND-chain (p0 on the first iteration)
> * repeat until you get down to a list of only one item, this should
>   be when M=max: drop the invert and select aNbM.
> 
> am i making any sense? :)

sort of, I'll give it a closer look when I work on it tomorrow.
Comment 18 Luke Kenneth Casson Leighton 2020-02-13 19:08:40 GMT
ahno, it's even simpler than that.  it's:

* start off with every gate-bit inverted
* next loop: front of list is not-inverted, rest of list is inverted
* repeat until only one item left.

i'll write a quick bit of python to print out the table.
Comment 19 Luke Kenneth Casson Leighton 2020-02-13 19:52:46 GMT
ah!

def decoderange(col, k):
    xs = (col-k)*8
    return "%d:%d" % (xs+7, xs)

def decodelist(l):
    res = []
    for (idx, sgn) in l:
        sgn = " " if sgn else "~"
        res.append("%sp%d" % (sgn, idx))
    return '&'.join(res)

N = 4
M = 4

for col in range(M):
    r = decoderange(col, 0)
    print "if True: o%d = a0b0[%s]" % (col, r)
    for i in range(1,col+1):
        l = []
        for j in range(i):
            l.append([j, False])
        k = 0
        s = decodelist(l)
        r = decoderange(col, i)
        print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
        k += 1
        while l:
            l[0][1] = True
            s = decodelist(l)
            r = decoderange(col, i)
            print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
            k += 1
            del l[0]
    print
Comment 20 Luke Kenneth Casson Leighton 2020-02-13 22:50:02 GMT
(In reply to Michael Nolan from comment #17)

> > into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the
> > results *chained* between each!
> 
> oh. 

btw that's why i advised to run yosys show, however i realise now it should include running proc and opt.
Comment 21 Jacob Lifshay 2020-02-13 22:59:39 GMT
(In reply to Michael Nolan from comment #17)
> > it does... however i think the Switch statement really has to go.  if
> > you
> > run "proc" "opt" then "show top" on a 64-bit shifter, it's awful.
> > the MUX chain is absolutely dreadful: each "switch" statement gets
> > turned
> > into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the
> > results *chained* between each!

Note that there are specific yosys optimizations for mux trees that greatly simplify them: muxpack is probably the one you want to try

http://www.clifford.at/yosys/cmd_muxpack.html
Comment 22 Luke Kenneth Casson Leighton 2020-02-14 10:55:44 GMT
(In reply to Jacob Lifshay from comment #7)
> one important point to remember is that we need to be able to cheaply
> implement the shift instructions, on RISC-V, at least, the shift amount's
> high bits are specifically ignored, so shifting by n bits actually gives the
> result of shifting by n % bit_len where bit_len is the number of bits in the
> field/register being shifted.

this needs to go into all the unit tests, and it needs to be done dynamically.
so, when a *partition length* is 8, the bitmask is 0b111 on the b operand.
when a partition length is 32, the bitmask is 0b11111 on b.

this needs to be in the formal proof as well, michael.

btw we're spending a lot of time on this one, i'm upping the budget.
Comment 23 Jacob Lifshay 2020-02-14 11:05:25 GMT
(In reply to Luke Kenneth Casson Leighton from comment #22)
> this needs to go into all the unit tests, and it needs to be done
> dynamically.
> so, when a *partition length* is 8, the bitmask is 0b111 on the b operand.
> when a partition length is 32, the bitmask is 0b1111 on b.

assuming you mean 0b11111 for 32
Comment 24 Luke Kenneth Casson Leighton 2020-02-14 11:33:06 GMT
(In reply to Jacob Lifshay from comment #23)

> assuming you mean 0b11111 for 32

yep well spotted - edited
Comment 25 Michael Nolan 2020-02-14 15:49:37 GMT
I did something similar (I think?) to your description/python code, but broken up into a couple parts

I first created a table of n partial results like so:

# calculate shift amounts - 
s[0] = b[i]
s[i] = Mux(gates[i-1], b[i], s[i-1]) 

r[i] = a[i] << s[i] for i in range(n)

Then I calculated the results by:

o[0] = r[0]
o[1] = r[1] | Mux(gates[0], 0, r[0])
o[2] = r[2] | Mux(gates[1], 0, Mux(gates[0], 0, r[0]))
etc.

This reduced the gate count from 8k gates for a 64x8 shifter, to 1.7k, and I think I can do a bit better.
Comment 26 Luke Kenneth Casson Leighton 2020-02-14 17:39:23 GMT
(In reply to Michael Nolan from comment #25)
> I did something similar (I think?) to your description/python code, but
> broken up into a couple parts

excellent.  it's getting complicated (the yosys graphs are unreadable when
at 100% zoom) so once we're happy with it i'll
break it down into modules.

> I first created a table of n partial results like so:
> 
> # calculate shift amounts - 
> s[0] = b[i]
> s[i] = Mux(gates[i-1], b[i], s[i-1]) 
> 
> r[i] = a[i] << s[i] for i in range(n)

ah rats, you took out the "truncation" that i put on B.
ah well - that can (has to) go back in, btw, however with
the Switch statement removed that's well on the way
 
> Then I calculated the results by:
> 
> o[0] = r[0]
> o[1] = r[1] | Mux(gates[0], 0, r[0])
> o[2] = r[2] | Mux(gates[1], 0, Mux(gates[0], 0, r[0]))
> etc.
> 
> This reduced the gate count from 8k gates for a 64x8 shifter, to 1.7k, and I
> think I can do a bit better.

eeexxcellent, muhahaha

oh.. err... i thiiink i see what you did.  you actually selected
the correct "B" to perform the shift of the output, based on the
partition-info?

ee.... :)
Comment 27 Michael Nolan 2020-02-14 17:46:05 GMT
(In reply to Luke Kenneth Casson Leighton from comment #26)

> ah rats, you took out the "truncation" that i put on B.
> ah well - that can (has to) go back in, btw, however with
> the Switch statement removed that's well on the way

Yes, wanted to get it working without worrying about truncation complicating things. Working on adding it back in now though


> oh.. err... i thiiink i see what you did.  you actually selected
> the correct "B" to perform the shift of the output, based on the
> partition-info?
> 
> ee.... :)

Yup!
Comment 28 Luke Kenneth Casson Leighton 2020-02-14 18:18:17 GMT
(In reply to Michael Nolan from comment #27)
> (In reply to Luke Kenneth Casson Leighton from comment #26)
> 
> > ah rats, you took out the "truncation" that i put on B.
> > ah well - that can (has to) go back in, btw, however with
> > the Switch statement removed that's well on the way
> 
> Yes, wanted to get it working without worrying about truncation complicating
> things. Working on adding it back in now though

exccellent.  remember, it's *sigh* not as straightforward as truncating
(bit-wise)... well, you saw what i had in the original, with the comparators.

a *bit-wise* truncation also needs to be *shudder* based on the partition
length.

i think we can safely round-up 24-bit and other non-power-of-two partition
sizes, on the assumption that anyone trying to use this code for that purpose
is a bit... unhinged :)
Comment 29 Michael Nolan 2020-02-14 20:53:52 GMT
> exccellent.  remember, it's *sigh* not as straightforward as truncating
> (bit-wise)... well, you saw what i had in the original, with the comparators.

> a *bit-wise* truncation also needs to be *shudder* based on the partition
> length.

This has been added to the shifter as well as the proof in the latest commit
Comment 30 Luke Kenneth Casson Leighton 2020-02-14 21:12:54 GMT
(In reply to Michael Nolan from comment #29)

> > a *bit-wise* truncation also needs to be *shudder* based on the partition
> > length.
> 
> This has been added to the shifter as well as the proof in the latest commit

okaaay, i'll work it out for test_partsig.py
Comment 31 Luke Kenneth Casson Leighton 2020-02-14 22:04:31 GMT
(In reply to Luke Kenneth Casson Leighton from comment #30)

> okaaay, i'll work it out for test_partsig.py

got it.  bit messy, but it works.
Comment 32 Luke Kenneth Casson Leighton 2020-02-15 12:29:01 GMT
just doing some review / commenting / etc.  this one's quite important:

diff --git a/src/ieee754/part_shift/part_shift_dynamic.py b/src/ieee754/part_shift/part_shift_dynamic.py
index 5dd72ed..95c07d6 100644
--- a/src/ieee754/part_shift/part_shift_dynamic.py
+++ b/src/ieee754/part_shift/part_shift_dynamic.py
@@ -63,12 +63,14 @@ class PartitionedDynamicShift(Elaboratable):
         for i in range(len(b_intervals)):
             mask = Signal(b_intervals[i].shape(), name="shift_mask%d" % i,
                           reset_less=True)
-            bits = []
+            bits = Signal(gates.width-i+1, name="bits%d" % i, reset_less=True)
+            bl = []
             for j in range(i, gates.width):
-                if bits:
-                    bits.append(~gates[j] & bits[-1])
+                if bl:
+                    bl.append(~gates[j] & bits[j-i-1])
                 else:
-                    bits.append(~gates[j])
+                    bl.append(~gates[j])
+            comb += bits.eq(Cat(*bl))
             comb += mask.eq(Cat((1 << min_bits)-1, bits)
                             & ((1 << max_bits)-1))
             shifter_masks.append(mask)


if you "accumulate" something in a chain (bits, in this case), then assign it
later (in mask), what you find is that the entire *chain* is duplicated... *multiple times*.

by assigning each bit of "bits" to a Signal, and using the *signal* - not the chain - you don't end up with a massive cascading code-repetition.

i deployed this trick in another area of the code as well.  this one's
quite important: it looks "really simple" to just do bits = bits & foo
or bits.append(input & bits[-1]) however it's a pattern that really needs
to be avoided.

yes, found it:

            element = Mux(gates[i-1], masked, element)
            elmux = Signal(b_intervals[i].shape(), name="elmux%d" % i,
                          reset_less=True)
            comb += elmux.eq(element)
            element = elmux

that was another one where there was an accidental "repeated chain", this
time from the input to Mux().


i'm trying to identify where the max() thing can be put back in.  i think
it's where         shiftbits = math.ceil(math.log2(width))

that needs to be inside the loop, and it needs to be
     shiftbits = math.ceil(math.log2(width-intervals[i].start))

something like that.

then, the b partial... shifter... needs to be set not to this:

               comb += shifter.eq(element)

it needs to be

    m.If(element > shiftbits)
         comb += shifter.eq(shiftbits)
    m.Else()
         comb += shifter.eq(element)

i think that's it.

the reason is, there's absolutely no point trying to shift the data beyond
the limit of the partial.

also, partial needs to be limited to width-intervals[i].start

the last column is only 8 bits, for example, so why is the partial result
always set to 64 bit?
Comment 33 Luke Kenneth Casson Leighton 2020-02-15 12:43:10 GMT
another one.  again, the Mux-chain result = Mux(gates, partial, result)
was creating a massive chain that was being duplicated:

output[0] = result
output[1] = Mux(gates, partial, result)
output[2] = Mux(gates, partial, Mux(gates, partial, result))
...
output[N] = Mux(Mux(Mux(Mux(Mux(Mux(......))))))))))

this does *NOT* get "optimised out" so you have to be extremely careful
to make absolutely sure that the input on these kinds of chains is
*not* an "expression", but *only* a signal to which the expression has
been *assigned*.


index 9e214d0..aebcef5 100644
--- a/src/ieee754/part_shift/part_shift_dynamic.py
+++ b/src/ieee754/part_shift/part_shift_dynamic.py
@@ -145,6 +145,7 @@ class PartitionedDynamicShift(Elaboratable):
             print("select: [%d:%d]" % (start, end))
             res = Signal(width, name="res%d" % i, reset_less=True)
             comb += res.eq(result)
+            result = res
             s,e = intervals[0]
             out.append(res[s:e])
Comment 34 Luke Kenneth Casson Leighton 2020-02-15 17:42:52 GMT
ok i *think* i worked it out, and also did the split into separate modules.
it's not perfect by any means.

i tried running the static shift, it's not having any of it, btw.
i'll add it into test_partsig.py ready for when it works?
Comment 35 Luke Kenneth Casson Leighton 2020-02-17 10:53:10 GMT
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_shift/part_shift_scalar.py;hb=HEAD

ah! needs to have the truncation on the b input, doesn't it?  same as in the dynamic one, depending on partition width?
Comment 36 Michael Nolan 2020-02-17 14:30:53 GMT
(In reply to Luke Kenneth Casson Leighton from comment #35)
> ah! needs to have the truncation on the b input, doesn't it?  same as in the
> dynamic one, depending on partition width?

Yep, you're right. Working on that now
Comment 37 Luke Kenneth Casson Leighton 2020-02-17 18:14:10 GMT
(In reply to Michael Nolan from comment #36)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > ah! needs to have the truncation on the b input, doesn't it?  same as in the
> > dynamic one, depending on partition width?
> 
> Yep, you're right. Working on that now

haha that's funny, you ended up importing ShifterMask, i'd managed to
forget it could be used, even though i was just thinking, "hmm,
the partition mask code would end up being duplicated between both
modules" :)
Comment 38 Michael Nolan 2020-02-18 15:31:40 GMT
I'm guessing this will need to be pipelined at some point, would you like me to give that a try?
Comment 39 Luke Kenneth Casson Leighton 2020-02-18 21:36:43 GMT
(In reply to Michael Nolan from comment #38)
> I'm guessing this will need to be pipelined at some point, would you like me
> to give that a try?

well, before that, we need to see if it's actually needed, by checking the
longest path (longest chain).

you rarely hear of processors breaking down a 64 bit shift into pipelines.
looking at RocketChip's ALU, that's gone to 1.5ghz in 45nm and they've
not done pipelining.  and, because of the breakdown into smaller sections
we might actually have *reduced* latency.

btw scalar formal proof is failing, here
python3 ieee754/part_shift/formal/proof_shift_scalar.py 
SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd proof_shift_scalar_shift/src; yosys -ql ../model/design.log ../model/design.ys"
SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved using constant.
SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved using constant.
...
...
SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ##   0:00:00  Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:50): 0
SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ##   0:00:00  Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:51): 0
SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ##   0:00:00  Value for anyconst in top (ieee754/part_shift/formal/proof_shift_scalar.py:49): 32768
SBY 21:35:29 [proof_shift_scalar_shift] engine_0: ##   0:00:00  Assert failed in top: ieee754/part_shift/formal/proof_shift_scalar.py:67
Comment 40 Michael Nolan 2020-02-19 00:11:11 GMT
(In reply to Luke Kenneth Casson Leighton from comment #39)



> btw scalar formal proof is failing, here
> python3 ieee754/part_shift/formal/proof_shift_scalar.py 
> SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd
> proof_shift_scalar_shift/src; yosys -ql ../model/design.log
> ../model/design.ys"
> SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver
> conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved
> using constant.
> SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver
> conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved
> using constant.

Both proofs work on my machine on commit d635235. I'm using yosys 0.9-1706 (b9dfdbbf).
Comment 41 Luke Kenneth Casson Leighton 2020-02-19 04:28:44 GMT
(In reply to Michael Nolan from comment #40)
> (In reply to Luke Kenneth Casson Leighton from comment #39)
> 
> 
> 
> > btw scalar formal proof is failing, here
> > python3 ieee754/part_shift/formal/proof_shift_scalar.py 
> > SBY 21:35:29 [proof_shift_scalar_shift] base: starting process "cd
> > proof_shift_scalar_shift/src; yosys -ql ../model/design.log
> > ../model/design.ys"
> > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver
> > conflict for $16 [15] between cell $17.Y and constant 1'0 in dut: Resolved
> > using constant.
> > SBY 21:35:29 [proof_shift_scalar_shift] base: Warning: Driver-driver
> > conflict for $16 [14] between cell $17.Y and constant 1'0 in dut: Resolved
> > using constant.
> 
> Both proofs work on my machine on commit d635235. I'm using yosys 0.9-1706
> (b9dfdbbf).

yehh that can happen and i am using 0.8. will upgrade see what happens
Comment 42 Luke Kenneth Casson Leighton 2020-02-19 14:07:27 GMT
ahh goood.

ok next thing, i get the unit test up and running, the one which recognises
Const.  then we can sign this one off.

lkcl@fizzy:~/src/libreriscv/ieee754fpu/src$ python3 ieee754/part_shift/formal/proof_shift_scalar.py 
{8: (slice (sig gates) 0:1), 16: (slice (sig gates) 1:2)}
.{8: (slice (sig gates) 0:1), 16: (slice (sig gates) 1:2)}
.
----------------------------------------------------------------------
Ran 2 tests in 0.307s

OK
lkcl@fizzy:~/src/libreriscv/ieee754fpu/src$ yosys -V
Yosys 0.9+1706 (git sha1 c7af1b22, clang 7.0.0-svn342187-1~exp1~20180919215158.32 -fPIC -Os)
Comment 43 Luke Kenneth Casson Leighton 2020-02-19 14:34:05 GMT
(In reply to Luke Kenneth Casson Leighton from comment #42)
> ahh goood.
> 
> ok next thing, i get the unit test up and running, the one which recognises
> Const. 

ok i got it to recognise Signal()

i'm happy with it.  next :)