168 – create naturally aligned partition points

Bug 168 - create naturally aligned partition points

Summary: create naturally aligned partition points

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	ALU (including IEEE754 16/32/64-bit FPU) (show other bugs)
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:

Depends on:
Blocks:	132
	Show dependency tree / graph

Reported:	2020-02-03 15:06 GMT by Luke Kenneth Casson Leighton
Modified:	2020-02-13 18:09 GMT (History)
CC List:	2 users (show)

See Also:
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2020-02-03 15:06:13 GMT

see http://bugs.libre-riscv.org/show_bug.cgi?id=132#c9

Comment 1 Michael Nolan 2020-02-13 13:50:47 GMT

I had an idea about this last night. If it were acceptable that all the partitions be the same size, then something like this seems reasonable:

   c   b   c   a   c   b   c
xxx|xxx|xxx|xxx|xxx|xxx|xxx|xxx

partition bits: abc

Guaranteeing that all the partitions are the same size and naturally aligned is just a matter of ensuring that if a partition bit is set, then all previous partition bits must also be set.

If we need to have different width partitions, something like this might work, though it's a bit messier:
   a   b   c   d   e   f   g
xxx|xxx|xxx|xxx|xxx|xxx|xxx|xxx

partition bits: dbfaceg
or 
partition bits: dbacfeg

My idea was that it'd be sort of like a serialized binary tree. To ensure that the partitions are naturally aligned, a given bit can only be set if its parent bits are set, which seems harder to check/enforce than the method above

Comment 2 Luke Kenneth Casson Leighton 2020-02-13 14:19:47 GMT

ok so the idea is, that the only options for partition sizes are:

* 64
* 32,32
* 16,16,16,16
* 8,8,8,8,8,8,8,8

is that the idea?

if so, this restricts us to not being able to run 32-bit arithmetic in one "lane" and 16-16 bit arithmetic in the other.

the way that the 6600 engine and register file is to be arranged is that
the register file is subdivided 32-HI, 32-LO times two.  so, four ports,
but only 32-bit-wide.  if you want to do 64-bit arithmetic, you have to
use *two* of those ports.

also, each *byte* of the register file has its own write-enable line.

this so that on vectorised instructions, if there are 32-bit instructions
that happen to hit the 32-LO register port, the 32-*HI* port can be used
for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN*
to occur (or are deliberately arranged to occur) on the exact same cycle
and happen to be the exact same operation.

now, whether these conditions turn out to be reasonable or not is another
matter, hence why, yeah, it should be fine to consider this, and thus
perhaps greatly simplify the partitioning.

would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8
adds?  i don't honestly know.

Comment 3 Michael Nolan 2020-02-13 17:32:33 GMT

(In reply to Luke Kenneth Casson Leighton from comment #2)
> ok so the idea is, that the only options for partition sizes are:
> 
> * 64
> * 32,32
> * 16,16,16,16
> * 8,8,8,8,8,8,8,8
> 
> is that the idea?

Yes

> 
> if so, this restricts us to not being able to run 32-bit arithmetic in one
> "lane" and 16-16 bit arithmetic in the other.

> 
> this so that on vectorised instructions, if there are 32-bit instructions
> that happen to hit the 32-LO register port, the 32-*HI* port can be used
> for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN*
> to occur (or are deliberately arranged to occur) on the exact same cycle
> and happen to be the exact same operation.
> 
> now, whether these conditions turn out to be reasonable or not is another
> matter, hence why, yeah, it should be fine to consider this, and thus
> perhaps greatly simplify the partitioning.
> 
> would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8
> adds?  i don't honestly know.

This would make scheduling a bit more complicated, but it might be beneficial to do this only for some modules. I don't think it'd make a huge difference for the adder or comparator to use an aligned partition, but it might simplify the shifter a good bit (because it eliminates a couple of the matrix entries).

Comment 4 Luke Kenneth Casson Leighton 2020-02-13 17:55:18 GMT

(In reply to Michael Nolan from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > ok so the idea is, that the only options for partition sizes are:
> > 
> > * 64
> > * 32,32
> > * 16,16,16,16
> > * 8,8,8,8,8,8,8,8
> > 
> > is that the idea?
> 
> Yes
> 
> > 
> > if so, this restricts us to not being able to run 32-bit arithmetic in one
> > "lane" and 16-16 bit arithmetic in the other.
> 
> > 
> > this so that on vectorised instructions, if there are 32-bit instructions
> > that happen to hit the 32-LO register port, the 32-*HI* port can be used
> > for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN*
> > to occur (or are deliberately arranged to occur) on the exact same cycle
> > and happen to be the exact same operation.
> > 
> > now, whether these conditions turn out to be reasonable or not is another
> > matter, hence why, yeah, it should be fine to consider this, and thus
> > perhaps greatly simplify the partitioning.
> > 
> > would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8
> > adds?  i don't honestly know.
> 
> This would make scheduling a bit more complicated,

i'm anticipating it to be quite straightforward, by way of pushing the
"predicate bits" directly into the regfile write-enable lines, and to
breaking down operations into 32-bit "chunks".

so, where a sequence of elements (say 16 bit) are to be ADDed, that will
be "converted" into 2x 16-16 SIMD operations: one will go to HI-32 regfile,
the other will go to LO-32 regfile.

it's pretty straightforward.  it'll be slightly wasteful where the vector
length is not an exact multiple of 32-bits (3x8 for example) however as
a first iteration i'm not that concerned.


> but it might be beneficial to do this only for some modules. 

honestly it would complicate the decode phase, along these lines:
"if operation == NOT_CAPABLE_OF_DYNAMIC_PARTITIONING { do something else }"

whether that's ok compared to the complexity of the partitioned ALU ops?

> I don't think it'd make a huge
> difference for the adder or comparator to use an aligned partition, but it
> might simplify the shifter a good bit (because it eliminates a couple of the
> matrix entries).

it does... however i think the Switch statement really has to go.  if you
run "proc" "opt" then "show top" on a 64-bit shifter, it's awful.
the MUX chain is absolutely dreadful: each "switch" statement gets turned
into a "if x == 0b00001 if x == 0b00010 if x == 0b000011"... with the
results *chained* between each!

by comparison, for the gt_combiner, the mux-chains aren't 64-bit long,
they're only 7-long, because they're done on the *partition* gates,
not per-permutation-of-all-values-of-partitions.

you did manage to convert the "switch statement" from the original that
i did, of eq_combiner, and i am confident that the same thing can be done
here, based on the tables:

https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/

the only thing being, each table (each column output, o0...o7) is
computed independently, you can't share data *between* each column,
and that's fine.

Comment 5 Luke Kenneth Casson Leighton 2020-02-13 18:09:35 GMT

okaaay i decoded those tables a bit, into "if" statements:

    if True:          o3 = a0b0[31:24]
    if ~p0:           o3 |= a1b0[23:16]
    if  p0:           o3 |= a1b1[23:16]
    if ~p0&p1:        o3 |= a2b1[15:8]
    if  p0&p1:        o3 |= a2b0[15:8]
    if      p1:       o3 |= a2b2[15:8]
    if ~p0&~p1&~p2:   o3 |= a3b0
    if  p0&~p1&~p2:   o3 |= a3b1
    if      p1&~p2:   o3 |= a3b2
    if          p2:   o3 |= a3b3

that's quite easy to do as a parallel tree of ORs (see the treereduce function
i created which should be moved to nmutil, really)
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/regfile/regfile.py;h=b1d6f1c6717351f7b38be806bb25462e51f3bbf0;hb=6f35af64465700971bce1e4dae1f7e69c2ec25ad#l68

aNbM is basically matrix[N][M]

is that beginning to look a little clearer?  i'll do the other tables as well