Bug 552 - single-predication has "splat" capability, needs review
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 213
Reported: 2020-12-23 18:47 GMT by Luke Kenneth Casson Leighton
Modified: 2020-12-24 01:34 GMT
2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:



Description Luke Kenneth Casson Leighton 2020-12-23 18:47:34 GMT
function op_add(rd, rs1, rs2)  # add not VADD!
  int i, id=0, irs1=0, irs2=0;
  predval = get_pred_val(FALSE, rd);
  rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
  for (i = 0; i < VL; i++) {
    STATE.srcoffs = i;  # save context
    if (predval & 1<<i) {  # predication uses intregs
       ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
       if (!int_vec[rd ].isvector) break;
    }
    if (int_vec[rd ].isvector)  { id   += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }
    if (id == VL or irs1 == VL or irs2 == VL) {
      # end VL hardware loop
      STATE.srcoffs = 0;  # reset
      STATE.ssvoffs = 0;  # reset
      return;
    }
  }
Comment 1 Luke Kenneth Casson Leighton 2020-12-23 21:01:35 GMT
right.  we need the following:

RT.v = RA.v RB.v
RT.v = RA.v RB.s and s/v
RT.v = RA.s RB.s
RT.s = RA.v RB.v
RT.s = RA.v RB.s and s/v
RT.s = RA.s RB.s

vvv no problem, obvious
vvs and vsv, no problem, obvious
sss also obvious
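the obvious cases can be sketched as a minimal Python model (hypothetical names, not the spec pseudocode): a scalar operand simply keeps element index 0 for the whole VL loop, and a scalar destination stops after the first write. note that the same loop, fed scalar sources and a vector dest, naturally produces the splat behaviour discussed below.

```python
# Hypothetical model of the vector/scalar operand cases.
# rt/ra/rb booleans: True = vector operand, False = scalar operand.
# Register "files" are modelled as plain Python lists.

def sv_add(rt_vec, ra_vec, rb_vec, rt, ra, rb, vl):
    for i in range(vl):
        ir = i if rt else 0   # scalar dest: always element 0
        ia = i if ra else 0   # scalar src: stays on element 0
        ib = i if rb else 0
        rt_vec[ir] = ra_vec[ia] + rb_vec[ib]
        if not rt:            # scalar destination: one write, then stop
            break

# vvv: straightforward element-wise add
regs = [0, 0, 0, 0]
sv_add(regs, [1, 2, 3, 4], [10, 20, 30, 40], True, True, True, 4)
# regs is now [11, 22, 33, 44]

# RT.v = RA.s RB.s: every iteration reads element 0 of both sources,
# so the scalar result lands in all VL destination elements (splat)
splat = [0, 0, 0, 0]
sv_add(splat, [5, 0, 0, 0], [2, 0, 0, 0], True, False, False, 4)
# splat is now [7, 7, 7, 7]
```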

slightly less obvious:

RT.s = RA.v RB.v

which would take the first non-masked-out elements from the RA/RB vector sources

and 

RT.s = RA.v RB.s and s/v

which is a variant of the above, taking the first non-masked-out vector element from one of the sources.

so with some thought it really is just this one not obvious case:

RT.v = RA.s RB.s

which i originally expected would stop at the first element, but it could be interpreted as scalar-result-splat.

question: what behaviour do we want? what *actual* behaviour?

* scalar-scalar (yep, covered)
* vector-vector (likewise)
* picker from vector into scalar (covered)
* scalar insert into vector (covered by single bit of predicate)

well, all of those are covered regardless of how the ambiguous case works.

i.e. the "splat" capability, which can also be a masked-splat, is a superset of the single insert.
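the superset relationship can be shown with a tiny hedged sketch (hypothetical function name): masked splat with an arbitrary predicate covers the full splat, and a predicate with exactly one bit set degenerates to scalar-insert-into-vector.

```python
# Hypothetical masked-splat model: copy a scalar value into every
# destination element whose predicate bit is set.

def masked_splat(dest, value, pred, vl):
    for i in range(vl):
        if pred & (1 << i):
            dest[i] = value
    return dest

# full splat: all-ones predicate
masked_splat([0, 0, 0, 0], 7, 0b1111, 4)      # -> [7, 7, 7, 7]

# scalar insert: exactly one predicate bit set
masked_splat([1, 2, 3, 4], 7, 0b0100, 4)      # -> [1, 2, 7, 4]
```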

therefore, conclusion: good call, jacob, it's a good idea to have this.

implementation-wise (high-performance-wise) it will be a pain, but we will manage, as i think it's worth it.

i will make a note in the spec.
Comment 2 Jacob Lifshay 2020-12-23 21:06:34 GMT
Note that splat will be very common in graphics code (I'd randomly guess 10-20% of instructions, though a lot of those can be done by having a scalar source on a vector op), so we will probably want to take the approach where we have the one scalar ALU and just write to multiple destination registers.
Comment 3 Luke Kenneth Casson Leighton 2020-12-23 21:26:15 GMT
(In reply to Jacob Lifshay from comment #2)
> Note that splat will be very common in graphics code (I'd randomly guess
> 10-20% of instructions, though a lot of those can be done by having a scalar
> source on a vector op), so we will probably want to take the approach where
> we have the one scalar ALU and just write to multiple destination registers.

this was the bit which was the "pain".  effectively that's a micro-coded op, separating out the actual scalar operation from the "copy-to-multiple".

which starts to get us into CISC territory as far as implementation is concerned.

let me think it through...

* result is produced
* then written to first dest (including CR)
* then a micro op "copy" splats it out (predicated).  including CR, here (arrrg)

if that is interrupted, it can be resumed at the copy phase as long as you can determine that the result was written.

that's going to be a pig, but it's doable.
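the resume-after-interrupt property above can be sketched as a two-phase model (hypothetical names; CR handling omitted): phase 0 writes the scalar result once, then the predicated copy micro-ops run, and a saved offset is all that is needed to restart the copy phase.

```python
# Hypothetical two-phase micro-coded splat.  `offs` plays the role of
# saved STATE: 0 means "result not yet written"; any nonzero value
# means "resume the copy phase from element offs".

def splat_op(regs, rd, value, pred, vl, offs=0):
    if offs == 0:
        regs[rd] = value          # phase 0: scalar op writes first dest
        offs = 1
    while offs < vl:              # phase 1..VL-1: predicated copy
        if pred & (1 << offs):
            regs[rd + offs] = regs[rd]
        offs += 1                 # an interrupt here saves offs, resumes later
    return regs

regs = [0, 0, 0, 0]
splat_op(regs, 0, 9, 0b1011, 4)   # -> [9, 9, 0, 9] (bit 2 masked out)
```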
 
spec updated btw.
Comment 4 Jacob Lifshay 2020-12-23 21:34:01 GMT
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > Note that splat will be very common in graphics code (I'd randomly guess
> > 10-20% of instructions, though a lot of those can be done by having a scalar
> > source on a vector op), so we will probably want to take the approach where
> > we have the one scalar ALU and just write to multiple destination registers.
> 
> this was the bit which was the "pain".  effectively that's a micro-coded op,
> separating out the actual scalar operation from the "copy-to-multiple".
> 
> which starts to get us into CISC territory as far as implementation is
> concerned.
> 
> let me think it through...
> 
> * result is produced
> * then written to first dest (including CR)
> * then a micro op "copy" splats it out (predicated).  including CR, here
> (arrrg)
> 
> if that is interrupted, it can be resumed at the copy phase as long as you
> can determine that the result was written.
> 
> that's going to be a pig, but it's doable.

wouldn't it work to have the scalar op just have a whole pile of dest regs in the dependency matrix, and the data path can just use all 4 reg-file write buses enabled simultaneously, allowing 4 writes per clock cycle? It doesn't matter if we push the scalar op through the scalar ALU for as many clock cycles as needed, we don't have to have the scalar alu be used just once.

All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x more cycles than needed.
Comment 5 Luke Kenneth Casson Leighton 2020-12-23 22:41:00 GMT
(In reply to Jacob Lifshay from comment #4)

> wouldn't it work to have the scalar op just have a whole pile of dest regs
> in the dependency matrix, and the data path can just use all 4 reg-file
> write buses enabled simultaneously, allowing 4 writes per clock cycle?

the DMs are so insanely large that i wanted to cut large holes in them by not having any lane-crossing entries.  this allows:

* every modulo-4 DM group to effectively have its own mini DM (4 of them: one for registers where reg % 4 == 0, and separate ones for remainders 1, 2 and 3)

* the top regfile numbers become 4 separate batches of 4R1W.  not insane 12R10W.
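the lane constraint above can be illustrated with a trivial hedged sketch (hypothetical helper): aligned vector operands stay within their own lane, but a splat from one scalar to consecutive vector elements touches every lane, which is exactly the lane-crossing case that forces the micro-coded approach.

```python
# Hypothetical modulo-4 lane model: with no lane-crossing DM entries,
# register r belongs to lane r % 4, so each lane only needs a small
# port set (e.g. 4R1W) instead of one monolithic 12R10W regfile.

def lane_of(reg):
    return reg % 4

# a vvv add with lane-aligned operands never crosses lanes:
rt, ra, rb = 8, 16, 24
assert lane_of(rt) == lane_of(ra) == lane_of(rb) == 0

# but splatting one scalar into four consecutive vector elements
# touches all four lanes -- the case that needs micro-coding:
assert sorted(lane_of(8 + i) for i in range(4)) == [0, 1, 2, 3]
```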

writing to multiple destinations is therefore nowhere near as easy as it sounds. "just" write to multiple destinations, when the output from MultiCompUnit is a single result?

clearly this does not work.

what *would* work is:

* under micro-coding the result is written into the first element
* subsequent micro-coded operations are a *mv* operation, using the "Whopping Great Shift Register FSM", the one wot has 12 incoming and 12 outgoing registers.

here, that can broadcast-splat the value across multiple lanes.

> It
> doesn't matter if we push the scalar op through the scalar ALU for as many
> clock cycles as needed, we don't have to have the scalar alu be used just
> once.
> 
> All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x
> more cycles than needed.

lane crossing is always going to be a pig.

the choices are:

* insane regfile porting
* insane crossbar routing
* cyclic shift registers with latency
* single bus with go-get-a-coffee latency
* s*** out of luck
Comment 6 Luke Kenneth Casson Leighton 2020-12-24 01:34:15 GMT
so i don't know if you recall the first discussion when i came up with the cyclic shift register idea: a conveyor belt attached between the regfiles and CompUnits, both read and write, where each slot has a piece of "state" which is the difference between the port number on the regfile and the operand number on the CompUnit.

each shift reduces that state by one, when it reaches zero the value is "delivered".  what is nice is: if there are not enough free write ports the result keeps cycling until one *is* available.  also this same trick can be used for operand forwarding (including to multiple units).
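the delivery-by-countdown idea can be sketched as a hedged Python model (hypothetical names, one function per clock tick): each slot carries a value plus a hop count, every shift decrements the count, delivery happens at zero, and if no write port is free the value goes round the ring again.

```python
# Hypothetical model of the cyclic "conveyor belt" shift register.
# Each slot is (value, hops); hops is the port/operand-number
# difference.  One call = one clock of shifting.

def conveyor_step(ring, port_free, delivered):
    new_ring = []
    for value, hops in ring:
        if hops == 0:
            if port_free(value):
                delivered.append(value)           # arrived at its port
            else:
                new_ring.append((value, len(ring)))  # no free port: go round again
        else:
            new_ring.append((value, hops - 1))    # one shift closer
    return new_ring

delivered = []
ring = [("r3", 2), ("r7", 0)]
always_free = lambda v: True
for _ in range(4):
    ring = conveyor_step(ring, always_free, delivered)
# r7 delivers on the first tick, r3 after its count reaches zero
```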

this is all on the scalar side: the vector side if kept simple should not need it, and the Monster FSM has its own SR.

here's the thing: one-to-many delivery *might* and i stress might - be possible to jam in to the cyclic register system as part of Operand Forwarding.

however to keep the design from going completely insane it really does need the micro-coding, splitting out the generation of the scalar op from the broadcast copy/mv.

and, if done properly, the OoO Engine should sort things out as far as parallelism is concerned.

certainly if the number of vsplat dest-writes exceeds the number of regfile ports nothing is going to help get good performance except if by chance operand forwarding kicks in.