    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          STATE.ssvoffs = 0; # reset
          return;
        }
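for reference, here is a minimal runnable Python model of that pseudocode (a sketch, not the spec itself): ireg is a flat integer regfile, int_vec maps a register number to an (isvector, regidx) pair, and VL/predval are passed in explicitly rather than read from STATE.

    def op_add(ireg, int_vec, rd, rs1, rs2, VL, predval):
        id_, irs1, irs2 = 0, 0, 0
        # remember which operands are vectors, then redirect to their vector base
        rd_v, rs1_v, rs2_v = int_vec[rd][0], int_vec[rs1][0], int_vec[rs2][0]
        rd  = int_vec[rd][1]  if rd_v  else rd
        rs1 = int_vec[rs1][1] if rs1_v else rs1
        rs2 = int_vec[rs2][1] if rs2_v else rs2
        for i in range(VL):
            if (predval >> i) & 1:            # predication: skip masked-out elements
                ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
                if not rd_v:                  # scalar destination: stop after one write
                    break
            # only vector operands step on to the next element
            if rd_v:  id_  += 1
            if rs1_v: irs1 += 1
            if rs2_v: irs2 += 1
            if id_ == VL or irs1 == VL or irs2 == VL:
                return                        # end of the VL hardware loop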
right. we need the following:

* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and s/v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and s/v)
* RT.s = RA.s RB.s

vvv: no problem, obvious. vvs and vsv: no problem, obvious. sss: also obvious.

slightly less obvious: RT.s = RA.v RB.v, which would take the first non-masked-out elements from the RA/RB vector sources, and RT.s = RA.v RB.s (and s/v), which is a variant of the above, taking the first non-masked-out vector element from one of the sources.

so with some thought it really is just this one non-obvious case:

    RT.v = RA.s RB.s

which i originally expected would stop at the first element, but it could be interpreted as scalar-result-splat.

question: what behaviour do we want? what *actual* behaviour?

* scalar-scalar (yep, covered)
* vector-vector (likewise)
* picker from vector into scalar (covered)
* scalar insert into vector (covered by a single bit of the predicate)

well, all of those are covered regardless of how the ambiguous case works, i.e. the "splat" capability, which can also be a masked-splat, is a superset of the single insert.

therefore, conclusion: good call, jacob, it's a good idea to have this. implementation-wise (high-performance-wise) it will be a pain, but we will manage, as i think it's worth it. i will make a note in the spec.
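a sketch of that case analysis in Python, with the ambiguous RT.v = RA.s RB.s case interpreted as a (masked) scalar-result splat. the function and parameter names are illustrative only; the add is a stand-in for any two-source op, and None stands in for "element left unchanged".

    def sv_add(ra, rb, rt_vec, ra_vec, rb_vec, VL, mask):
        """ra/rb: lists (vectors) or plain ints (scalars); returns list or int."""
        def elem(x, is_vec, i):
            return x[i] if is_vec else x       # a scalar source repeats for every element

        if rt_vec:
            # vector destination: loop over VL, skipping masked-out elements.
            # when both sources are scalar this degenerates to a masked splat
            # of the single scalar result.
            return [elem(ra, ra_vec, i) + elem(rb, rb_vec, i)
                    if (mask >> i) & 1 else None
                    for i in range(VL)]
        # scalar destination: take the first non-masked-out element of any
        # vector source, then stop.
        for i in range(VL):
            if (mask >> i) & 1:
                return elem(ra, ra_vec, i) + elem(rb, rb_vec, i)
        return None                            # fully masked out: no write at all

    # examples:
    # sv_add(5, 3, True, False, False, 4, 0b1111)        -> [8, 8, 8, 8]  (splat)
    # sv_add([1, 2, 3, 4], 10, False, True, False, 4, 0b1110) -> 12       (picker)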
Note that splat will be very common in graphics code (I'd randomly guess 10-20% of instructions, though a lot of those can be done by having a scalar source on a vector op), so we will probably want to take the approach where we have the one scalar ALU and just write to multiple destination registers.
(In reply to Jacob Lifshay from comment #2)
> Note that splat will be very common in graphics code (I'd randomly guess
> 10-20% of instructions, though a lot of those can be done by having a
> scalar source on a vector op), so we will probably want to take the
> approach where we have the one scalar ALU and just write to multiple
> destination registers.

this was the bit which was the "pain". effectively that's a micro-coded op, separating out the actual scalar operation from the "copy-to-multiple", which starts to get us into CISC territory as far as implementation is concerned.

let me think it through...

* the result is produced
* then written to the first dest (including CR)
* then a micro-op "copy" splats it out (predicated), including CR here (arrrg)

if that is interrupted, it can be resumed at the copy phase as long as you can determine that the result was written.

that's going to be a pig, but it's doable. spec updated btw.
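a sketch of that two-phase micro-coded splat, purely to illustrate the resume-at-copy idea (the state dict and names are made up for the example, and CR handling is glossed over):

    def scalar_op_splat(regs, compute, rd, VL, mask, state):
        # phase 1: the actual scalar ALU operation, done exactly once
        if "result" not in state:
            state["result"] = compute()
            state["offs"] = 0
        # phase 2: predicated copy micro-op, broadcasting into rd..rd+VL-1.
        # state["offs"] records progress, so an interrupted splat simply
        # resumes here once it is known that the result was written.
        for i in range(state["offs"], VL):
            state["offs"] = i
            if (mask >> i) & 1:
                regs[rd + i] = state["result"]
        state.clear()                          # splat complete, context discarded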
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > Note that splat will be very common in graphics code (I'd randomly guess
> > 10-20% of instructions, though a lot of those can be done by having a
> > scalar source on a vector op), so we will probably want to take the
> > approach where we have the one scalar ALU and just write to multiple
> > destination registers.
>
> this was the bit which was the "pain". effectively that's a micro-coded
> op, separating out the actual scalar operation from the "copy-to-multiple".
>
> which starts to get us into CISC territory as far as implementation is
> concerned.
>
> let me think it through...
>
> * result is produced
> * then written to first dest (including CR)
> * then a micro op "copy" splats it out (predicated). including CR, here
>   (arrrg)
>
> if that is interrupted, it can be resumed at the copy phase as long as you
> can determine that the result was written.
>
> that's going to be a pig, but it's doable.

wouldn't it work to have the scalar op just have a whole pile of dest regs in the dependency matrix, and the data path can just use all 4 reg-file write buses enabled simultaneously, allowing 4 writes per clock cycle?

It doesn't matter if we push the scalar op through the scalar ALU for as many clock cycles as needed, we don't have to have the scalar ALU be used just once. All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x more cycles than needed.
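a rough throughput comparison, just to put numbers on the 4x claim. this assumes 4 regfile write buses with one write each per cycle and ignores every other pipeline effect:

    import math

    def splat_cycles(VL, write_buses=4):
        # broadcast one scalar result: limited only by write-bus bandwidth
        return math.ceil(VL / write_buses)

    def per_element_cycles(VL):
        # one scalar ALU op issued per element: one result (and write) per cycle
        return VL

    # e.g. VL=8: splat_cycles(8) == 2, per_element_cycles(8) == 8 (the 4x factor)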
(In reply to Jacob Lifshay from comment #4)
> wouldn't it work to have the scalar op just have a whole pile of dest regs
> in the dependency matrix, and the data path can just use all 4 reg-file
> write buses enabled simultaneously, allowing 4 writes per clock cycle?

the DMs are so insanely large that i wanted to cut large holes in them by not having any lane-crossing entries. this allows:

* every modulo-4 DM group to effectively have its own mini DM (4 of them: one when regnum modulo 4 is 0, separate ones for 1, 2 and 3)
* the top regfile numbers to become 4 separate batches of 4R1W, not an insane 12R10W.

writing to multiple destinations is therefore nowhere near as easy as it sounds. "just" write to multiple destinations, when the output from MultiCompUnit is a single result? clearly this does not work.

what *would* work is:

* under micro-coding the result is written into the first element
* subsequent micro-coded operations are a *mv* operation, using the "Whopping Great Shift Register FSM", the one that has 12 incoming and 12 outgoing registers. here, that can broadcast-splat the value across multiple lanes.

> It doesn't matter if we push the scalar op through the scalar ALU for as
> many clock cycles as needed, we don't have to have the scalar alu be used
> just once.
>
> All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x
> more cycles than needed.

lane crossing is always going to be a pig. the choices are:

* insane regfile porting
* insane crossbar routing
* cyclic shift registers with latency
* single bus with go-get-a-coffee latency
* s*** out of luck
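a tiny sketch of why the splat crosses lanes, assuming (as above) that a register belongs to lane (regnum % 4) and can only be written by that lane's single write port:

    # with no lane-crossing DM entries, each register lives in lane (regnum % 4)

    def lanes_touched(rd, VL):
        return {(rd + i) % 4 for i in range(VL)}

    # a splat to r16..r19 (VL=4) touches lanes {0, 1, 2, 3}: the single result
    # sitting in one lane's MultiCompUnit cannot reach the other three lanes
    # directly, hence the broadcast mv via the shift-register FSM.
    print(lanes_touched(16, 4))   # -> {0, 1, 2, 3}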
so i don't know if you recall the first discussion when i came up with the cyclic shift register idea: a conveyor belt attached between the regfiles and CompUnits, both read and write, where each entry carries a piece of "state" which is the difference between the port number on the regfile and the operand number on the CompUnit. each shift reduces that state by one; when it reaches zero the value is "delivered".

what is nice is: if there are not enough free write ports the result keeps cycling until one *is* available. also, this same trick can be used for operand forwarding (including to multiple units).

this is all on the scalar side: the vector side, if kept simple, should not need it, and the Monster FSM has its own SR.

here's the thing: one-to-many delivery *might* - and i stress might - be possible to jam in to the cyclic register system as part of Operand Forwarding. however, to keep the design from going completely insane it really does need the micro-coding, splitting out the generation of the scalar op from the broadcast copy/mv.

and, if done properly, the OoO Engine should sort things out as far as parallelism is concerned. certainly if the number of vsplat dest-writes exceeds the number of regfile ports, nothing is going to help get good performance except if, by chance, operand forwarding kicks in.
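a behavioural sketch of that conveyor belt, assuming a fixed number of slots and a caller-supplied port_free predicate (all names are illustrative): each result rides the belt with a countdown, every shift decrements it, and at zero it is delivered unless its write port is busy, in which case it goes round again.

    from collections import deque

    def conveyor(results, nslots, port_free):
        """results: list of (value, distance_to_port, port).
        port_free(port, cycle) -> True if that write port is free that cycle."""
        belt = deque([None] * nslots)
        pending = list(results)
        delivered = []
        cycle = 0
        while pending or any(belt):
            # load waiting results into empty slots
            for i, slot in enumerate(belt):
                if slot is None and pending:
                    belt[i] = pending.pop(0)
            # shift the belt by one slot, then decrement each entry's state
            belt.rotate(1)
            for i, slot in enumerate(belt):
                if slot is None:
                    continue
                value, state, port = slot
                state -= 1
                if state == 0:
                    if port_free(port, cycle):
                        delivered.append((cycle, port, value))   # value reaches its port
                        belt[i] = None
                    else:
                        belt[i] = (value, nslots, port)          # go round once more
                else:
                    belt[i] = (value, state, port)
            cycle += 1
        return delivered

    # example: two results bound for port 0, which is busy only on cycle 1;
    # the second result recirculates once before it is delivered.
    # conveyor([(7, 2, 0), (9, 3, 0)], 4, lambda p, c: c != 1)
    #   -> [(2, 0, 9), (5, 0, 7)]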