From discussion with Luke: an actual pipelined ALU, that expands on the alu_hier.py concept and adds more functions - however it also needs to use PartitionedSignal. this can default to "all off" initially. Additionally, make certain to use the ReservationStations base class and test infrastructure (runfp) exactly as i described in here: https://bugs.libre-soc.org/show_bug.cgi?id=208#c55 *all* ALUs need to have that common interface (multiple inputs-outputs). if you can, use runfp - if there's any assumptions about the format being FP numbers, add an extra parameter which does debug printing as hexadecimal. i *think* it should do that anyway because i used the same runfp test code for FP-to-INT and INT-to-FP
http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006371.html
From email: I noticed alu_hier was nicely broken up into different function units (adder, branch unit, multiplier, etc.). Is there a nice way you know of to do that with the nmutil pipeline stages? Like a way to merge stage 1 of the adder pipe with stage one of the branch pipe to get the stage 1 of the alu?
Luke's Reply: On Fri, May 8, 2020 at 3:03 PM Michael N <mtnolan2640@gmail.com> wrote: >> On May 8, 2020, at 9:28 AM, Michael Nolan <mtnolan2640@gmail.com> wrote: >> >> Now that I have branches working in the simulator, I was thinking of starting work on the integer ALU. Taking a look at alu_hier.py, I think I'll keep the interface but replace the implementation with a pipelined one. >> >> --Michael >> > I noticed alu_hier was nicely broken up into different function units (adder, branch unit, multiplier, etc.). yes, with an "op" - which in our case will basically be "decoder.InternalOp" > Is there a nice way you know of to do that with the nmutil pipeline stages? yyyess.... except it's too complicated to explain right just this minute. raise the question in the bugreport. > Like a way to merge stage 1 of the adder pipe with stage one of the branch pipe to get the stage 1 of the alu? ok if you are talking about *outside* of ReservationStations, you are mixing "purposes". the Branch pipeline *does not* interact - at all - with the ALU pipeline. can you look again at Mitch's ScoreBoard Mechanics1.pdf damnit this needs to be on the bugreport, i haven't got time, 10m until meeting see page 8, diagram at the top. Those 4 batches of Function Units are *completely separate*. they do *NOT* interact. basically, the first ALU, do it entirely as combinatorial just as in alu_hier.py and assume only one clock cycle. leave out MUL and DIV because those i believe need to be minimum 2 stages for MUL and a *lot* for DIV. actually you know what, screw it: make separate ones for MUL and DIV, pretty much exactly like in that diagram on p8. we're doing a radically different design as a proof-of-concept. * add sub neg compare logical shift all go in their own Reservation-Station-style FunctionUnit * MUL in its own separate RS/FU * DIV likewise l.
Created attachment 52 [details] scoreboard mechanics image michael can you drop this onto the wiki scoreboard page
(In reply to Michael Nolan from comment #3) > > yes, with an "op" - which in our case will basically be "decoder.InternalOp" > > > damnit this needs to be on the bugreport, i haven't got time, 10m until > meeting > > see page 8, diagram at the top. > > Those 4 batches of Function Units are *completely separate*. they do > *NOT* interact. Oh ok. So essentially we will have *separate* ports for the addition pipe, the multiplication pipe, the branch pipe, and instructions can be issued to each in parallel. > actually you know what, screw it: make separate ones for MUL and DIV, > pretty much exactly like in that diagram on p8. we're doing a > radically different design as a proof-of-concept. > > * add sub neg compare logical shift all go in their own > Reservation-Station-style FunctionUnit > * MUL in its own separate RS/FU > * DIV likewise yeah looks like it
src/soc/alu/pipe_data.py will need to pass in everything and pass out everything. carry-in as well as carry-out. the pipelines have to take *everything* that they need to produce a result: they are not permitted to access global resources such as regfiles. this is why we have different FunctionUnits with totally different specs() because for Branch it will take CTR and LR *as input*, where for ALU it will take CR0 and a and b and so on and for the MUL FU it will take about *four* possibly 5 inputs: RA RB RC (for MAC) *and* carry-in *and* CR0 etc. etc.
(In reply to Michael Nolan from comment #5) > > Those 4 batches of Function Units are *completely separate*. they do > > *NOT* interact. > Oh ok. So essentially we will have *separate* ports for the addition pipe, > the multiplication pipe, the branch pipe, and instructions can be issued to > each in parallel. yeeees :) now you're getting it. the DMs preserve the register-usage order and manage the regfile Buses. the only "stalling" that's done is when the Issue stage cannot find a free ReservationStation row to reserve for the instruction. the output from the pipeline goes into the RS row "output" latch and sits there until the FunctionUnit says "DMs say we have a free and clear path to write to the regfile: go... like... go NOW". the more regfile ports you have, the more results get written in parallel. likewise for regfile reads for getting the operands *into* the RS Row(s).
(In reply to Luke Kenneth Casson Leighton from comment #4) > Created attachment 52 [details] > scoreboard mechanics image > > michael can you drop this onto the wiki scoreboard page thank you. meeting still not happened yet, waiting for a call, did a quick section and also added the corresponding write diagram https://libre-soc.org/3d_gpu/architecture/6600scoreboard/?updated so just to emphasise, michael: for ALUInitialData it will need: +class ALUInitialData(IntegerData): + def __init__(self, pspec): + super().__init__(pspec) + self.a = Signal(64, reset_less=True) + self.b = Signal(64, reset_less=True) plus XER carry-in as an input operand and also the opcode itself. although... that's going to be in ctx, as opkls? hmm... see separate comment and a *separate* output class (not the same as ALUInitialData) will be needed with: * output result (RT) * CA CA32 * CR0 * SO OV OV32
there's a bit of a nuisance... no there isn't, you're good. FPPipeContext contains the muxid (which the ReservationStation uses as an array index). you got it right by passing in CompAlUOpSubset as the opkls.
Does it make sense to have the ALU inputs a and b be PartitionedSignals?
(In reply to Michael Nolan from comment #10) > Does it make sense to have the ALU inputs a and b be PartitionedSignals? later. not right now. if we have time we convert them to PartitionedSignals.
(In reply to Luke Kenneth Casson Leighton from comment #11) > (In reply to Michael Nolan from comment #10) > > Does it make sense to have the ALU inputs a and b be PartitionedSignals? > > later. not right now. if we have time we convert them to > PartitionedSignals. i know :) we did all that work and we need to focus first on a scalar version. it means not needing to fuss about parallel-masked decisions and so on, which i am expecting shouuuld make development time faster.
this: # If there's an immediate, set the B operand to that with m.If(self.i.ctx.op.imm_data.imm_ok): comb += self.o.b.eq(self.i.ctx.op.imm_data.imm) with m.Else(): comb += self.o.b.eq(self.i.b) ...is already done by the Computation Unit: # select immediate if opcode says so. however also change the latch # to trigger *from* the opcode latch instead. op_is_imm = oper_r.imm_data.imm_ok src2_or_imm = Signal(self.rwid, reset_less=True) src_sel = Signal(reset_less=True) m.d.comb += src_sel.eq(Mux(op_is_imm, opc_l.q, src_l.q[1])) m.d.comb += src2_or_imm.eq(Mux(op_is_imm, oper_r.imm_data.imm, self.src2_i)) there is *not* supposed to be a relationship between "register numbers" and "Function Unit Operands". Function Units are not permitted to know about registers, at all (however by a coincidence we happen to be preserving the order, RA = operand1, RB = operand2 etc). however... i quite like that it's possible to test against the actual decoded instructions, and that makes me think that maybe yes, the FunctionUnit should *not* be decoding the immediate, just pass it through. otherwise to achieve the same unit test coverage it would be necessary to have the immediate decoding in the unit test. i will go ahead and modify the compalu_multi.py to remove immediate decoding/selection from here: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;h=f9cb83ea13184bb6ff9ceedb7627b7ea67295c55;hb=HEAD#l166 ---- here, it receives data in ALUInputData format (which is correct) class ALUInputStage(PipeModBase): def __init__(self, pspec): super().__init__(pspec, "input") def ispec(self): return ALUInputData(self.pspec) def ospec(self): return ALUInputData(self.pspec) however the output spec is a *modified* copy of the same data structure. with "carry_in" being *generated* by ALUInputStage, ALUInputData should *not* have "carry_in" as a field. the reason is because it will never be set by the *user* of this stage, and consequently a dangling wire will end up being created. i've created a class ALUFirstInputData (bad name i know) which should be used instead. ----- what's "so" intended for?
(In reply to Luke Kenneth Casson Leighton from comment #13) > with "carry_in" being *generated* by ALUInputStage, ALUInputData should > *not* have "carry_in" as a field. We still need a carry_in from XER for things like add with carry. > the reason is because it will never be set by the *user* of this stage, > and consequently a dangling wire will end up being created. > ----- > > what's "so" intended for? Summary Overflow.
(In reply to Michael Nolan from comment #14) > (In reply to Luke Kenneth Casson Leighton from comment #13) > > > with "carry_in" being *generated* by ALUInputStage, ALUInputData should > > *not* have "carry_in" as a field. > > We still need a carry_in from XER for things like add with carry. ah doh, yes you are absolutely right. hmm hmmm sooo.... actually.... the default action from the if elif batch is to copy that carry_in from the input spec over to the output spec. > > > the reason is because it will never be set by the *user* of this stage, > > and consequently a dangling wire will end up being created. > > ----- > > > > what's "so" intended for? > > Summary Overflow. okaay. so if it's a generated output, set only as a result as part of the operation (in the 2nd or 3rd stage), it should *not* be in the ispec(), only in the ospec() for later stage(s) basically: * if something is needed in the stage it must be added to the ispec() * if something is produced *by* the stage ot must be added to the ospec() * if something is needed by later stages, the only way they will get it is by "passing through" the current stage and consequently it has to be added to *both* ispec and ospec. * if a final output is generated by an early stage, then clearly the only way to get it all the way to the ends is, again, by "passing through" to later stages and consequently putting it into both ispec and ospec. summary-out sounds like it is output-only (3rd stage?). carry-out likewise (from 2nd stage) carry-in you are right is needed as input *and* output on *this* (first) stage, where in later stages it will not be. a and b and op are all "passthru" for this 1st stage. however a and b are input only or the 2nd stage, but you *might* need some of the information from op for the 3rd stage. welcome to pipelines :)
(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Michael Nolan from comment #14) > > (In reply to Luke Kenneth Casson Leighton from comment #13) > > > > > with "carry_in" being *generated* by ALUInputStage, ALUInputData should > > > *not* have "carry_in" as a field. > > > > We still need a carry_in from XER for things like add with carry. > > ah doh, yes you are absolutely right. > > hmm hmmm sooo.... > > actually.... the default action from the if elif batch is to copy that > carry_in from the input spec over to the output spec. > > > > > > > the reason is because it will never be set by the *user* of this stage, > > > and consequently a dangling wire will end up being created. > > > ----- > > > > > > what's "so" intended for? > > > > Summary Overflow. > > okaay. so if it's a generated output, set only as a result as part of the > operation (in the 2nd or 3rd stage), it should *not* be in the ispec(), only > in the ospec() for later stage(s) Summary Overflow is "sticky". Once one instruction sets the overflow bit, summary overflow gets set until it's cleared by a mtspr instruction. Effectively for each instruction it's (so = so | ov). That's why it's included in the input stage.
(In reply to Michael Nolan from comment #14) > We still need a carry_in from XER for things like add with carry. just to check, XER is a Condition Register, right? we may almost certainly need to split out the *individual* fields of XER and other Condition Registers into actual separate "registers", even though they are one or two bits. the reason is because if we do not it will create artificial dependency barriers between instructions that read or write some part of XER but not another. therefore all of CR0-CR7 need to be *different* "operands", likewise OV32, OV and so on.
(In reply to Michael Nolan from comment #16) > Summary Overflow is "sticky". Once one instruction sets the overflow bit, > summary overflow gets set until it's cleared by a mtspr instruction. > Effectively for each instruction it's (so = so | ov). That's why it's > included in the input stage. ohhh. yep. got it now. can you drop a comment into the data structure to that effect? the idea being that "why is this here" should be answered (or, a reminder of the answer)
ok as it's "pass-through", i added so (and a comment) to main_stage. also added some similar comments (and pass-through of so) on input_stage. diff --git a/src/soc/alu/main_stage.py b/src/soc/alu/main_stage.py index e801c96..07fb018 100644 --- a/src/soc/alu/main_stage.py +++ b/src/soc/alu/main_stage.py @@ -39,7 +39,9 @@ class ALUMainStage(PipeModBase): with m.Case(InternalOp.OP_XOR): comb += self.o.o.eq(self.i.a ^ self.i.b) + ###### sticky overflow and context, both pass-through ##### + comb += so.eq(self.i.so) comb += self.o.ctx.eq(self.i.ctx) return m about this: class ALUOutputStage(PipeModBase): def __init__(self, pspec): super().__init__(pspec, "output") def ispec(self): return ALUOutputData(self.pspec) def ospec(self): return ALUOutputData(self.pspec) if ALUOutputSpec is designed to generate ov, ov32, cr0 in *this* stage (final one), then an ALUIntermediateData is needed which does not have ov, ov32 or cr0 (etc.) in it.
(In reply to Luke Kenneth Casson Leighton from comment #19) > stage (final one), then an ALUIntermediateData is needed p.s. feel free to pick a better name :)
hiya michael i deployed a trick that was used in the PartitionedAdder. rather than use 2 separate adds, A and B are shifted up by 1 bit, then you put a 1 into A[0] and the carry into B[0]. that way you do a 66 bit add, discard bit 0, result is bits 1-64, and carry is the MSB.
are you studying this btw? https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl
that's interesting. what's the popcounts all about? https://github.com/antonblanchard/microwatt/blob/master/logical.vhdl
https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl line 563 is where sraw, srad, srawi and sradi exist. there's a *lot* of code-duplication, which is fairly normal in "traditional" HDLs because parameterisation can be harder to do.
(In reply to Luke Kenneth Casson Leighton from comment #24) > https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl > > line 563 is where sraw, srad, srawi and sradi exist. there's a *lot* > of code-duplication, which is fairly normal in "traditional" HDLs because > parameterisation can be harder to do. Those are already done in main_stage.py, though they could do with a bit more rigorous testing.
I just realized I'm going to run into some difficulty with rlwinm, as the values controlling the mask amounts (MB and ME) aren't exposed by Decode2ToExecute1. Should they be added, or is there another path you think we should take for things like this (there are almost certainly other instructions with fields like that).
(In reply to Michael Nolan from comment #26) > I just realized I'm going to run into some difficulty with rlwinm, as the > values controlling the mask amounts (MB and ME) aren't exposed by > Decode2ToExecute1. Should they be added, or is there another path you think > we should take for things like this (there are almost certainly other > instructions with fields like that). hmm hmm don't know. microwatt throws the actual instruction opcode around, as a global, which is fine for single issue for a large scale multi issue processor with pipelines, not so much. option 1 * pass the undecoded insn via Execute1Tyoe and add the POWER decoder "field selector" to pspec. this will get the bitfields and forms etc option 2 * add mb and me and slowly more and more copies of fields over time. honestly we will probably have to do (1) sooner rather than later although (2) is the easier option. any other ideas?
(In reply to Michael Nolan from comment #25) > (In reply to Luke Kenneth Casson Leighton from comment #24) > > https://github.com/antonblanchard/microwatt/blob/master/ppc_fx_insns.vhdl > > > > line 563 is where sraw, srad, srawi and sradi exist. there's a *lot* > > of code-duplication, which is fairly normal in "traditional" HDLs because > > parameterisation can be harder to do. > > Those are already done in main_stage.py, nice. > though they could do with a bit > more rigorous testing. i noticed an assert on bits in SelectableInt not being happy. cmp by the way apologies for moving the add into the switch statement. don't do trap in this Function Unit, it changes the PC so i think really should be in a special FunctionUnit which outputs a new PC, alongside Branch etc.
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/alu/main_stage.py;hb=HEAD#l137 i'm fairly certain that this: 136 with m.Case(InternalOp.OP_RLC): 137 with m.If(self.i.ctx.op.imm_data.imm_ok): 138 comb += rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5]) 139 with m.Else(): 140 comb += rotate_amt.eq(self.i.b[0:5]) can just be this: 136 with m.Case(InternalOp.OP_RLC): 137 comb += rotate_amt.eq(self.i.b[0:5]) because in input_stage.py there is this: 40 # If there's an immediate, set the B operand to that 41 with m.If(self.i.ctx.op.imm_data.imm_ok): 42 comb += self.o.b.eq(self.i.ctx.op.imm_data.imm) 43 with m.Else(): 44 comb += self.o.b.eq(self.i.b)
(In reply to Luke Kenneth Casson Leighton from comment #29) > https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/alu/main_stage.py; > hb=HEAD#l137 > > i'm fairly certain that this: > > 136 with m.Case(InternalOp.OP_RLC): > 137 with m.If(self.i.ctx.op.imm_data.imm_ok): > 138 comb += > rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5]) > 139 with m.Else(): > 140 comb += rotate_amt.eq(self.i.b[0:5]) > > can just be this: > > 136 with m.Case(InternalOp.OP_RLC): > 137 comb += rotate_amt.eq(self.i.b[0:5]) > > because in input_stage.py there is this: > > 40 # If there's an immediate, set the B operand to that > 41 with m.If(self.i.ctx.op.imm_data.imm_ok): > 42 comb += self.o.b.eq(self.i.ctx.op.imm_data.imm) > 43 with m.Else(): > 44 comb += self.o.b.eq(self.i.b) The issue is the `rlwimi` instruction, which in addition to RS and an immediate also reads in RA. Because of this, I disabled putting the immediate into operand b in the input stage of the ALU for the RLC opcode
these would be better (again) as (sigh) add(x,y) etc. sub(63, n) rather than 63-n also if the use of EXTZ64 *really* is necessary, that's actually a bug in the spec rather than something that we should arbitrarily fix, and it should be reported. @@ -168,9 +168,9 @@ @inject() def op_slw(self, RB, RS): n = RB[59:64] - r = ROTL32(RS[32:64], n) + r = ROTL32(EXTZ64(RS[32:64]), n.value) if eq(RB[58], 0): - m = MASK(32, 63 - n) + m = MASK(32, 63 - n.value) else: m = concat(0, repeat=64) RA = r & m
(In reply to Michael Nolan from comment #30) > The issue is the `rlwimi` instruction, which in addition to RS and an > immediate also reads in RA. Because of this, I disabled putting the > immediate into operand b in the input stage of the ALU for the RLC opcode interesting. am taking a closer look. it doesn't look that different.
i haven't committed it (i leave it to you to evaluate) this passes the tests nosetests3 decoder/isa/test_caller.py register RA is read in - however it's not involved in the datapath for RB. here's the pseudocode * rlwinm RA,RS,SH,MB,ME (Rc=0) * rlwinm. RA,RS,SH,MB,ME (Rc=1) n <- SH r <- ROTL32((RS)[32:63], n) m <- MASK(MB+32, ME+32) RA <- r & m * rlwnm RA,RS,RB,MB,ME (Rc=0) * rlwnm. RA,RS,RB,MB,ME (Rc=1) n <- (RB)[59:63] r <- ROTL32((RS)[32:63], n) m <- MASK(MB+32, ME+32) RA <- r & m the only difference is in n both are 5 bit, SH should be coming from the immediate-decoding, (RB)[59:63] means get the 5 LSBs of RB. i believe we're ok making the change below. it doesn't make sense that IBM would design something that needs a special exception path for the immediate, just for one instruction. lkcl@fizzy:~/src/libreriscv/soc/src/soc$ git diff diff --git a/src/soc/alu/input_stage.py b/src/soc/alu/input_stage.py index 389954b..2dda52a 100644 --- a/src/soc/alu/input_stage.py +++ b/src/soc/alu/input_stage.py @@ -39,8 +39,7 @@ class ALUInputStage(PipeModBase): ##### operand B ##### # If there's an immediate, set the B operand to that - with m.If(self.i.ctx.op.imm_data.imm_ok & - ~(self.i.ctx.op.insn_type == InternalOp.OP_RLC)): + with m.If(self.i.ctx.op.imm_data.imm_ok): comb += self.o.b.eq(self.i.ctx.op.imm_data.imm) with m.Else(): comb += self.o.b.eq(self.i.b) diff --git a/src/soc/alu/main_stage.py b/src/soc/alu/main_stage.py index 59c896a..173ca78 100644 --- a/src/soc/alu/main_stage.py +++ b/src/soc/alu/main_stage.py @@ -134,10 +134,7 @@ class ALUMainStage(PipeModBase): comb += self.o.o.eq(rotl_out & mask) with m.Case(InternalOp.OP_RLC): - with m.If(self.i.ctx.op.imm_data.imm_ok): - comb += rotate_amt.eq(self.i.ctx.op.imm_data.imm[0:5]) - with m.Else(): - comb += rotate_amt.eq(self.i.b[0:5]) + comb += rotate_amt.eq(self.i.b[0:5]) comb += maskgen.mb.eq(mb+32) comb += maskgen.me.eq(me+32) comb += mask.eq(maskgen.o)
https://libre-soc.org/openpower/isatables/ hmmm, anton's team has synthesised the commonality between the different rotates and shifts https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl back in execute1.vhdl they do this: right_shift <= '1' when e_in.insn_type = OP_SHR else '0'; rot_clear_left <= '1' when e_in.insn_type = OP_RLC or e_in.insn_type = OP_RLCL else '0'; rot_clear_right <= '1' when e_in.insn_type = OP_RLC or e_in.insn_type = OP_RLCR else '0';
i can do a brain-dead translation of this code to nmigen, tomorrow, if you're good with that? (i quite like doing no-brain work...) https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl
(In reply to Luke Kenneth Casson Leighton from comment #35) > i can do a brain-dead translation of this code to nmigen, tomorrow, if you're > good with that? (i quite like doing no-brain work...) > https://github.com/antonblanchard/microwatt/blob/master/rotator.vhdl okok i just did it anyway :) https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=0604fe1e0b040def2e3f599fc523340eeeefe519 i cut out the hand-coded ROTL and replaced it with... ROTL.
(In reply to Luke Kenneth Casson Leighton from comment #33) > i haven't committed it (i leave it to you to evaluate) > this passes the tests nosetests3 decoder/isa/test_caller.py > > register RA is read in - however it's not involved in the datapath for RB. > > here's the pseudocode > > * rlwinm RA,RS,SH,MB,ME (Rc=0) > * rlwinm. RA,RS,SH,MB,ME (Rc=1) > > n <- SH > r <- ROTL32((RS)[32:63], n) > m <- MASK(MB+32, ME+32) > RA <- r & m > > * rlwnm RA,RS,RB,MB,ME (Rc=0) > * rlwnm. RA,RS,RB,MB,ME (Rc=1) > > n <- (RB)[59:63] > r <- ROTL32((RS)[32:63], n) > m <- MASK(MB+32, ME+32) > RA <- r & m > > the only difference is in n > > both are 5 bit, SH should be coming from the immediate-decoding, (RB)[59:63] > means get the 5 LSBs of RB. > > i believe we're ok making the change below. it doesn't make sense that IBM > would design something that needs a special exception path for the immediate, > just for one instruction. Look at rlwimi: n <- SH r <- ROTL32(RS[32:63], n) m <- MASK(MB+32, ME+32) RA <- r & m | (RA) & ~m It definitely does require reading in RS, RA, and SH
Ugh. I added support to the alu testbench for reading the registers specified in the instruction, instead of hardcoding them to 1, 2, and 3. I have it read the register select ports in the decoder in the order [3, 1, 2], and the first 2 with valid data end up in the a and b inputs of the ALU. I'm a bit annoyed about the weird order, but everything seems to work
(In reply to Michael Nolan from comment #37) > Look at rlwimi: > > n <- SH > r <- ROTL32(RS[32:63], n) > m <- MASK(MB+32, ME+32) > RA <- r & m | (RA) & ~m > > It definitely does require reading in RS, RA, and SH that's ok - yes it reads RS, RA and SH. what it doesn't do is need to switch off RB imm decoding. at least, let me put it this way: the unit test passes fine with that patch.
(In reply to Michael Nolan from comment #38) > Ugh. I added support to the alu testbench for reading the registers > specified in the instruction, instead of hardcoding them to 1, 2, and 3. I > have it read the register select ports in the decoder in the order [3, 1, > 2], and the first 2 with valid data end up in the a and b inputs of the ALU. > I'm a bit annoyed about the weird order, but everything seems to work hmmm hmmm hang on hang on - this is removing the order. if there's 3 regs, then 3 regs are needed. what's needed is like this: inputs = [] reg3_ok = yield dec2.e.read_reg3.ok reg3_sel = 0 if reg3_ok: reg3_sel = yield dec2.e.read_reg3.data reg3_sel = (sim.gpr(reg3_sel).value inputs.append(reg3_sel, reg3_ok) ... ... followed by: yield alu.p.data_i.a.eq(input[0][0]) # RA yield alu.p.data_i.b.eq(input[1][0]) # RB yield alu.p.data_i.c.eq(input[2][0]) # RC def set_alu_inputs(alu, dec2, sim): inputs = [] reg3_ok = yield dec2.e.read_reg3.ok if reg3_ok: reg3_sel = yield dec2.e.read_reg3.data inputs.append(sim.gpr(reg3_sel).value) reg1_ok = yield dec2.e.read_reg1.ok if reg1_ok: reg1_sel = yield dec2.e.read_reg1.data inputs.append(sim.gpr(reg1_sel).value) reg2_ok = yield dec2.e.read_reg2.ok if reg2_ok: reg2_sel = yield dec2.e.read_reg2.data inputs.append(sim.gpr(reg2_sel).value) print(inputs) if len(inputs) == 0: yield alu.p.data_i.a.eq(0) yield alu.p.data_i.b.eq(0) if len(inputs) == 1: yield alu.p.data_i.a.eq(inputs[0]) yield alu.p.data_i.b.eq(0) if len(inputs) == 2: yield alu.p.data_i.a.eq(inputs[0]) yield alu.p.data_i.b.eq(inputs[1])
ok - *sigh* - we _could_ define it this way however if you look in the decode1/2.vhdl you'll see that anton's team defined RS == RC. (reg3_data). https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L124 output reg is defined as RA and RT (yes we need a 2nd one) https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L135 in this way i believe RB *always* ends up as "the register that can be replaced by an immediate at the input side" https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L79 you see what i mean? so it's because reg positions are lost that RA suddenly receives reg data that was supposed to be in RB
(In reply to Luke Kenneth Casson Leighton from comment #40) > hmmm hmmm hang on hang on - this is removing the order. if there's 3 regs, > then 3 regs are needed. > > > what's needed is like this: > <snip>...</snip> > yield alu.p.data_i.a.eq(input[0][0]) # RA > yield alu.p.data_i.b.eq(input[1][0]) # RB > yield alu.p.data_i.c.eq(input[2][0]) # RC > I went through the tables and double checked, the *only* ALU instructions that have 3 inputs are 2 registers and an immediate. It does not *need* 3 register inputs. However, some piece of hardware needs to order the inputs, I guess the trade off is whether to do it in the alu or somewhere else.
(In reply to Michael Nolan from comment #42) > I went through the tables and double checked, the *only* ALU instructions > that have 3 inputs are 2 registers and an immediate. It does not *need* 3 > register inputs. yehyeh. ok. really... normally, what would happen is that anything that's significantly different like this would get put into its own Function Unit (into its own ALU Pipeline). then, there's no "switching about at the ALU side" > However, some piece of hardware needs to order the inputs, I guess the trade > off is whether to do it in the alu or somewhere else. i think one of the reasons why processors go with a convention "B is reg or immediate" is to save wires. you've got 64 bits worth of wires for B, B is not used all the time, so when it's not, use it to farm in the "immediate" data. so i think we'll find... 1 sec let's look at the original decode2.vhdl... ah yes: https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L89 ok so that's definitely putting the *actual* immediate data into reg b. decode_input_reg_b() is where the decision is made to either put the regfile data into the return result or put the immediate in it. so given that we're using microwatt's "decoder" we really need to follow the same conventions. and here's where the execution has *three* inputs: https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl#L61 a_in, b_in, c_in. so my feeling is we: * have the ALU with the same 3 inputs (even though we know only 2 of them are used), and keep the same order * where the immediate is availabe assume it is always dropped into the b_in 64-bit wires thus: * remove the MUX from ALUInputStage (that means in the unit tests it has to be done there) with m.If(self.i.ctx.op.imm_data.imm_ok) comb += self.o.b.eq(self.i.ctx.op.imm_data.imm) with m.Else(): comb += self.o.b.eq(self.i.b) so, in set_alu_inputs it would be: inputs = [] # C (or A?) reg3_ok = yield dec2.e.read_reg3.ok reg3_sel = 0 if reg3_ok: reg3_sel = yield dec2.e.read_reg3.data reg3_sel = (sim.gpr(reg3_sel).value inputs.append(reg3_sel, reg3_ok) # B (or imm) with m.If(self.i.ctx.op.imm_data.imm_ok) reg2_sel = yield self.i.ctx.op.imm_data.imm) with m.Else(): reg2_ok = yield dec2.e.read_reg2.ok reg2_sel = 0 if reg2_ok: reg2_sel = yield dec2.e.read_reg2.data reg2_sel = (sim.gpr(reg2_sel).value inputs.append(reg2_sel, reg2_ok) # A (or C?) and then still have 3 inputs, and in this way we save 64 bits worth of wires because B is used to input the immediate. one other key reason for doing it this way is because the FunctionUnit front-end is where the decisions are made about reading from the regfile (and acknowleding them). so it has to decode the Operand, there, anyway, because it must not even send a *request* for Register B if it does not actually need it.
see this (yes it's LD/ST not an ALU) https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg you see how Op2 (RB) is selected through a mux, based on the Imm/Opcode? the ALU (the weird "boat" symbol) doesn't get to make the decision, it gets "told". however if you follow the "imm_ok" wire from the Imm/Opcode latch, you'll see it PREVENTS REQ_RD2 from even being activated, when "imm_ok" is true. this is the primary purpose (role) of the Computation Units: make requests for Register File Ports *that are needed*.
Ran 1 test in 1.662s that's a little bit better :)
(In reply to Luke Kenneth Casson Leighton from comment #43) > (In reply to Michael Nolan from comment #42) > > > I went through the tables and double checked, the *only* ALU instructions > > that have 3 inputs are 2 registers and an immediate. It does not *need* 3 > > register inputs. > > yehyeh. ok. > > really... normally, what would happen is that anything that's significantly > different like this would get put into its own Function Unit (into its own > ALU Pipeline). https://libre-soc.org/openpower/isa/fixedshift/ ... actually... you know what? all of those are M-Form and MD-Form whereas, quickly analysing these: https://libre-soc.org/openpower/isa/fixedarith/ they're D-Form, DX-Form, XO-Form, and XO-Form. some of the weird ones (addex) are Z23-Form, and the MAC ones are VA-Form on balance therefore it kinda makes sense to split fixedshift out into its own separate ALU / pipeline. aside from anything, it may turn out to be the case that whilst add, sub etc. can be done in a single clock, shift has a strong possibility of needing 2 (in really high speed designs). that would be easier to do if they were separate pipelines.
ok let's put the forms side-by-side, from fields.txt #1.6.4 D-FORM | OPCD | RT | RA| D | # V3.0B 1.6.6 DX-FORM | OPCD| RT| d1| d0| XO|d2 # 1.6.16 XO-FORM | OPCD | RT| RA| RB |OE | XO |Rc | and the shift ones are: # 1.6.18 M-FORM | OPCD | RS | RA | RB | MB | ME |Rc| # 1.6.19 MD-FORM | OPCD | RS | RA | sh | mb |XO|sh|Rc| # 1.6.7 X-FORM | OPCD | RT | RA | /// | XO | / | so... well... i'm not actually seeing any 3-operand-input forms, there. they're all 2-operand or 1-operand-plus-immediate. note: 3-operand *input* not 3-operand *total including output operand*. so i'm now wondering what on earth is going on, because we shouldn't need to be reading reg3 at all. any clues?
(In reply to Luke Kenneth Casson Leighton from comment #47) > ok let's put the forms side-by-side, from fields.txt > > #1.6.4 D-FORM > | OPCD | RT | RA| D | > > # V3.0B 1.6.6 DX-FORM > | OPCD| RT| d1| d0| XO|d2 > > # 1.6.16 XO-FORM > | OPCD | RT| RA| RB |OE | XO |Rc | > > > and the shift ones are: > > # 1.6.18 M-FORM > | OPCD | RS | RA | RB | MB | ME |Rc| > > # 1.6.19 MD-FORM > | OPCD | RS | RA | sh | mb |XO|sh|Rc| > > # 1.6.7 X-FORM > | OPCD | RT | RA | /// | XO | / | > > > so... well... i'm not actually seeing any 3-operand-input forms, there. > they're all 2-operand or 1-operand-plus-immediate. > > note: 3-operand *input* not 3-operand *total including output operand*. > > so i'm now wondering what on earth is going on, because we shouldn't need > to be reading reg3 at all. > > any clues? We only need two register operands and an immediate for all ALU instructions. The only ones with 3 register operands are store indexed instructions
ok. i think we need to look more closely at decoder2.py compared to decode2.vhdl https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl i'm not seeing any reason though why the operands should need "swapping over": why RC is needing to go into a. the alignment should definitively be: * reg3_data into c * reg2_data (or imm) into b * reg1_data into a ah hang on... https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl see lines 183-184: logical_0: entity work.logical port map ( rs => c_in, rb => b_in, and here, lines 166-169: rotator_0: entity work.rotator port map ( rs => c_in, ra => a_in, that's it. that's what's going on. the different pipelines have *different* register map/alignment, *depending on function-group*. the *adder* takes A as A, and B as B. rotator crosses over, logical crosses over. what a mess. ok what i suggest, then, here is: split out Logical ALU operations as well. the ADD ALU will deal with OP_ADD and OP_CMP. i don't yet know if we should allow the ADD ALU to change the PC (handle OP_TRAP), i'm inclined to suggest that that one be handled in the Branch Computation Unit. by splitting out: * ADD/CMP into one pipeline * Logical into another * Rotator into another basically on the face of it we need to use the FunctionOp to determine the re-routing of the operands. but... it looks like you've discovered a correlation, where if reg_data3 is available it can go *first* (into out.a), otherwise if reg_data1 is available, *that* goes first (into out.a) however B i think definitely goes into out.b, or the immediate definitely goes into out.b, and i think we'll find that there is no "immediate" form with a register B. fields.txt RB (16:20) Field used to specify a GPR to be used as a source. Formats: A, M, MDS, VA, X, XO ok so strictly speaking MDS-Form has immediates in it, however if you look at what anton's done, in minor-opcode table 30: https://libre-soc.org/openpower/isatables/ he's marked rdlcr as *not* reading the immediate (bit 21, mb/me) as an immediate, but instead operand2 is marked as "RB". which explains why we have to decode insn to get at dec2.mb and dec2.me. so i think we're good to go with reg2_data (or imm_data) always going into out.b and perhaps to have an assert that checks, for these 2-operand pipelines, that reg3_ok and reg1_ok can never be set simultaneously, and to always put reg3_data or reg1_data into out.a. perhaps even as an OR rather than a Mux.
I tried using the rotator.py taken from microwatt, and it seems it doesn't work. After fixing some syntax errors, I set the inputs up for a left shift, and it did not output what I expected. I spent around 45 minutes setting up a testbench for microwatt's rotator.vhdl, and it behaves correctly for the same inputs. Considering that I have already written a working shifter, is it worthwhile trying to get this one to work?
(In reply to Michael Nolan from comment #50) > I tried using the rotator.py taken from microwatt, and it seems it doesn't > work. After fixing some syntax errors, I set the inputs up for a left shift, > and it did not output what I expected. I spent around 45 minutes setting up > a testbench for microwatt's rotator.vhdl, and it behaves correctly for the > same inputs. i may have messed things up in the translation from VHDL to nmigen, i didn't write a unit test for it. > Considering that I have already written a working shifter, is it worthwhile > trying to get this one to work? there are some significant elegant aspects of the microwatt-based one, namely that it covers *all* shift operations in that short and compact code. it uses the trick of having only 1 64-bit shifter (ROTL) instance, putting the 32-bit data in *twice* - once into the top 32-bit and again into the bottom 32-bit - reducing the amount of hardware needed to perform both 32 and 64 bit shifts by at least 30%. however if you feel that you can get something operational and compliant quicker than by debugging rotator.py then go for it. we can look at clean-up and merge, later.
(In reply to Michael Nolan from comment #50) > I tried using the rotator.py taken from microwatt, and it seems it doesn't > work. After fixing some syntax errors, I set the inputs up for a left shift, > and it did not output what I expected. can you commit what you have, i'm background-curious to see the output.
you got it working, cool!
oops soc.shift_rotate.input_stage missing. remember use "git status" to check :)
(In reply to Luke Kenneth Casson Leighton from comment #54) > oops soc.shift_rotate.input_stage missing. remember use "git status" to > check :) got it.... :)
you probably noticed, i split out Logical operations into a separate pipeline, mirroring logical.vhdl https://github.com/antonblanchard/microwatt/blob/master/logical.vhdl popcount, parity, i've added TODO.
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=de6e2485dff057322f6686ddeba6458b63a47159 ah well spotted, the correct field
left some TODOs in the code this morning, UK time: @@ -41,6 +41,12 @@ class LogicalMainStage(PipeModBase): # TODO with m.Case(InternalOp.OP_POPCNT): ###### parity ####### # TODO with m.Case(InternalOp.OP_PRTY): + ###### cmpb ####### + # TODO with m.Case(InternalOp.OP_CMPB): + ###### cntlz ####### + # TODO with m.Case(InternalOp.OP_CNTZ): + ###### bpermd ####### + # TODO with m.Case(InternalOp.OP_BPERM): - not in microwatt and cmp because it is so similar to ADD should go in ALU 37 with m.Switch(self.i.ctx.op.insn_type): 38 #### CMP, CMPL #### 39 # TODO with m.Case(InternalOp.OP_CMP): i've made the corresponding changes to power_enum Function (and to the CSV files) already.
*headdesk* cmp is BACKWARDS! The following will run both a subtract from and cmp and save their respective updates to cr0 to r2 and r1 respectively: li 6, 0x10 li 7, 0x05 # set up inputs li 1, 0 mtcr 1 # clear cr subf. 5, 6, 7 mfcr 2 # save cr to r2 mtcr 1 # clear cr again cmpl cr0, 1, 6, 7 # This *should* give the same result as the subf above mfcr 1 # save cr to r1 (gdb) p/x $r1 $1 = 0x40000000 (gdb) p/x $r2 $2 = 0x80000000 Why IBM, why did you do this?
sub pseudocode: * subf RT,RA,RB (OE=0 Rc=0) RT <- ¬(RA) + (RB) + 1 which is... b-a and in alu/output_stage: comb += self.o.cr0.eq(Cat(so, is_zero, is_positive, is_negative)) so bit 0 of CR0 is "is zero" bit 1 is "is +ve" bit 2 is "is -ve" cmp pseudocode: if L = 0 then a <- EXTS((RA)[32:63] ) b <- EXTS((RB)[32:63]) else a <- (RA) b <- (RB) if a < b then c <- 0b100 else if a > b then c <- 0b010 else c <- 0b001 CR[4*BF+32:4*BF+35] <- c || XER[SO] and cmpi pseudocode: if L = 0 then a <- EXTS((RA)[32:63]) else a <- (RA) if a < EXTS(SI) then c <- 0b100 else if a > EXTS(SI) then c <- 0b010 else c <- 0b001 CR[4*BF+32:4*BF+35] <- c || XER[SO] so that is *a* less than B/immediate, bit 2 (-ve) is set) *a* greater than B/immediate, bit 1 (+ve) is set) which is the opposite of add/subtract. i presume this is best implemented by calculating "a-b" (rather than b-a). note in microwatt, there's comments at line 540 about this: https://github.com/antonblanchard/microwatt/blob/master/execute1.vhdl we can't invert the operands outside of the pipeline: that means that the register port allocation system needs to have a "decoder" in it. the inversion needs to be done *inside* the pipeline, not outside... ok just checking alu/input_stage.py, ok yes that looks good. i say "good", i mean "functional but messy as hell" :) i wonder... i wonder if, rather than mess up the input stage *and* the main stage, which involves two 64-bit MUXes, if it would not be better to swap is_positive and is_negative in the *output* stage? this would only be QTY 2 single-bit MUXes
(In reply to Luke Kenneth Casson Leighton from comment #60) > i wonder... i wonder if, rather than mess up the input stage *and* the > main stage, which involves two 64-bit MUXes, if it would not be better > to swap is_positive and is_negative in the *output* stage? > > this would only be QTY 2 single-bit MUXes i'll give this a shot, see what happens.
(In reply to Luke Kenneth Casson Leighton from comment #61) > i'll give this a shot, see what happens. it worked! i'll do the formal proof output stage modifications
(In reply to Michael Nolan from comment #59) > *headdesk* move the keyboard to the left, first... > cmp is BACKWARDS! :) > (gdb) p/x $r1 > $1 = 0x40000000 > (gdb) p/x $r2 > $2 = 0x80000000 are ya having fun, yet? > Why IBM, why did you do this? hm i dunno... i mean... it could be justified if it's less code to check +ve condition quicker than it is to check -ve condition or vice-versa. then you would use sub for one and cmp for the other. honestly: am clueless :)
(In reply to Luke Kenneth Casson Leighton from comment #63) > honestly: am clueless :) ah i think i might know why. as we found out in the PartitionedSignal GT / LT (etc) code, a comparator is much simpler than a full adder. it's only *because* we are using the output from the adder (actually, subtracter) - purely because it's there - that we're noticing the inversion. if however we had a totally separate Function Unit for CMP, we'd use the less-gates solution, and consequently could swap A and B over. then you'd never notice the difference :)
(In reply to Luke Kenneth Casson Leighton from comment #64) > (In reply to Luke Kenneth Casson Leighton from comment #63) > > > honestly: am clueless :) > > ah i think i might know why. as we found out in the PartitionedSignal > GT / LT (etc) code, a comparator is much simpler than a full adder. > > it's only *because* we are using the output from the adder (actually, > subtracter) - purely because it's there - that we're noticing the > inversion. > > if however we had a totally separate Function Unit for CMP, we'd use > the less-gates solution, and consequently could swap A and B over. > then you'd never notice the difference :) That makes a little more sense. Though if they're going to the trouble of making subf the way it is, you'd think they'd make this the same way. Oh well
(In reply to Michael Nolan from comment #65) > That makes a little more sense. Though if they're going to the trouble of > making subf the way it is, you'd think they'd make this the same way. > > Oh well sigh :) i added popcount (because). it looks like it works except test_cmpb is not working (which stops the simulation early), and if i comment it out, the simulation popcount returns an integer, not a SelectableInt, so goes skew-y. could you take a look?
(In reply to Luke Kenneth Casson Leighton from comment #66) > i added popcount (because). it looks like it works except test_cmpb is > not working after updating submodules (will commit in a sec) File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", line 193, in op_cmpb if eq(RS[8 * n:8 * n + 7 + 1], RB[8 * n:8 * n + 7 + 1]): NameError: name 'RS' is not defined
File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", line 223, in op_prtyd s = s / RS[i % 8 + 7] erm... ermermerm... that doesn't look right. divide? instead of ^ (XOR)? TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt' i think, really, we need to do the same trick done with eq, ge etc, turn all operations from using python "op1 + op2" into "add(op1, op2)" and sort out the operands - convert to SelectableInt - in there.
(In reply to Luke Kenneth Casson Leighton from comment #67) > (In reply to Luke Kenneth Casson Leighton from comment #66) > > > i added popcount (because). it looks like it works except test_cmpb is > > not working > > after updating submodules (will commit in a sec) > > File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", > line 193, in op_cmpb > if eq(RS[8 * n:8 * n + 7 + 1], RB[8 * n:8 * n + 7 + 1]): > NameError: name 'RS' is not defined Oops, needed a patch (or figuring out why the compiler didn't include RS)
(In reply to Luke Kenneth Casson Leighton from comment #68) > File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", > line 223, in op_prtyd > s = s / RS[i % 8 + 7] > > erm... ermermerm... that doesn't look right. divide? instead of ^ (XOR)? > > TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt' > > i think, really, we need to do the same trick done with eq, ge etc, turn all > operations from using python "op1 + op2" into "add(op1, op2)" and sort out > the operands - convert to SelectableInt - in there. There's no way they actually mean division there... If so, then every time that bit is 0 you get a divide by 0...
(In reply to Michael Nolan from comment #70) > (In reply to Luke Kenneth Casson Leighton from comment #68) > > File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", > > line 223, in op_prtyd > > s = s / RS[i % 8 + 7] > > > > erm... ermermerm... that doesn't look right. divide? instead of ^ (XOR)? > There's no way they actually mean division there... If so, then every time > that bit is 0 you get a divide by 0... yep it's a manual transcription error from when i extracted the text from the PDF. i've git pushed ^ instead of /, done the submodule update too.
(In reply to Luke Kenneth Casson Leighton from comment #68) > File "/home/lkcl/src/libresoc/soc/src/soc/decoder/isa/fixedlogical.py", > line 223, in op_prtyd > s = s / RS[i % 8 + 7] > > erm... ermermerm... that doesn't look right. divide? instead of ^ (XOR)? > > TypeError: unsupported operand type(s) for /: 'int' and 'SelectableInt' > The modulus doesn't fit with the description of what the instruction does or with the behavior on qemu. I think they meant for that to be a multiplication. Should we report this to IBM? It looks like they corrected the xor part in the 3.1 spec, but they're still using modulus for the bit indices.
this is from V3.0B - page 98: s <- 0 t <- 0 do i = 0 to 3 s <- s / (RS)[i%8+7] do i = 4 to 7 t <- t / (RS)[i%8+7] so yes that's wrong (divide). microwatt ppc_fx_insns.vhdl has this: function ppc_prtyd (rs: std_ulogic_vector(63 downto 0)) return std_ulogic_vector is variable tmp : std_ulogic; variable ret : std_ulogic_vector(63 downto 0); begin ret := (others => '0'); tmp := '0'; for i in 0 to 7 loop tmp := tmp xor rs(i*8); end loop; ret(0) := tmp; return ret; end; and that's definitely "times 8" there. note that they didn't end up using that function, the loop was unrolled as follows: par0 <= rs(0) xor rs(8) xor rs(16) xor rs(24); par1 <= rs(32) xor rs(40) xor rs(48) xor rs(56); so yes, please do correct the pseudo-code and report it as a bug in the spec to IBM (good idea to cross-reference this discussion)
(In reply to Luke Kenneth Casson Leighton from comment #73) > > so yes, please do correct the pseudo-code and report it as a bug in > the spec to IBM (good idea to cross-reference this discussion) Cool. I've not interacted with IBM before, how do I go about reporting a bug?
(In reply to Michael Nolan from comment #74) > (In reply to Luke Kenneth Casson Leighton from comment #73) > > > > so yes, please do correct the pseudo-code and report it as a bug in > > the spec to IBM (good idea to cross-reference this discussion) > > Cool. I've not interacted with IBM before, how do I go about reporting a bug? oh, sorry: just send a message to openhdl-cores... you may have to subscribe http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/ http://lists.mailinglist.openpowerfoundation.org/mailman/listinfo/openpower-hdl-cores i can do it if you like, so you can focus on coding, etc. up to you, i don't mind.
(In reply to Luke Kenneth Casson Leighton from comment #75) > oh, sorry: just send a message to openhdl-cores... you may have to subscribe > http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl- > cores/ > http://lists.mailinglist.openpowerfoundation.org/mailman/listinfo/openpower- > hdl-cores > > i can do it if you like, so you can focus on coding, etc. up to you, i > don't mind. Reported and the pseudocode has been fixed
(In reply to Michael Nolan from comment #76) > Reported and the pseudocode has been fixed great. ran test_prty and it worked great. hmm at some point we could do with an option to run tests against qemu ppc[32/64][le/be] next on the planning / TODO list, michael: * OP_B* plus OP_TRAP, OP_RFID (anything that sets the PC by setting NIA) * condition register stuff: OP_MFCR, OP_CROP, OP_MTCRF * SPR stuff: OP_MTSPR and OP_MFSPR probably good idea to raise separate bugreports for these. i have no idea where OP_ISEL or OP_MTMSRD would fit :)
michael i am going to do a codeshuffle moving all pipelines to a subdir called pipe. common code can sit at that level including a README, as well as the multi inout construction (based on the ReservationStation class). also at that level would be a specification object that constructs multiple Function Units just like in the diagram from Mitch's book. some of these will be multiple pipes (the ones that only do one clock) and some will be N > 1 ReservationStations but only the one pipe (DIV in particular) also then in front of those will be the relevant CompUnits, some of which we need to construct. ALUCompUnit will be the basis for most of them. it starts with moving the pipelines to a pipe subdir, the rest needs its iwn bugreport.
(In reply to Luke Kenneth Casson Leighton from comment #78) > michael i am going to do a codeshuffle moving all pipelines to a subdir > called pipe. called it "fu" because strictly speaking these pipelines will be managed by Function Unit front-ends.
https://libre-soc.org/openpower/isa/comparefixed/ hmm hmmm the cmpeqb code is stopping ALUIntermediateData from not having cr0 in it. i suggest instead that we put the 8 bits into self.o.o, then special case cmpeqb to test o.any() and only set cr0[2] also hmm, cmp* seem to want to store the comparison in a BR indexed register, not cr0. which sigh tends to suggest we need to move cmp operations to their own pipeline, one that takes the full 32 bits of CR. alternatives are that we expand ALU cr to the full bit width. however if going that route i would like to suggest we pass an index (and mask) saying which CR(s) have been modified. the mask would be sent as write enable lines to the CR regfile. this would remove the need to pass *in* the full CR and stop an unnecessary read hazaelrd being created.
Created attachment 61 [details] ALU main stage hilarious. the actual OP_ADD, OP_CMPEQB, OP_CMP and OP_EXTS operations are ridiculously small compared to the support infrastructure: regfile, DMs, everything basically. i'd heard that routing of data is by far and above the majority of a CPU: it's kind-of illuminating to actually see how small an ALU can be.
apologies, michael, i've introduced a bug somewhere in ALU: adde. 5, 6, 7 ['RA', 'RB'] reading reg 6 reading reg 7 [SelectableInt(value=0xc97cd5181acb17fc, bits=64), SelectableInt(value=0x9de943b0d85daaf9, bits=64)] (SelectableInt(value=0x676618c8f328c2f6, bits=64),) concat 1 SelectableInt(value=0x4, bits=4) [SelectableInt(value=0x1, bits=1), SelectableInt(value=0x1, bits=1)] __eq__ SelectableInt(value=0x0, bits=1) SelectableInt(value=0x1, bits=1) __eq__ SelectableInt(value=0x0, bits=1) SelectableInt(value=0x1, bits=1) writing reg 5 SelectableInt(value=0x676618c8f328c2f6, bits=64) expected 676618c8f328c2f6, actual: 676618c8f328c2f6 File "fu/alu/test/test_pipe_caller.py", line 255, in check_extra_alu_outputs self.assertEqual(cr_expected, cr_actual, code) AssertionError: 4 != 5 : adde. 5, 6, 7
um... um... it just occurred to me: if we detect that XER sticky overflow is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.) only then do we even request a write to the XER.so register, then because it is only setting, i do not believe we even need to read the XER.so at all. as in: i think we can actually *remove* XER.so from ALUInputData, entirely. this is because we don't actually care what the former value was - at all - we are merely setting it. and because that new value is *not* dependent on the old value - at all - we don't need to waste that read port. does this sound like a reasonable analysis?
(In reply to Luke Kenneth Casson Leighton from comment #83) > um... um... it just occurred to me: if we detect that XER sticky overflow > is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.) > only then do we even request a write to the XER.so register, then because > it is only setting, i do not believe we even need to read the XER.so at all. > > as in: i think we can actually *remove* XER.so from ALUInputData, entirely. > > this is because we don't actually care what the former value was - at all - > we are merely setting it. and because that new value is *not* dependent > on the old value - at all - we don't need to waste that read port. > > does this sound like a reasonable analysis? sounds mostly correct to me, as long as the logic for setting CR0's 4th bit still reads the previous value when overflow isn't triggered. SO is a prime candidate for some extra logic to allow multiple instructions to modify it each cycle.
(In reply to Jacob Lifshay from comment #84) > (In reply to Luke Kenneth Casson Leighton from comment #83) > > um... um... it just occurred to me: if we detect that XER sticky overflow > > is 1 in the ALU output stage, and if it is set (and required - insn.OE etc.) > > only then do we even request a write to the XER.so register, then because > > it is only setting, i do not believe we even need to read the XER.so at all. > > > > as in: i think we can actually *remove* XER.so from ALUInputData, entirely. > > > > this is because we don't actually care what the former value was - at all - > > we are merely setting it. and because that new value is *not* dependent > > on the old value - at all - we don't need to waste that read port. > > > > does this sound like a reasonable analysis? > > sounds mostly correct to me, as long as the logic for setting CR0's 4th bit > still reads the previous value when overflow isn't triggered. that's the fly in the ointment. i started removing SO and found that because it is CR0's 4th bit, and because that 4th bit takes XER.SO as input unconditionally as far as the calculation is concerned, XER.SO cannot be removed from the input data structure for ALU. now, it *can* be switched off by the *instruction* - Rc=1/0 - but not by the ALU itself. > SO is a prime > candidate for some extra logic to allow multiple instructions to modify it > each cycle. as SO is being treated like any other register (despite it being only 1 bit wide) it should be picked up by standard WaW (write after write) and unused values thrown away.
i think the best we can do is, when Rc=0, disable reading of XER.so Rc=1 basically makes a hell of a mess of the dependencies.
(In reply to Luke Kenneth Casson Leighton from comment #86) > i think the best we can do is, when Rc=0, disable reading of XER.so > > Rc=1 basically makes a hell of a mess of the dependencies. sounds good. In my very limited experience with disassembling Power code, compilers don't usually enable Rc or overflow writing.
(In reply to Jacob Lifshay from comment #87) > sounds good. In my very limited experience with disassembling Power code, > compilers don't usually enable Rc or overflow writing. that meshes with observations that the spec says performance is adversely affected.
this one is reasonably complete, unit tests ok, formal proof ok
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=29d0566fd104ab8b9ffd0c460fc7c588a72a340b turns out we had missed 32-bit sign-extension from checking the "L" bit. at some point it may be worthwhile replacing the manual decoding of "L" with a separate CSV file entry which sets the op.is_32bit flag.