problem is "fixed" by this commit: https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=60f06a9eb815419f0477a5501c627824da07e078 however should not have occurred in the first place. needs investigating.
It seems we both stumbled on this issue at about the same time, and arrived at similar solutions. I narrowed down the commit that introduced it to this one: commit 7638005f8577ba6545a65a0c6e31c5419f1f684e Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Sun May 17 17:15:58 2020 +0100 use slightly more elegant way to access CR lookup table diff --git a/src/soc/cr/main_stage.py b/src/soc/cr/main_stage.py index e29d5ee..23eba32 100644 --- a/src/soc/cr/main_stage.py +++ b/src/soc/cr/main_stage.py @@ -67,9 +67,10 @@ class CRMainStage(PipeModBase): # the CR to determine what the resulting bit should be. # Grab the lookup table for cr_op type instructions - lut = Signal(4, reset_less=True) + lut = Array([Signal(name=f"lut{i}") for i in range(4)]) # There's no field, just have to grab it directly from the insn - comb += lut.eq(self.i.ctx.op.insn[6:10]) + for i in range(4): + comb += lut[i].eq(self.i.ctx.op.insn[6+i]) # Generate the mask for mtcrf, mtocrf, and mfocrf fxm = Signal(xfx_fields['FXM'][0:-1].shape()) @@ -108,28 +109,15 @@ class CRMainStage(PipeModBase): comb += ba.eq(xl_fields['BA'][0:-1]) comb += bb.eq(xl_fields['BB'][0:-1]) - # Extract the two input bits from the CR - bit_a = Signal(reset_less=True) - bit_b = Signal(reset_less=True) - comb += bit_a.eq(cr_arr[ba]) - comb += bit_b.eq(cr_arr[bb]) - - # Use the two input bits to look up the result in the - # lookup table - bit_out = Signal(reset_less=True) - comb += bit_out.eq(Mux(bit_b, - Mux(bit_a, lut[3], lut[1]), - Mux(bit_a, lut[2], lut[0]))) - # Set the output to the result above - comb += cr_out_arr[bt].eq(bit_out) + # Use the two input bits to look up the result in the LUT + comb += cr_out_arr[bt].eq(lut[Cat(cr_arr[bb], cr_arr[ba])]) ##### mtcrf ##### with m.Case(InternalOp.OP_MTCRF): # mtocrf and mtcrf are essentially identical # put input (RA) - mask-selected - into output CR, leave # rest of CR alone. - comb += cr_o.eq((self.i.a[0:32] & mask) | - (self.i.cr & ~mask)) + comb += cr_o.eq((self.i.a[0:32] & mask) | (self.i.cr & ~mask)) ##### mfcr ##### with m.Case(InternalOp.OP_MFCR): I think what happens is nmigen generates something really big for arr[index]. Since that gets used to index lut, and then gets assigned to cr_out_arr[idx], it generates a gigantic expression. By assigning the temporary expressions to signals, this doesn't happen. Just a guess
(In reply to Michael Nolan from comment #1) > I think what happens is nmigen generates something really big for > arr[index]. Since that gets used to index lut, and then gets assigned to > cr_out_arr[idx], it generates a gigantic expression. By assigning the > temporary expressions to signals, this doesn't happen. Just a guess that's twice we've encountered problems related to pmux (Array-indexing). the other one being PriorityEncoder. ah hang on. i think i know. if the result of the lut is assigned to a temporary signal *before* going in as the assignment... yeah.
reduces ilang size slightly but not a lot. need to view the graph (on a different computer)
took a look: it takes *30 seconds* to process the graphviz. that's a baaad sign. when it finally showed up, it's a full crossbar. wark-wark...
cut it down drastically by not using an Array. sigh. they were so good, too. the cr_out_array is still producing a full crossbar. it's dreadful. given that we have to do that reg-splitting for XER, i'm inclined to suggest we do the same here, and to have inputs and outputs as crN, indexed, selected and routed *externally* like RA, RB, RC and RT/RS.
this is dreadful: for i in range(4): comb += cr_out_arr[bf*4 + i].eq(cr_arr[bfa*4 + i]) the computation of the index is done almost a hundred times or more. 32x4 times. for now it can be done again with a mask (shifted). two masks - one to shift-extract cr[bfa*4..+4], shift by the *difference* between bf and bfa then bit-mask in. dreadful. we really need to split CR out into its 8 separate CRs. XER's fields are going to have to be done this way anyway.
(In reply to Luke Kenneth Casson Leighton from comment #6) > this is dreadful: > > for i in range(4): > comb += cr_out_arr[bf*4 + i].eq(cr_arr[bfa*4 + i]) > > the computation of the index is done almost a hundred times or more. > 32x4 times. > > for now it can be done again with a mask (shifted). two masks - one > to shift-extract cr[bfa*4..+4], shift by the *difference* between > bf and bfa then bit-mask in. > > dreadful. > > we really need to split CR out into its 8 separate CRs. XER's fields > are going to have to be done this way anyway. So for this module, how would that work? I'm assuming we wouldn't be passing in all 8 CRs, because that would be just as bad as what we have now. But we need the full CR for mfcr and mtcrf. Would we have 2 small cr inputs as well as the full cr, and depending on whether which opcode it is we select which one to use?
(In reply to Michael Nolan from comment #7) > So for this module, how would that work? I'm assuming we wouldn't be passing > in all 8 CRs, because that would be just as bad as what we have now. But we > need the full CR for mfcr and mtcrf. Would we have 2 small cr inputs as well > as the full cr, and depending on whether which opcode it is we select which > one to use? yes basically. perhaps even use the 32 bit path to get the smaller regs to it.
just as spr and reg is decoded in power_decode2.py for RA RB RC and output, csv files would have "reg is CR" enum option. each DecodeRegA (etc) would then note that and place BFA insn fields (etc) into not a reg_data or an spr_data but a cr_data field and indicate that they are valid. these - 3 - CR reg fields then get put into Execute1Type just like reg and spr data. CrOpSubset (to be created) then picks those up. the instruction issue would also note them and request DM Matrix Reservations just like it would for RA RB RC RS/T except for the Condition Regfile (and Matrix) not the INT regfile.
hm need to decide the register allocation scheme. it is ok to have 4bit CR go into a 32bit operand but not to have CR data go down an INT path. or, if we do, MUXes are needed to cross data from one series of regfile paths to another. this could be done for when we have the cyclic buffers in place, easily.
Luke, there's not a way to do something like a C union in nmigen is there? I'm thinking for the CR pipeline to have it put one or two of the crs in the 32 bit cr input for the cases where it doesn't need the full cr as input.
(In reply to Michael Nolan from comment #11) > Luke, there's not a way to do something like a C union in nmigen is there? no, you just push bits around. it's up to the recipient to have the required state to correctly interpret them. > I'm thinking for the CR pipeline to have it put one or two of the crs in the > 32 bit cr input for the cases where it doesn't need the full cr as input. mmmm that again involves "lane crossing MUXes" at the operand buses. it is slightly better in that it is not crossing a regfile but it is still nontrivial. however the (planned) cyclic buffers now become even more complex, because each register could contain two values not one. oh, i saw the CSV changes, adding the extra field. i was going to suggest simply using src1 src2 src3, by adding "CR" type to the Reg Enum :) (sorry for not mentioning that sooner...) then if src1 is CR and op is.. whatever, then DecodeA drops the BFA field (whatever) into... mmm... rega.cr_out etc. the CRin field just added could be part-used for that purpose. it contains in effect the same information, without needing the OP_CRXXX test at DecodeA/B/C/O
(In reply to Luke Kenneth Casson Leighton from comment #12) > however the (planned) cyclic buffers now become even more complex, because > each register could contain two values not one. answer: it would not be a good idea. much simpler - cleaner - to have extra 4 bit data path(s) for the src CR regs that won't go into src1. remember, it would mean that *at the regfile* access to a *pair* of 4 bit Read Ports would need to be synchronised, just to make the transfer of 2 simultaneous 4 bit values down that one 32 bit path. this in addition to making the cyclic buffers far more complex.
(In reply to Luke Kenneth Casson Leighton from comment #12) > (In reply to Michael Nolan from comment #11) > > Luke, there's not a way to do something like a C union in nmigen is there? > > no, you just push bits around. it's up to the recipient to have the > required state to correctly interpret them. everything in nmigen comes with an implicit typecast to void* ... :) whitequark would very much like to fix this. we actually use it in a lot of places. however the fix is a massive systematic fundamental design rewrite and would involve introducing c-like unions.
(In reply to Luke Kenneth Casson Leighton from comment #13) > (In reply to Luke Kenneth Casson Leighton from comment #12) > > > however the (planned) cyclic buffers now become even more complex, because > > each register could contain two values not one. > > answer: it would not be a good idea. much simpler - cleaner - to have extra > 4 bit data path(s) for the src CR regs that won't go into src1. > > remember, it would mean that *at the regfile* access to a *pair* of 4 bit > Read Ports would need to be synchronised, just to make the transfer of 2 > simultaneous 4 bit values down that one 32 bit path. > > this in addition to making the cyclic buffers far more complex. Ok. I notice there's now a regspec array for the input and output data structs. What would a 4 bit CR be?
(In reply to Michael Nolan from comment #15) > Ok. I notice there's now a regspec array for the input and output data > structs. What would a 4 bit CR be? ermmm... ermermerm.... probably just "CR", and a range "0:3" (it's inclusive, unlike python slices)
(In reply to Luke Kenneth Casson Leighton from comment #16) > (In reply to Michael Nolan from comment #15) > > > Ok. I notice there's now a regspec array for the input and output data > > structs. What would a 4 bit CR be? > > ermmm... ermermerm.... probably just "CR", and a range "0:3" (it's inclusive, > unlike python slices) ok sorry: * for the field that's supposed to fit the full CR, it remains 0:31 (32 bits) we will drop only 4 bits into it for src-cr-A (read_cr1) for BA, and a full 32-bits for CR_MTCRF (etc) involving XFX mask. so src-cr-A would need to be a full 32-bit wide. * for the 2nd field, it will be 0:3 (4 bits) there it will only be 4 bits wide so src-cr-B operand would only be 4 bits
oh! nice! you already put "CR0" into the input field of all the CSV files!
https://bugs.libre-soc.org/show_bug.cgi?id=316#c79 self.cr_out.eq(CROutSel[row['CR out']]), File "/usr/lib/python3.7/enum.py", line 351, in __getitem__ return cls._member_map_[name] KeyError: '0' unit tests not happy any more, michael.
@unique class CROutSel(Enum): NONE = 0 CR0 = 1 BF = 2 BT = 3 WHOLE_REG = 4 ah. whoops. minor_31.csv, CR0 has been put as the "CR in", where it should be "NONE", and it should be "CR out" that should be CR0. i will adjust minor_31.csv
(In reply to Luke Kenneth Casson Leighton from comment #20) > i will adjust minor_31.csv saw the commit, looks like you did it already, will double-check.
michael, as cr_c is an output field, it should only go in CROutputData. so i'm just going to remove it. we can discuss whether to put cr_a into full_cr later, there is justification for doing it either way. i'll also update the pipe_data.py regspecs
(In reply to Luke Kenneth Casson Leighton from comment #22) > michael, as cr_c is an output field, it should only go in CROutputData. > so i'm just going to remove it. we can discuss whether to put cr_a into > full_cr later, there is justification for doing it either way. > > i'll also update the pipe_data.py regspecs No, it's not an output field. For OP_CROP, it only does a partial update to the output cr, so it needs to know what it was before.
this creates a false (unnecessary) dependency, where cr_c would need to be read on every OP_CR* operation, and "passed through". the idea is to enable cr_o.ok for instances where it's needed. or, are there *really* operations that need 3 4-bit CRs as input? i don't believe so @@ -53,9 +53,6 @@ class CRMainStage(PipeModBase): cr_b_arr = Array([cr_b[i] for i in range(4)]) cr_o_arr = Array([cr_o[i] for i in range(4)]) - comb += cr_o.eq(cr_c) - - with m.Switch(op.insn_type): ##### mcrf ##### with m.Case(InternalOp.OP_MCRF):
(In reply to Michael Nolan from comment #23) > (In reply to Luke Kenneth Casson Leighton from comment #22) > > michael, as cr_c is an output field, it should only go in CROutputData. > > so i'm just going to remove it. we can discuss whether to put cr_a into > > full_cr later, there is justification for doing it either way. > > > > i'll also update the pipe_data.py regspecs > > No, it's not an output field. For OP_CROP, it only does a partial update to > the output cr, so it needs to know what it was before. oh god. blech. ok. yes. got it. lucky i didn't commit anything, eh? that just leaves updating the regspecs, which i'll do now. l.
ls -altr 90820 May 21 20:27 cr_pipeline.il WOW. down from 600k to 90k. absolutely massive difference.
we should also do a CompCROpSubset rather than read from Decode2ExecuteType. i'll add one now (but not hook it in) so it can be reviewed
ehhhm... i don't actually think main_stage.py needs *any* of these. * registers are identified and passed in according to DecodeCRIn/Out in combination with regspecs * fn_unit might be a good idea to leave in * insn_type is pretty much literally the only thing that's actually used, isn't it? there's nothing in CR main_stage.py which is different for 32/64 bit, is there? read_cr_whole and write_cr_whole might be useful information if we decide to do MUXing, but that would be back at the CRCompUnit do we need input_cr or output_cr - i mean in main_stage.py? i don't think we do. class CompCROpSubset(Record): def __init__(self, name=None): layout = (('insn_type', InternalOp), ('fn_unit', Function), ('input_cr', CRInSel), ('output_cr', CROutSel), ('read_cr_whole', 1), ('write_cr_whole', 1), ('is_32bit', 1), )
ah, just realised: the convention for outgoing regs is to use Data() main_stage.py (or output_stage.py) then sets "ok" to indicate that it has been modified. will do that now.
(In reply to Luke Kenneth Casson Leighton from comment #26) > ls -altr > > 90820 May 21 20:27 cr_pipeline.il > > WOW. down from 600k to 90k. > > absolutely massive difference. Gates are down too, from 2186 to 818 Finally managed to get the CR proof working by leaving the original proof bits intact, and adding machinery to select the cr inputs and combine it back into the outputs.
(In reply to Michael Nolan from comment #30) > (In reply to Luke Kenneth Casson Leighton from comment #26) > > ls -altr > > > > 90820 May 21 20:27 cr_pipeline.il > > > > WOW. down from 600k to 90k. > > > > absolutely massive difference. > > Gates are down too, from 2186 to 818 superb. i don't believe we switched one problem for another: the Condition Regfile can be designed in a manageable fashion, not involving accidental creation of massive crossbars. > Finally managed to get the CR proof working by leaving the original proof > bits intact, and adding machinery to select the cr inputs and combine it > back into the outputs. eek :)