333 – investigate why CR pipeline code took 100% CPU and locked up generating ILANG

Bug 333 - investigate why CR pipeline code took 100% CPU and locked up generating ILANG

Summary: investigate why CR pipeline code took 100% CPU and locked up generating ILANG

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code (show other bugs)
Version:	unspecified
Hardware:	PC Mac OS

Importance:	High enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:

Depends on:
Blocks:	314
	Show dependency tree / graph

Reported:	2020-05-20 18:55 BST by Luke Kenneth Casson Leighton
Modified:	2020-06-30 16:50 BST (History)
CC List:	2 users (show)

See Also:	319 345
NLnet milestone:	---
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luke Kenneth Casson Leighton 2020-05-20 18:55:31 BST

problem is "fixed" by this commit:
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=60f06a9eb815419f0477a5501c627824da07e078

however should not have occurred in the first place.  needs investigating.

Comment 1 Michael Nolan 2020-05-20 19:16:31 BST

It seems we both stumbled on this issue at about the same time, and arrived at similar solutions. I narrowed down the commit that introduced it to this one: 

commit 7638005f8577ba6545a65a0c6e31c5419f1f684e
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sun May 17 17:15:58 2020 +0100

    use slightly more elegant way to access CR lookup table

diff --git a/src/soc/cr/main_stage.py b/src/soc/cr/main_stage.py
index e29d5ee..23eba32 100644
--- a/src/soc/cr/main_stage.py
+++ b/src/soc/cr/main_stage.py
@@ -67,9 +67,10 @@ class CRMainStage(PipeModBase):
         # the CR to determine what the resulting bit should be.
 
         # Grab the lookup table for cr_op type instructions
-        lut = Signal(4, reset_less=True)
+        lut = Array([Signal(name=f"lut{i}") for i in range(4)])
         # There's no field, just have to grab it directly from the insn
-        comb += lut.eq(self.i.ctx.op.insn[6:10])
+        for i in range(4):
+            comb += lut[i].eq(self.i.ctx.op.insn[6+i])
 
         # Generate the mask for mtcrf, mtocrf, and mfocrf
         fxm = Signal(xfx_fields['FXM'][0:-1].shape())
@@ -108,28 +109,15 @@ class CRMainStage(PipeModBase):
                 comb += ba.eq(xl_fields['BA'][0:-1])
                 comb += bb.eq(xl_fields['BB'][0:-1])
 
-                # Extract the two input bits from the CR
-                bit_a = Signal(reset_less=True)
-                bit_b = Signal(reset_less=True)
-                comb += bit_a.eq(cr_arr[ba])
-                comb += bit_b.eq(cr_arr[bb])
-
-                # Use the two input bits to look up the result in the
-                # lookup table
-                bit_out = Signal(reset_less=True)
-                comb += bit_out.eq(Mux(bit_b,
-                                       Mux(bit_a, lut[3], lut[1]),
-                                       Mux(bit_a, lut[2], lut[0])))
-                # Set the output to the result above
-                comb += cr_out_arr[bt].eq(bit_out)
+                # Use the two input bits to look up the result in the LUT
+                comb += cr_out_arr[bt].eq(lut[Cat(cr_arr[bb], cr_arr[ba])])
 
             ##### mtcrf #####
             with m.Case(InternalOp.OP_MTCRF):
                 # mtocrf and mtcrf are essentially identical
                 # put input (RA) - mask-selected - into output CR, leave
                 # rest of CR alone.
-                comb += cr_o.eq((self.i.a[0:32] & mask) |
-                                     (self.i.cr & ~mask))
+                comb += cr_o.eq((self.i.a[0:32] & mask) | (self.i.cr & ~mask))
 
             ##### mfcr #####
             with m.Case(InternalOp.OP_MFCR):


I think what happens is nmigen generates something really big for arr[index]. Since that gets used to index lut, and then gets assigned to cr_out_arr[idx], it generates a gigantic expression. By assigning the temporary expressions to signals, this doesn't happen. Just a guess

Comment 2 Luke Kenneth Casson Leighton 2020-05-20 19:37:56 BST

(In reply to Michael Nolan from comment #1)

> I think what happens is nmigen generates something really big for
> arr[index]. Since that gets used to index lut, and then gets assigned to
> cr_out_arr[idx], it generates a gigantic expression. By assigning the
> temporary expressions to signals, this doesn't happen. Just a guess

that's twice we've encountered problems related to pmux (Array-indexing).
the other one being PriorityEncoder.

ah hang on.  i think i know.  if the result of the lut is assigned to
a temporary signal *before* going in as the assignment...

yeah.

Comment 3 Luke Kenneth Casson Leighton 2020-05-20 19:45:48 BST

reduces ilang size slightly but not a lot.  need to view the graph (on
a different computer)

Comment 4 Luke Kenneth Casson Leighton 2020-05-20 19:52:28 BST

took a look: it takes *30 seconds* to process the graphviz.  that's a baaad
sign.

when it finally showed up, it's a full crossbar.  wark-wark...

Comment 5 Luke Kenneth Casson Leighton 2020-05-20 19:59:26 BST

cut it down drastically by not using an Array.  sigh.  they were so good,
too.

the cr_out_array is still producing a full crossbar.  it's dreadful.
given that we have to do that reg-splitting for XER, i'm inclined to
suggest we do the same here, and to have inputs and outputs as
crN, indexed, selected and routed *externally* like RA, RB, RC and RT/RS.

Comment 6 Luke Kenneth Casson Leighton 2020-05-20 20:39:19 BST

this is dreadful:

                for i in range(4):
                    comb += cr_out_arr[bf*4 + i].eq(cr_arr[bfa*4 + i])

the computation of the index is done almost a hundred times or more.
32x4 times.

for now it can be done again with a mask (shifted).  two masks - one
to shift-extract cr[bfa*4..+4], shift by the *difference* between
bf and bfa then bit-mask in.

dreadful.

we really need to split CR out into its 8 separate CRs.  XER's fields
are going to have to be done this way anyway.

Comment 7 Michael Nolan 2020-05-20 20:42:50 BST

(In reply to Luke Kenneth Casson Leighton from comment #6)
> this is dreadful:
> 
>                 for i in range(4):
>                     comb += cr_out_arr[bf*4 + i].eq(cr_arr[bfa*4 + i])
> 
> the computation of the index is done almost a hundred times or more.
> 32x4 times.
> 
> for now it can be done again with a mask (shifted).  two masks - one
> to shift-extract cr[bfa*4..+4], shift by the *difference* between
> bf and bfa then bit-mask in.
> 
> dreadful.
> 
> we really need to split CR out into its 8 separate CRs.  XER's fields
> are going to have to be done this way anyway.

So for this module, how would that work? I'm assuming we wouldn't be passing in all 8 CRs, because that would be just as bad as what we have now. But we need the full CR for mfcr and mtcrf. Would we have 2 small cr inputs as well as the full cr, and depending on whether which opcode it is we select which one to use?

Comment 8 Luke Kenneth Casson Leighton 2020-05-20 20:46:21 BST

(In reply to Michael Nolan from comment #7)

> So for this module, how would that work? I'm assuming we wouldn't be passing
> in all 8 CRs, because that would be just as bad as what we have now. But we
> need the full CR for mfcr and mtcrf. Would we have 2 small cr inputs as well
> as the full cr, and depending on whether which opcode it is we select which
> one to use?

yes basically.  perhaps even use the 32 bit path to get the smaller regs to it.

Comment 9 Luke Kenneth Casson Leighton 2020-05-21 03:05:38 BST

just as spr and reg is decoded in power_decode2.py for RA RB RC and output, csv files would have "reg is CR" enum option.

each DecodeRegA (etc) would then note that and place BFA insn fields (etc) into not a reg_data or an spr_data but a cr_data field and indicate that they are valid.

these - 3 - CR reg fields then get put into Execute1Type just like reg and spr data.

CrOpSubset (to be created) then picks those up.

the instruction issue would also note them and request DM Matrix Reservations just like it would for RA RB RC RS/T except for the Condition Regfile (and Matrix) not the INT regfile.

Comment 10 Luke Kenneth Casson Leighton 2020-05-21 17:02:24 BST

hm need to decide the register allocation scheme.  it is ok to have 4bit CR go into a 32bit operand but not to have CR data go down an INT path.

or, if we do, MUXes are needed to cross data from one series of regfile paths to another.

this could be done for when we have the cyclic buffers in place, easily.

Comment 11 Michael Nolan 2020-05-21 18:23:27 BST

Luke, there's not a way to do something like a C union in nmigen is there? I'm thinking for the CR pipeline to have it put one or two of the crs in the 32 bit cr input for the cases where it doesn't need the full cr as input.

Comment 12 Luke Kenneth Casson Leighton 2020-05-21 18:32:37 BST

(In reply to Michael Nolan from comment #11)
> Luke, there's not a way to do something like a C union in nmigen is there?

no, you just push bits around.  it's up to the recipient to have the required state to correctly interpret them.


> I'm thinking for the CR pipeline to have it put one or two of the crs in the
> 32 bit cr input for the cases where it doesn't need the full cr as input.

mmmm that again involves "lane crossing MUXes" at the operand buses.  it is slightly better in that it is not crossing a regfile but it is still nontrivial.

however the (planned) cyclic buffers now become even more complex, because each register could contain two values not one.

oh, i saw the CSV changes, adding the extra field.  i was going to suggest simply using src1 src2 src3, by adding "CR" type to the Reg Enum :)

(sorry for not mentioning that sooner...)

then if src1 is CR and op is.. whatever, then DecodeA drops the BFA field (whatever) into... mmm... rega.cr_out

etc.

the CRin field just added could be part-used for that purpose.  it contains in effect the same information, without needing the OP_CRXXX test at DecodeA/B/C/O

Comment 13 Luke Kenneth Casson Leighton 2020-05-21 18:39:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)

> however the (planned) cyclic buffers now become even more complex, because
> each register could contain two values not one.

answer: it would not be a good idea. much simpler - cleaner - to have extra 4 bit data path(s) for the src CR regs that won't go into src1.

remember, it would mean that *at the regfile* access to a *pair* of 4 bit Read Ports would need to be synchronised, just to make the transfer of 2 simultaneous 4 bit values down that one 32 bit path.

this in addition to making the cyclic buffers far more complex.

Comment 14 Luke Kenneth Casson Leighton 2020-05-21 18:46:02 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Michael Nolan from comment #11)
> > Luke, there's not a way to do something like a C union in nmigen is there?
> 
> no, you just push bits around.  it's up to the recipient to have the
> required state to correctly interpret them.

everything in nmigen comes with an implicit typecast to void* ... :)


whitequark would very much like to fix this.  we actually use it in a lot of places.

however the fix is a massive systematic fundamental design rewrite and would involve introducing c-like unions.

Comment 15 Michael Nolan 2020-05-21 18:50:25 BST

(In reply to Luke Kenneth Casson Leighton from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> 
> > however the (planned) cyclic buffers now become even more complex, because
> > each register could contain two values not one.
> 
> answer: it would not be a good idea. much simpler - cleaner - to have extra
> 4 bit data path(s) for the src CR regs that won't go into src1.
> 
> remember, it would mean that *at the regfile* access to a *pair* of 4 bit
> Read Ports would need to be synchronised, just to make the transfer of 2
> simultaneous 4 bit values down that one 32 bit path.
> 
> this in addition to making the cyclic buffers far more complex.

Ok. I notice there's now a regspec array for the input and output data structs. What would a 4 bit CR be?

Comment 16 Luke Kenneth Casson Leighton 2020-05-21 19:30:38 BST

(In reply to Michael Nolan from comment #15)

> Ok. I notice there's now a regspec array for the input and output data
> structs. What would a 4 bit CR be?

ermmm... ermermerm.... probably just "CR", and a range "0:3" (it's inclusive,
unlike python slices)

Comment 17 Luke Kenneth Casson Leighton 2020-05-21 19:34:36 BST

(In reply to Luke Kenneth Casson Leighton from comment #16)
> (In reply to Michael Nolan from comment #15)
> 
> > Ok. I notice there's now a regspec array for the input and output data
> > structs. What would a 4 bit CR be?
> 
> ermmm... ermermerm.... probably just "CR", and a range "0:3" (it's inclusive,
> unlike python slices)

ok sorry:

* for the field that's supposed to fit the full CR, it remains 0:31 (32 bits)

  we will drop only 4 bits into it for src-cr-A (read_cr1) for BA,
  and a full 32-bits for CR_MTCRF (etc) involving XFX mask.

  so src-cr-A would need to be a full 32-bit wide.

* for the 2nd field, it will be 0:3 (4 bits)

  there it will only be 4 bits wide so src-cr-B operand would only be 4 bits

Comment 18 Luke Kenneth Casson Leighton 2020-05-21 19:35:40 BST

oh!  nice!  you already put "CR0" into the input field of all the CSV files!

Comment 19 Luke Kenneth Casson Leighton 2020-05-21 19:43:11 BST

https://bugs.libre-soc.org/show_bug.cgi?id=316#c79

    self.cr_out.eq(CROutSel[row['CR out']]),
  File "/usr/lib/python3.7/enum.py", line 351, in __getitem__
    return cls._member_map_[name]
KeyError: '0'

unit tests not happy any more, michael.

Comment 20 Luke Kenneth Casson Leighton 2020-05-21 19:56:03 BST

@unique
class CROutSel(Enum):
    NONE = 0
    CR0 = 1
    BF = 2
    BT = 3
    WHOLE_REG = 4

ah.  whoops.  minor_31.csv, CR0 has been put as the "CR in", where it should be "NONE", and it should be "CR out" that should be CR0.

i will adjust minor_31.csv

Comment 21 Luke Kenneth Casson Leighton 2020-05-21 19:59:36 BST

(In reply to Luke Kenneth Casson Leighton from comment #20)

> i will adjust minor_31.csv

saw the commit, looks like you did it already, will double-check.

Comment 22 Luke Kenneth Casson Leighton 2020-05-21 20:08:03 BST

michael, as cr_c is an output field, it should only go in CROutputData.
so i'm just going to remove it.  we can discuss whether to put cr_a into
full_cr later, there is justification for doing it either way.

i'll also update the pipe_data.py regspecs

Comment 23 Michael Nolan 2020-05-21 20:09:49 BST

(In reply to Luke Kenneth Casson Leighton from comment #22)
> michael, as cr_c is an output field, it should only go in CROutputData.
> so i'm just going to remove it.  we can discuss whether to put cr_a into
> full_cr later, there is justification for doing it either way.
> 
> i'll also update the pipe_data.py regspecs

No, it's not an output field. For OP_CROP, it only does a partial update to the output cr, so it needs to know what it was before.

Comment 24 Luke Kenneth Casson Leighton 2020-05-21 20:22:24 BST

this creates a false (unnecessary) dependency, where cr_c would need to be
read on every OP_CR* operation, and "passed through".  the idea is to
enable cr_o.ok for instances where it's needed.

or, are there *really* operations that need 3 4-bit CRs as input?  i
don't believe so


@@ -53,9 +53,6 @@ class CRMainStage(PipeModBase):
         cr_b_arr = Array([cr_b[i] for i in range(4)])
         cr_o_arr = Array([cr_o[i] for i in range(4)])
 
-        comb += cr_o.eq(cr_c)
-
-
         with m.Switch(op.insn_type):
             ##### mcrf #####
             with m.Case(InternalOp.OP_MCRF):

Comment 25 Luke Kenneth Casson Leighton 2020-05-21 20:23:14 BST

(In reply to Michael Nolan from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #22)
> > michael, as cr_c is an output field, it should only go in CROutputData.
> > so i'm just going to remove it.  we can discuss whether to put cr_a into
> > full_cr later, there is justification for doing it either way.
> > 
> > i'll also update the pipe_data.py regspecs
> 
> No, it's not an output field. For OP_CROP, it only does a partial update to
> the output cr, so it needs to know what it was before.

oh god.  blech.  ok.  yes.  got it.  lucky i didn't commit anything, eh?

that just leaves updating the regspecs, which i'll do now.

l.

Comment 26 Luke Kenneth Casson Leighton 2020-05-21 20:28:09 BST

ls -altr

      90820 May 21 20:27 cr_pipeline.il

WOW.  down from 600k to 90k.

absolutely massive difference.

Comment 27 Luke Kenneth Casson Leighton 2020-05-21 20:37:41 BST

we should also do a CompCROpSubset rather than read from Decode2ExecuteType.
i'll add one now (but not hook it in) so it can be reviewed

Comment 28 Luke Kenneth Casson Leighton 2020-05-21 20:50:17 BST

ehhhm... i don't actually think main_stage.py needs *any* of these.

* registers are identified and passed in according to DecodeCRIn/Out
  in combination with regspecs

* fn_unit might be a good idea to leave in

* insn_type is pretty much literally the only thing that's actually
  used, isn't it?

there's nothing in CR main_stage.py which is different for 32/64 bit, is there?
read_cr_whole and write_cr_whole might be useful information if we decide
to do MUXing, but that would be back at the CRCompUnit

do we need input_cr or output_cr - i mean in main_stage.py?  i don't
think we do.


class CompCROpSubset(Record):
    def __init__(self, name=None):
        layout = (('insn_type', InternalOp),
                  ('fn_unit', Function),
                  ('input_cr', CRInSel),
                  ('output_cr', CROutSel),
                  ('read_cr_whole', 1),
                  ('write_cr_whole', 1),
                  ('is_32bit', 1),
                  )

Comment 29 Luke Kenneth Casson Leighton 2020-05-21 20:59:34 BST

ah, just realised: the convention for outgoing regs is to use Data()
main_stage.py (or output_stage.py) then sets "ok" to indicate that
it has been modified.  will do that now.

Comment 30 Michael Nolan 2020-05-22 14:36:07 BST

(In reply to Luke Kenneth Casson Leighton from comment #26)
> ls -altr
> 
>       90820 May 21 20:27 cr_pipeline.il
> 
> WOW.  down from 600k to 90k.
> 
> absolutely massive difference.

Gates are down too, from 2186 to 818



Finally managed to get the CR proof working by leaving the original proof bits intact, and adding machinery to select the cr inputs and combine it back into the outputs.

Comment 31 Luke Kenneth Casson Leighton 2020-05-22 14:49:14 BST

(In reply to Michael Nolan from comment #30)
> (In reply to Luke Kenneth Casson Leighton from comment #26)
> > ls -altr
> > 
> >       90820 May 21 20:27 cr_pipeline.il
> > 
> > WOW.  down from 600k to 90k.
> > 
> > absolutely massive difference.
> 
> Gates are down too, from 2186 to 818

superb.  i don't believe we switched one problem for another: the
Condition Regfile can be designed in a manageable fashion, not
involving accidental creation of massive crossbars.

> Finally managed to get the CR proof working by leaving the original proof
> bits intact, and adding machinery to select the cr inputs and combine it
> back into the outputs.

eek :)