Bug 469 - Create D-cache from microwatt dcache.vhdl
Summary: Create D-cache from microwatt dcache.vhdl
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Linux
: Normal normal
Deadline: 2020-09-30
Assignee: Cole Poirier
URL:
Depends on: 450 502
Blocks: 383 491
  Show dependency treegraph
 
Reported: 2020-08-18 22:22 BST by Cole Poirier
Modified: 2021-05-07 04:12 BST (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 51
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Cole Poirier 2020-08-18 22:22:46 BST
The goal is to achieve a working D-cache by translating microwatt's vhdl dcache code

***Must follow up on nmigen SRAM attributes***

also TODO: cache_ram.vhdl
Comment 1 Cole Poirier 2020-08-21 03:51:53 BST
Sorry, I just now committed the work I've done on this translation from yesterday and today. I was working yesterday, and had to leave the house immediately once I noticed the time in order to not miss an appointment, so I didn't have time to properly review my git diff and write a commit message. Again, apologies.

Luke, can you give me some guidance on this specifically relating to tranlading vhdl generate blocks, port maps, and generic maps into nmigen? (and a few other things that I've skipped due to lack of understanding but are marked in the file with comments that contain the keyword TODO).

Thanks in advance, as I will see your comments in the morning :)
Comment 2 Luke Kenneth Casson Leighton 2020-08-21 12:02:14 BST
done some, you will need to read, extrapolate and resolve others.

* range(THE_CONSTANT) not range (SomeFunctionReturningASignal())
* remove all default values of all function parameter arguments
* remove "severity failure" and put comments either back on one
  line or use backslash and two strings, one per line
Comment 3 Luke Kenneth Casson Leighton 2020-08-21 22:11:40 BST
#     subtype way_t is integer range 0 to NUM_WAYS-1;


therefore wherever you *see* way_t, replace it with "NUM_WAYS".

#         hit_way   : way_t;

                self.hit_way   = Signal(WAY)

no - this needs to be:

                self.hit_way   = Signal(NUM_WAYS)

likewise:

#         forward_row1   : row_t;

                self.forward_row1     = Signal(ROW)

no - look at the definition of row_t:

#     subtype row_t is integer range 0 to BRAM_ROWS-1;

therefore that needs to be:

                self.forward_row1     = Signal(BRAM_ROWS)

etc. etc.  exactly the same for INDEX, and probably CACHE_TAG as well.
Comment 4 Luke Kenneth Casson Leighton 2020-08-22 01:02:50 BST
right, ok, not quite correct.
log2_int(NUM_WAYS).

etc.

i suggest creating some constante szNUM_WAYS = log2...etc
Comment 5 Cole Poirier 2020-08-24 01:51:07 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> #     subtype way_t is integer range 0 to NUM_WAYS-1;
> 
> 
> therefore wherever you *see* way_t, replace it with "NUM_WAYS".
> 
> #         hit_way   : way_t;
> 
>                 self.hit_way   = Signal(WAY)
> 
> no - this needs to be:
> 
>                 self.hit_way   = Signal(NUM_WAYS)
> 
> likewise:
> 
> #         forward_row1   : row_t;
> 
>                 self.forward_row1     = Signal(ROW)
> 
> no - look at the definition of row_t:
> 
> #     subtype row_t is integer range 0 to BRAM_ROWS-1;
> 
> therefore that needs to be:
> 
>                 self.forward_row1     = Signal(BRAM_ROWS)
> 
> etc. etc.  exactly the same for INDEX, and probably CACHE_TAG as well.

So each subtype unless it is an array is simply an aliasing of the constant it specifies directly or through one or more levels of subtyping indirection?
Comment 6 Cole Poirier 2020-08-24 01:52:10 BST
(In reply to Luke Kenneth Casson Leighton from comment #4)
> right, ok, not quite correct.
> log2_int(NUM_WAYS).
> 
> etc.
> 
> i suggest creating some constante szNUM_WAYS = log2...etc

Sorry I'm having trouble understanding what you're trying to convey here. Can you please explain it a bit more verbosely?
Comment 7 Cole Poirier 2020-08-24 01:59:36 BST
(In reply to Cole Poirier from comment #1)

> Luke, can you give me some guidance on this specifically relating to
> tranlading vhdl generate blocks, port maps, and generic maps into nmigen?
> (and a few other things that I've skipped due to lack of understanding but
> are marked in the file with comments that contain the keyword TODO).

I've finished my first translation pass of dcache.vhdl -> dcache.py.
As I stated in this commit I've left around 5% undone as they involved generate and instance statements that I couldn't figure out how to translate and would like some help and guidance on.

As I said in my earlier comment (quoted above), this also include inputs to modules in the form of port and generic, maps.

Additionally I'm 100% sure I'm not setting up the classes/modules right, and looking at mmu.py gives me a good idea of how to fix this. I just don't have the energy left to do this today.

So, can you provide some corrections and help with the things I've asked for (marked in the file with comments containing the keyword TODO), and leave for me to make an attempt at tomorrow: the fixing of the arrangment of things I created classes for into functions as they are in mmu.py?

Thanks, as always for your help and guidance :)
Comment 8 Luke Kenneth Casson Leighton 2020-08-24 02:46:46 BST

szNUM_WAYS = log2_int(NUM_WAYS)
self.hit_way   = Signal(szNUM_WAYS)
Comment 9 Luke Kenneth Casson Leighton 2020-08-24 02:48:18 BST
(In reply to Cole Poirier from comment #7)

> Thanks, as always for your help and guidance :)

3am at the moment, will do some tomorrow.
Comment 10 Luke Kenneth Casson Leighton 2020-08-24 12:43:46 BST
(In reply to Luke Kenneth Casson Leighton from comment #8)
> 
> szNUM_WAYS = log2_int(NUM_WAYS)
> self.hit_way   = Signal(szNUM_WAYS)

commit b8e1e712b6aa41fa2f084bceb808672804a4abdc (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Aug 24 12:37:32 2020 +0100

    use WAY_BITS in appropriate locations


ok.

* numbers are in binary and way_t is declared as 0..NUM_WAYS-1
* therefore it takes log2_int(NUM_WAYS) bits to *store* a number in a Signal
  that will cover all values 0..NUM_WAYS-1
* turns out that WAY_BITS has already been assigned to log2_int(NUM_WAYS).
* therefore Signals of type "way_t" must be declared as Signal(WAY_BITS)

you'll need to be very patient and focussed, going step-by-step and blocking
out everything other than what you are focussing on.

bear in mind that a for-loop from 0 to NUM_WAYS-1 is *not* to replaced with
WAY_BITS i.e. this is... errr... not correct

#                 for i in way_t loop
                for i in range(WAY):

_this_ is correct:

#                 for i in way_t loop
                for i in range(NUM_WAYS):


commit b45fab94d8370ef8297b9bb303bf0d6fc161193e (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Aug 24 12:43:34 2020 +0100

    "WAY" does not exist - range(NUM_WAYS) was intended
Comment 11 Cole Poirier 2020-08-26 19:09:10 BST
Ok, done all the changes you recommended, and rearranged the file to conform to the example you showed by transmuting my original mmu.py into the current form of mmu.py. Now I *really* need your help with the substance of this module as there are things I simply do not know how to do, all such things are marked with a comment containing the keyword TODO.
Comment 12 Cole Poirier 2020-08-26 21:33:53 BST
Hi Luke, I'm struggling with translating the generate statements and generic/port maps, can you give me some feedback on whether the following is a correct translation? Thank you :)

```dcache.vhdl
-- Generate TLB PLRUs
maybe_tlb_plrus: if TLB_NUM_WAYS > 1 generate
begin
tlb_plrus: for i in 0 to TLB_SET_SIZE - 1 generate
    -- TLB PLRU interface
    signal tlb_plru_acc    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
    signal tlb_plru_acc_en : std_ulogic;
    signal tlb_plru_out    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
begin
tlb_plru : entity work.plru
    generic map (
        BITS => TLB_WAY_BITS
        )
    port map (
        clk => clk,
        rst => rst,
        acc => tlb_plru_acc,
        acc_en => tlb_plru_acc_en,
        lru => tlb_plru_out
        );
 
    process(all)
    begin
    -- PLRU interface
    if r1.tlb_hit_index = i then
        tlb_plru_acc_en <= r1.tlb_hit;
    else
        tlb_plru_acc_en <= '0';
    end if;
    tlb_plru_acc <= std_ulogic_vector(to_unsigned(r1.tlb_hit_way,TLB_WAY_BITS));
    tlb_plru_victim(i) <= tlb_plru_out;
    end process;
end generate;
end generate;
```

```dcache.py
# Generate TLB PLRUs
def maybe_tlb_plrus(self, m, r1, tlb_plru_victim, acc, acc_en, lru):
    comb = m.d.comb
    sync = m.d.sync
    
    with m.If(TLB_NUM_WAYS > 1):
        for i in range(TLB_SET_SIZE):

            # TLB PLRU interface
            comb += tlb_plru.eq(object)
            comb += tlb_plru.input.tlb_plru_acc.eq(
                     Signal(TLB_WAY_BITS)
                    )    
            comb += tlb_plru.input.tlb_plru_acc_en.eq(Signal())
            comb += tlb_plru.input.tlb_plru_out.eq(
                     Signal(TLB_WAY_BITS)
                    )    
            comb += tlb_plru.input.tlb_plru_acc.eq(acc)
            comb += tlb_plru.input.tlb_plru_acc_en.eq(acc_en)
            comb += tlb_plru.input.tlb_plru_out.eq(lru)
 
            with m.If(r1.tlb_hit_index == i):
                comb += tlb_plru.input.tlb_plru_acc_en.eq(
                         r1.tlb_hit
                        )    
 
            with m.Else():
                comb += tlb_plru.tlb_plru_acc_en.eq(0)
            
            comb += tlb_plru.input.tlb_plru_acc.eq(
                     r1.tlb_hit_way
                    )    
 
            comb += tlb_plru_victim[i].eq(
                     tlb_plru.input.tlb_plru_out
                    )    
```
Comment 13 Luke Kenneth Casson Leighton 2020-08-26 22:05:39 BST
(In reply to Cole Poirier from comment #12)
> Hi Luke, I'm struggling with translating the generate statements and
> generic/port maps, can you give me some feedback on whether the following is
> a correct translation? Thank you :)
> 
> ```dcache.vhdl
> -- Generate TLB PLRUs
> maybe_tlb_plrus: if TLB_NUM_WAYS > 1 generate
> begin
> tlb_plrus: for i in 0 to TLB_SET_SIZE - 1 generate
>     -- TLB PLRU interface
>     signal tlb_plru_acc    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
>     signal tlb_plru_acc_en : std_ulogic;
>     signal tlb_plru_out    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
> begin
> tlb_plru : entity work.plru
>     generic map (
>         BITS => TLB_WAY_BITS
>         )

tlb_plru = Plru(TLB_WAY_BITS)

>     port map (
>         clk => clk,
>         rst => rst,
>         acc => tlb_plru_acc,
>         acc_en => tlb_plru_acc_en,
>         lru => tlb_plru_out
>         );

comb += tlb_plru.acc.eq(tlb_plru_acc)
..
comb += tlb_plru_out.eq(tlb_plru.lru)
Comment 14 Luke Kenneth Casson Leighton 2020-08-26 22:06:55 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)

from wherever import Plru

...
...
    tlb_plru = Plru(TLB_WAY_BITS)
Comment 15 Luke Kenneth Casson Leighton 2020-08-30 12:38:39 BST
cole, only comb += defaults to equal-to-the-reset.

for sync the value is stored in a latch and so if it is not set it will
*remain* at the (incorrect) value.


    #                 r.req := d_in;
    #                 r.tlbie := '0';
    #                 r.doall := '0';
    #                 r.tlbld := '0';
    #                 r.mmu_req := '0';
                sync += r.req.eq(d_in)
                sync += r.req.tlbie.eq(0)
                sync += r.req.doall.eq(0)
                sync += r.req.tlbd.eq(0)
                sync += r.req.mmu_req.eq(0)
Comment 16 Luke Kenneth Casson Leighton 2020-08-30 12:47:45 BST
        index    = Signal(log2_int(TLB_SET_BITS), False)
        addrbits = Signal(TLB_SET_BITS)

        amin = TLB_LG_PGSZ
        amax = TLB_LG_PGSZ + TLB_SET_BITS

        with m.If(m_in.valid):
            comb += addrbits.eq(m_in.addr[amin : amax])
        with m.Else():
            comb += addrbits.eq(d_in.addr[amin : amax])
        comb += index.eq(addrbits)

        #             if r0_stall = '0' then
        # If we have any op and the previous op isn't finished,
        # then keep the same output for next cycle.
        with m.If(~r0_stall):
            sync += tlb_valid_way.eq(dtlb_valid_bits[index])
            sync += tlb_tag_way.eq(dtlb_tags[index])
            sync += tlb_pte_way.eq(dtlb_ptes[index])

previously, you had sync += addrbits and sync += index.

what this would do is:

* on the first clock cycle, m_in.valid would be tested
* on the second cycle, a value of addrbits would be valid
* on the third cycle, the value of addrbits would become valid in index
* on the fourth cycle, the test of r0_stall would finally result in
  tlb_*_way being updated...

... with values that were far, far too late.

if you look at the original dcache.vhdl you will see that index and addrbits
were "variables", not signals.  this is a sign that needs to be looked out for.
Comment 17 Luke Kenneth Casson Leighton 2020-08-30 12:55:09 BST
#                   (ROW_OFF_BITS-1 downto 0 => '0');
            comb += ra.eq(Cat(
                     Const(0, ROW_OFF_BITS),

you had a constant ROW_OFF_BITS of length ROW_OFF_BITS
Comment 18 Luke Kenneth Casson Leighton 2020-08-30 12:59:32 BST
    #         variable tagset : tlb_way_tags_t;
    #         variable pteset : tlb_way_ptes_t;
    #type tlb_tags_t is array(tlb_index_t) of tlb_way_tags_t;
    # --> Array([Signal(log(way_tags length)) for i in range(number of tlbs)])

a function called TLBWayTags() could i suppose be defined to return that
nmigen Array.
Comment 19 Luke Kenneth Casson Leighton 2020-09-07 13:14:09 BST
cole also needed is cache_ram.vhdl.
Comment 20 Luke Kenneth Casson Leighton 2020-09-07 13:22:09 BST
                comb += wr_addr.eq(r1.store_row)
                comb += wr_sel.eq(~0) # all 1s

#                 if r1.state = RELOAD_WAIT_ACK and
#                 wishbone_in.ack = '1' and replace_way = i then
            with m.If((r1.state == State.RELOAD_WAIT_ACK)
                      & wishbone_in.ack & (relpace_way == i)):

you need to make sure that any comparisons are in brackets when included
with "&" or "|" bit-wise operators.

the reason is because unlike "and" and "or", "&/|" are HIGHER precedence
than comparisons.

in python you would normally do "if x & mask == 0b110" and that would of
course be taken to mean "bit-wise AND x together THEN compare against 0b110".

so it is critically important to do "if (x & mask) == 0b110" and it is
ESPECIALLY important to do "value & (something == somethingelse)" otherwise
it is interpreted as "(value & something) == somethingelse"
Comment 21 Luke Kenneth Casson Leighton 2020-09-07 13:31:17 BST
                r1.forward_sel1 <= (others => '1');

that's not "just set the 1st bit to 1" it's "set *ALL* bits to 1".
whitequark informs us that the best way to do this is:

            sync += r1.forward_sel1.eq(~0) # all 1s

*definitely* not

            sync += r1.forward_sel1.eq(1) # sets only the LSB to 1.
Comment 22 Luke Kenneth Casson Leighton 2020-09-07 13:44:49 BST
#                     when OP_STORE_HIT | OP_STORE_MISS =>
                        with m.Case(Op.OP_STORE_HIT, Op.OP_STORE_MISS):

cases are not ORed together, they are comma-listed.
Comment 23 Luke Kenneth Casson Leighton 2020-09-07 14:01:29 BST
#               stbs_done := true;
                                sync += stbs_done.eq(0)


error.  should be eq(1)
Comment 24 Cole Poirier 2020-09-09 13:46:14 BST
Thank you for all the corrections luke *very* helpful! I actually saw all of these last week but wasn't at a computer and able to respond here on bugzilla. Will be reply 'noted.' to each comment that I had seen last week and thus tried to adhere to for the.. I'd say second half of icache.py's translation from icache.vhdl. I plan on going over all of mmu.py, dcache.py, and icache.py with my new understanding after finishing the translation of the last two processes from icache.vhdl.

(In reply to Luke Kenneth Casson Leighton from comment #15)
> cole, only comb += defaults to equal-to-the-reset.
> 
> for sync the value is stored in a latch and so if it is not set it will
> *remain* at the (incorrect) value.
> 
> 
>     #                 r.req := d_in;
>     #                 r.tlbie := '0';
>     #                 r.doall := '0';
>     #                 r.tlbld := '0';
>     #                 r.mmu_req := '0';
>                 sync += r.req.eq(d_in)
>                 sync += r.req.tlbie.eq(0)
>                 sync += r.req.doall.eq(0)
>                 sync += r.req.tlbd.eq(0)
>                 sync += r.req.mmu_req.eq(0)

Noted.

(In reply to Luke Kenneth Casson Leighton from comment #16)
>         index    = Signal(log2_int(TLB_SET_BITS), False)
>         addrbits = Signal(TLB_SET_BITS)
> 
>         amin = TLB_LG_PGSZ
>         amax = TLB_LG_PGSZ + TLB_SET_BITS
> 
>         with m.If(m_in.valid):
>             comb += addrbits.eq(m_in.addr[amin : amax])
>         with m.Else():
>             comb += addrbits.eq(d_in.addr[amin : amax])
>         comb += index.eq(addrbits)
> 
>         #             if r0_stall = '0' then
>         # If we have any op and the previous op isn't finished,
>         # then keep the same output for next cycle.
>         with m.If(~r0_stall):
>             sync += tlb_valid_way.eq(dtlb_valid_bits[index])
>             sync += tlb_tag_way.eq(dtlb_tags[index])
>             sync += tlb_pte_way.eq(dtlb_ptes[index])
> 
> previously, you had sync += addrbits and sync += index.
> 
> what this would do is:
> 
> * on the first clock cycle, m_in.valid would be tested
> * on the second cycle, a value of addrbits would be valid
> * on the third cycle, the value of addrbits would become valid in index
> * on the fourth cycle, the test of r0_stall would finally result in
>   tlb_*_way being updated...
> 
> ... with values that were far, far too late.
> 
> if you look at the original dcache.vhdl you will see that index and addrbits
> were "variables", not signals.  this is a sign that needs to be looked out
> for.

Hmmm. Very interesting! This is probably going to be the most difficult one for me to do properly. This was within an "if rising_edge(clk) then" statement correct? Is it the case that only signals are used with sync and variables are used with comb within a sync block?

(In reply to Luke Kenneth Casson Leighton from comment #17)
> #                   (ROW_OFF_BITS-1 downto 0 => '0');
>             comb += ra.eq(Cat(
>                      Const(0, ROW_OFF_BITS),
> 
> you had a constant ROW_OFF_BITS of length ROW_OFF_BITS

(In reply to Luke Kenneth Casson Leighton from comment #18)
>     #         variable tagset : tlb_way_tags_t;
>     #         variable pteset : tlb_way_ptes_t;
>     #type tlb_tags_t is array(tlb_index_t) of tlb_way_tags_t;
>     # --> Array([Signal(log(way_tags length)) for i in range(number of
> tlbs)])
> 
> a function called TLBWayTags() could i suppose be defined to return that
> nmigen Array.

Noted. I do this for a lot if not all the types that are "is array(*)" statements.

(In reply to Luke Kenneth Casson Leighton from comment #19)
> cole also needed is cache_ram.vhdl.

(In reply to Luke Kenneth Casson Leighton from comment #20)
>                 comb += wr_addr.eq(r1.store_row)
>                 comb += wr_sel.eq(~0) # all 1s
> 
> #                 if r1.state = RELOAD_WAIT_ACK and
> #                 wishbone_in.ack = '1' and replace_way = i then
>             with m.If((r1.state == State.RELOAD_WAIT_ACK)
>                       & wishbone_in.ack & (relpace_way == i)):
> 
> you need to make sure that any comparisons are in brackets when included
> with "&" or "|" bit-wise operators.
> 
> the reason is because unlike "and" and "or", "&/|" are HIGHER precedence
> than comparisons.
> 
> in python you would normally do "if x & mask == 0b110" and that would of
> course be taken to mean "bit-wise AND x together THEN compare against 0b110".
> 
> so it is critically important to do "if (x & mask) == 0b110" and it is
> ESPECIALLY important to do "value & (something == somethingelse)" otherwise
> it is interpreted as "(value & something) == somethingelse"

Very helpful. Noted. I will have to review the first half of icache.py as well as mmu and dcache for this specifically later.

(In reply to Luke Kenneth Casson Leighton from comment #21)
>                 r1.forward_sel1 <= (others => '1');
> 
> that's not "just set the 1st bit to 1" it's "set *ALL* bits to 1".
> whitequark informs us that the best way to do this is:
> 
>             sync += r1.forward_sel1.eq(~0) # all 1s
> 
> *definitely* not
> 
>             sync += r1.forward_sel1.eq(1) # sets only the LSB to 1.

Ahhhh. *THAT'S* what (others => 0 *or* 1) meant! Thank you, makes so much sense now.

(In reply to Luke Kenneth Casson Leighton from comment #22)
> #                     when OP_STORE_HIT | OP_STORE_MISS =>
>                         with m.Case(Op.OP_STORE_HIT, Op.OP_STORE_MISS):
> 
> cases are not ORed together, they are comma-listed.

Ah thank you, will note for my full review later today.

(In reply to Luke Kenneth Casson Leighton from comment #23)
> #               stbs_done := true;
>                                 sync += stbs_done.eq(0)
> 
> 
> error.  should be eq(1)

Thank you.
Comment 25 Luke Kenneth Casson Leighton 2020-09-09 14:13:53 BST
(In reply to Cole Poirier from comment #24)

> > if you look at the original dcache.vhdl you will see that index and addrbits
> > were "variables", not signals.  this is a sign that needs to be looked out
> > for.
> 
> Hmmm. Very interesting! This is probably going to be the most difficult one
> for me to do properly. This was within an "if rising_edge(clk) then"
> statement correct? 

yes.  it's.. tricky.

> Is it the case that only signals are used with sync and
> variables are used with comb within a sync block?

kinda although it's just not that straightfoward unfortunately.
"variables" do not
exist in nmigen as a concept: there are only Signals.  plus, the "sync"
domain acts differently from "if rising edge" in VHDL, although they
are similar.

basically everything in that "rising edge" block has to happen actually
at that "tick".

where in nmigen, the concept of a separate hidden "next state" is created
by sync and it is UTTERLY SEPARATE AND INACCESSIBLE to the current "tick"
because it is, quite literally, to become the "future" state.

i explain this more in comment #16.
Comment 26 Cole Poirier 2020-09-09 15:15:13 BST
(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Cole Poirier from comment #24)
> 
> > > if you look at the original dcache.vhdl you will see that index and addrbits
> > > were "variables", not signals.  this is a sign that needs to be looked out
> > > for.
> > 
> > Hmmm. Very interesting! This is probably going to be the most difficult one
> > for me to do properly. This was within an "if rising_edge(clk) then"
> > statement correct? 
> 
> yes.  it's.. tricky.

Indeed :O

> > Is it the case that only signals are used with sync and
> > variables are used with comb within a sync block?
> 
> kinda although it's just not that straightfoward unfortunately.
> "variables" do not
> exist in nmigen as a concept: there are only Signals.  plus, the "sync"
> domain acts differently from "if rising edge" in VHDL, although they
> are similar.
> 
> basically everything in that "rising edge" block has to happen actually
> at that "tick".
> 
> where in nmigen, the concept of a separate hidden "next state" is created
> by sync and it is UTTERLY SEPARATE AND INACCESSIBLE to the current "tick"
> because it is, quite literally, to become the "future" state.
> 
> i explain this more in comment #16.

That's very interesting and more than a little brain straining. I reviewed the diff of that commit using gitk (very cool tool by the way), it was very instructive, though given I still only partially understand the subtle differences between vhdl 'if rising_edge(clk) then' and nmigen 'sync' domains, I expect it's likely that you may have to correct me on this again :)
Comment 27 Luke Kenneth Casson Leighton 2020-09-10 18:38:43 BST
right, i've gone over dcache.py about 5 times now, just added a set of calls to string everything together, added cache_ram.py, last one is plru.py and then actually instantiating the module can be done and syntax errors tracked down.

for all of this, cole, i'm considering dropping in quite a big bit of the MMU budget, around the EUR 1600 mark, 50:50 because it's a heck of a lot of work.
Comment 28 Cole Poirier 2020-09-10 23:51:07 BST
(In reply to Luke Kenneth Casson Leighton from comment #27)
> right, i've gone over dcache.py about 5 times now, just added a set of calls
> to string everything together, added cache_ram.py, last one is plru.py and
> then actually instantiating the module can be done and syntax errors tracked
> down.

Cool! Thanks for all of your help, I'm learning an amazing amount in quite a short time due to that :)

> for all of this, cole, i'm considering dropping in quite a big bit of the
> MMU budget, around the EUR 1600 mark, 50:50 because it's a heck of a lot of
> work.

Sounds reasonable to me, the translation from vhdl was the easy part, linking it up properly and testing it should be quite the fun process ;)
Comment 29 Luke Kenneth Casson Leighton 2020-09-11 01:16:59 BST
well, it generates an il file, which is sbsolutely massive.  this is likely due to use of double nested arrays: bit_select and word_select on an already-indexed item.

the solution there is to go over them all carefully, first storing the 1st level indexed item in a temporary Signal followed second by updating the secondary level in the temporary.
Comment 30 Cole Poirier 2020-09-11 01:26:35 BST
(In reply to Luke Kenneth Casson Leighton from comment #29)
> well, it generates an il file, which is sbsolutely massive.  this is likely
> due to use of double nested arrays: bit_select and word_select on an
> already-indexed item.
> 
> the solution there is to go over them all carefully, first storing the 1st
> level indexed item in a temporary Signal followed second by updating the
> secondary level in the temporary.

Interesting. Lots of pmux's then? Can you show me an example of this pattern here?
Comment 31 Luke Kenneth Casson Leighton 2020-09-11 01:43:31 BST
(In reply to Cole Poirier from comment #30)

> Interesting. Lots of pmux's then?

no, much worse: massive arrays of arrays of pmuxes with hundreds of inputs

> Can you show me an example of this pattern
> here?

vvvvv
> > due to use of double nested arrays: bit_select and word_select on an
^^^^^^
Comment 32 Cole Poirier 2020-09-11 01:50:00 BST
(In reply to Luke Kenneth Casson Leighton from comment #31)
> (In reply to Cole Poirier from comment #30)
> 
> > Interesting. Lots of pmux's then?
> 
> no, much worse: massive arrays of arrays of pmuxes with hundreds of inputs

Ohhh boy...

> > Can you show me an example of this pattern
> > here?
> 
> vvvvv
> > > due to use of double nested arrays: bit_select and word_select on an
> ^^^^^^

Sorry I understood that part, I was insufficiently precise. I meant of the pattern to fix them re 
```
(In reply to Cole Poirier from comment #30)
> (In reply to Luke Kenneth Casson Leighton from comment #29)

> > the solution there is to go over them all carefully, first storing the 1st
> > level indexed item in a temporary Signal followed second by updating the
> > secondary level in the temporary.
```
Comment 33 Luke Kenneth Casson Leighton 2020-09-11 04:22:51 BST
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD#l672

brackets needed round this line
Comment 34 Luke Kenneth Casson Leighton 2020-09-11 11:26:47 BST
commit 9e2ae481cb2df9f4da44a1320ac446db550f753c (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Fri Sep 11 11:26:23 2020 +0100

    add brackets round if & in dcache
Comment 35 Luke Kenneth Casson Leighton 2020-09-11 21:47:27 BST
currently running with this patch:
https://gist.githubusercontent.com/jeanthom/f97f5b928720d4adda9d295e8a5bc078/raw/694274e0aceec993c0fc127e296b1a85b93c1b89/nmigen-display.diff

allows for debug info to be printed out
Comment 36 Luke Kenneth Casson Leighton 2020-09-12 17:11:20 BST
(In reply to Cole Poirier from comment #0)

> ***Must follow up on nmigen SRAM attributes***

bit of research shows that "ram_style" is an attribute hint passed through to synthesis.

further, that it is possible to pass a dictionary "attrs" to nmigen Signals.
Comment 37 Luke Kenneth Casson Leighton 2021-04-26 16:37:30 BST
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-April/002444.html

i have a suspicion as to what is going on.  that there is not a problem
with dcache.py: instead it is part of Wishbone WB3 (simple) WB4 (pipeline)
interaction.

WB4 spec, section 5.3

5.3     Simple slaves directly supporting pipelined mode

Even though not initially designed for pipelined mode many simple wishbone slaves
directly support this mode. For this to be true the following characteristics must be true

   1. [ACK_O] should be registered
   2. a read or write classic bus cycle should always occupy two clock cycles
   3. actual write to internal register must be done on rising clock edge 1.
       See figure 5-3 below

Master input signal [STALL_I] should be tied low, inactive, statically.
Comment 38 Cesar Strauss 2021-05-01 23:36:56 BST
I think this solution, which is commented out in dcache.py, is actually correct:

comb += self.wb_in.stall.eq(self.wb_out.cyc & ~self.wb_in.ack)

This interfaces a Classic slave with a Pipelined master (Wishbone B.4 5.2.1)

The reason it doesn't work, as I see it, is because the SRAM is not handling the Classical cycle correctly.

In sram.py, the following change is needed:

         # generate ack (no "pipeline" mode here)
         m.d.sync += self.bus.ack.eq(0)
-        with m.If(self.bus.cyc & self.bus.stb):
+        with m.If(self.bus.cyc & self.bus.stb & ~self.bus.ack):
             m.d.sync += self.bus.ack.eq(1)

Otherwise, in a Classic cycle, it will send an extra ack.

I will take a closer look at the SRAM test cases, in src/soc/bus/test/test_sram_wishbone.py, to confirm it.

With both of the above changes, the DCache still works, but has a wait cycle in every transfer, due to limitations of the Classical cycle.

On the other hand, we could keep it as is, but then we need to add a zeroed stall signal to the SRAM wishbone interface, since we are effectively in Pipeline mode. This is why the current DCache code works. But other masters, that expect a Classical interface from the SRAM, may error out due to the extra ack. 

Anyway, I think it would be much better if we could have an SRAM with Pipelined mode, to avoid the extra wait states, and convert it to Classical when really needed.
Comment 39 Luke Kenneth Casson Leighton 2021-05-02 00:03:15 BST
(In reply to Cesar Strauss from comment #38)
> I think this solution, which is commented out in dcache.py, is actually
> correct:
> 
> comb += self.wb_in.stall.eq(self.wb_out.cyc & ~self.wb_in.ack)
> 
> This interfaces a Classic slave with a Pipelined master (Wishbone B.4 5.2.1)
> 
> The reason it doesn't work, as I see it, is because the SRAM is not handling
> the Classical cycle correctly.

ahh

> With both of the above changes, the DCache still works, but has a wait cycle
> in every transfer, due to limitations of the Classical cycle.

that's not a problem right now.

> On the other hand, we could keep it as is, but then we need to add a zeroed
> stall signal to the SRAM wishbone interface, since we are effectively in
> Pipeline mode. 

i would like to have some unit tests where the SRAM can randomly set the stall condition.

> Anyway, I think it would be much better if we could have an SRAM with
> Pipelined mode, to avoid the extra wait states, and convert it to Classical
> when really needed.

yes.  this can be done by detecting if the  wb bus has a stall record attribute

  if hasattr(bus, "stall"):
     ...
Comment 40 Luke Kenneth Casson Leighton 2021-05-02 00:03:26 BST

    
Comment 41 Luke Kenneth Casson Leighton 2021-05-02 12:46:37 BST
currently with those ack / stall fixes the litexsim bios can at least report on-screen startup.  the CRC32 however fails.

earlier testing last year had a similar issue, so i inserted debug prints which allowed tracking of CRC calculation.

a comparison with and without dcache should help.
Comment 42 Luke Kenneth Casson Leighton 2021-05-02 19:35:38 BST
i just added an option to the load/store functions to set "nc" (no cache).  some unit tests need to be added for this.

also we need much larger (wider) memory tests (32 bit addresses), here it would be better to use the wb_get function used elsewhere, which will cope with storing data sets in a dictionary (as a sparse array) rather than a massive flat list which will run out of memory.
Comment 43 Luke Kenneth Casson Leighton 2021-05-07 04:12:45 BST
an assessment on how to remove one clock cycle latency on response from the cache srams

http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2021-May/000262.html