The goal is to achieve a working D-cache by translating microwatt's vhdl dcache code.

***Must follow up on nmigen SRAM attributes***

Also TODO: cache_ram.vhdl

https://github.com/paulusmack/microwatt/blob/f636bb7c3999d9326a2bd1c6131fc128be2cae24/dcache.vhdl
Sorry, I just now committed the work I've done on this translation from yesterday and today. I was working yesterday and had to leave the house immediately once I noticed the time, in order to not miss an appointment, so I didn't have time to properly review my git diff and write a commit message. Again, apologies.

Luke, can you give me some guidance on this, specifically relating to translating vhdl generate blocks, port maps, and generic maps into nmigen? (and a few other things that I've skipped due to lack of understanding but are marked in the file with comments that contain the keyword TODO). Thanks in advance, as I will see your comments in the morning :)
done some, you will need to read, extrapolate and resolve others.

* range(THE_CONSTANT) not range(SomeFunctionReturningASignal())
* remove all default values of all function parameter arguments
* remove "severity failure" and put comments either back on one line or use backslash and two strings, one per line
# subtype way_t is integer range 0 to NUM_WAYS-1;

therefore wherever you *see* way_t, replace it with "NUM_WAYS".

# hit_way : way_t;
self.hit_way = Signal(WAY)

no - this needs to be:

self.hit_way = Signal(NUM_WAYS)

likewise:

# forward_row1 : row_t;
self.forward_row1 = Signal(ROW)

no - look at the definition of row_t:

# subtype row_t is integer range 0 to BRAM_ROWS-1;

therefore that needs to be:

self.forward_row1 = Signal(BRAM_ROWS)

etc. etc. exactly the same for INDEX, and probably CACHE_TAG as well.
right, ok, not quite correct: log2_int(NUM_WAYS), etc.

i suggest creating some constants, e.g. szNUM_WAYS = log2_int(NUM_WAYS), etc.
(In reply to Luke Kenneth Casson Leighton from comment #3)
> # subtype way_t is integer range 0 to NUM_WAYS-1;
>
> therefore wherever you *see* way_t, replace it with "NUM_WAYS".
>
> # hit_way : way_t;
> self.hit_way = Signal(WAY)
>
> no - this needs to be:
>
> self.hit_way = Signal(NUM_WAYS)
>
> likewise:
>
> # forward_row1 : row_t;
> self.forward_row1 = Signal(ROW)
>
> no - look at the definition of row_t:
>
> # subtype row_t is integer range 0 to BRAM_ROWS-1;
>
> therefore that needs to be:
>
> self.forward_row1 = Signal(BRAM_ROWS)
>
> etc. etc. exactly the same for INDEX, and probably CACHE_TAG as well.

So each subtype, unless it is an array, is simply an alias for the constant it specifies, either directly or through one or more levels of subtyping indirection?
(In reply to Luke Kenneth Casson Leighton from comment #4)
> right, ok, not quite correct.
> log2_int(NUM_WAYS).
>
> etc.
>
> i suggest creating some constante szNUM_WAYS = log2...etc

Sorry, I'm having trouble understanding what you're trying to convey here. Can you please explain it a bit more verbosely?
(In reply to Cole Poirier from comment #1)
> Luke, can you give me some guidance on this specifically relating to
> translating vhdl generate blocks, port maps, and generic maps into nmigen?
> (and a few other things that I've skipped due to lack of understanding but
> are marked in the file with comments that contain the keyword TODO).

I've finished my first translation pass of dcache.vhdl -> dcache.py. As I stated in this commit, I've left around 5% undone, as it involved generate and instance statements that I couldn't figure out how to translate and would like some help and guidance on. As I said in my earlier comment (quoted above), this also includes inputs to modules in the form of port and generic maps.

Additionally, I'm 100% sure I'm not setting up the classes/modules right, and looking at mmu.py gives me a good idea of how to fix this; I just don't have the energy left to do this today. So, can you provide some corrections and help with the things I've asked for (marked in the file with comments containing the keyword TODO), and leave for me to attempt tomorrow the rearrangement of the things I created classes for into functions, as they are in mmu.py?

Thanks, as always, for your help and guidance :)
szNUM_WAYS = log2_int(NUM_WAYS)
self.hit_way = Signal(szNUM_WAYS)
(In reply to Cole Poirier from comment #7) > Thanks, as always for your help and guidance :) 3am at the moment, will do some tomorrow.
(In reply to Luke Kenneth Casson Leighton from comment #8)
> szNUM_WAYS = log2_int(NUM_WAYS)
> self.hit_way = Signal(szNUM_WAYS)

commit b8e1e712b6aa41fa2f084bceb808672804a4abdc (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Aug 24 12:37:32 2020 +0100

    use WAY_BITS in appropriate locations

ok.

* numbers are in binary and way_t is declared as 0..NUM_WAYS-1
* therefore it takes log2_int(NUM_WAYS) bits to *store* a number in a Signal that will cover all values 0..NUM_WAYS-1
* turns out that WAY_BITS has already been assigned to log2_int(NUM_WAYS)
* therefore Signals of type "way_t" must be declared as Signal(WAY_BITS)

you'll need to be very patient and focussed, going step-by-step and blocking out everything other than what you are focussing on.

bear in mind that a for-loop from 0 to NUM_WAYS-1 is *not* to be replaced with WAY_BITS, i.e. this is... errr... not correct:

# for i in way_t loop
for i in range(WAY):

_this_ is correct:

# for i in way_t loop
for i in range(NUM_WAYS):

commit b45fab94d8370ef8297b9bb303bf0d6fc161193e (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Mon Aug 24 12:43:34 2020 +0100

    "WAY" does not exist - range(NUM_WAYS) was intended
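The sizing rule above can be made concrete in plain Python (NUM_WAYS = 4 is a hypothetical example value; the real constants live in dcache.py, and nmigen's log2_int is shown here as an explicit bit-length calculation):

```python
# hypothetical example value -- the real constant lives in dcache.py
NUM_WAYS = 4

# what nmigen's log2_int(NUM_WAYS) computes for a power of two:
# the number of bits needed to store any value 0..NUM_WAYS-1
WAY_BITS = (NUM_WAYS - 1).bit_length()

# a Signal holding a way *number* needs WAY_BITS bits:
#     self.hit_way = Signal(WAY_BITS)
# but a loop over the ways still iterates NUM_WAYS times:
#     for i in range(NUM_WAYS): ...
way_numbers = list(range(NUM_WAYS))
```

So 4 ways need only a 2-bit Signal, yet the loop body runs 4 times — the two constants are related but never interchangeable.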
Ok, done all the changes you recommended, and rearranged the file to conform to the example you showed by transmuting my original mmu.py into the current form of mmu.py. Now I *really* need your help with the substance of this module as there are things I simply do not know how to do, all such things are marked with a comment containing the keyword TODO.
Hi Luke, I'm struggling with translating the generate statements and generic/port maps, can you give me some feedback on whether the following is a correct translation? Thank you :)

```dcache.vhdl
-- Generate TLB PLRUs
maybe_tlb_plrus: if TLB_NUM_WAYS > 1 generate
begin
    tlb_plrus: for i in 0 to TLB_SET_SIZE - 1 generate
        -- TLB PLRU interface
        signal tlb_plru_acc    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
        signal tlb_plru_acc_en : std_ulogic;
        signal tlb_plru_out    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
    begin
        tlb_plru : entity work.plru
            generic map (
                BITS => TLB_WAY_BITS
            )
            port map (
                clk => clk,
                rst => rst,
                acc => tlb_plru_acc,
                acc_en => tlb_plru_acc_en,
                lru => tlb_plru_out
            );
        process(all)
        begin
            -- PLRU interface
            if r1.tlb_hit_index = i then
                tlb_plru_acc_en <= r1.tlb_hit;
            else
                tlb_plru_acc_en <= '0';
            end if;
            tlb_plru_acc <= std_ulogic_vector(to_unsigned(r1.tlb_hit_way, TLB_WAY_BITS));
            tlb_plru_victim(i) <= tlb_plru_out;
        end process;
    end generate;
end generate;
```

```dcache.py
# Generate TLB PLRUs
def maybe_tlb_plrus(self, m, r1, tlb_plru_victim, acc, acc_en, lru):
    comb = m.d.comb
    sync = m.d.sync

    with m.If(TLB_NUM_WAYS > 1):
        for i in range(TLB_SET_SIZE):
            # TLB PLRU interface
            comb += tlb_plru.eq(object)
            comb += tlb_plru.input.tlb_plru_acc.eq(Signal(TLB_WAY_BITS))
            comb += tlb_plru.input.tlb_plru_acc_en.eq(Signal())
            comb += tlb_plru.input.tlb_plru_out.eq(Signal(TLB_WAY_BITS))
            comb += tlb_plru.input.tlb_plru_acc.eq(acc)
            comb += tlb_plru.input.tlb_plru_acc_en.eq(acc_en)
            comb += tlb_plru.input.tlb_plru_out.eq(lru)
            with m.If(r1.tlb_hit_index == i):
                comb += tlb_plru.input.tlb_plru_acc_en.eq(r1.tlb_hit)
            with m.Else():
                comb += tlb_plru.tlb_plru_acc_en.eq(0)
            comb += tlb_plru.input.tlb_plru_acc.eq(r1.tlb_hit_way)
            comb += tlb_plru_victim[i].eq(tlb_plru.input.tlb_plru_out)
```
(In reply to Cole Poirier from comment #12)
> Hi Luke, I'm struggling with translating the generate statements and
> generic/port maps, can you give me some feedback on whether the following is
> a correct translation? Thank you :)
>
> ```dcache.vhdl
> -- Generate TLB PLRUs
> maybe_tlb_plrus: if TLB_NUM_WAYS > 1 generate
> begin
>     tlb_plrus: for i in 0 to TLB_SET_SIZE - 1 generate
>         -- TLB PLRU interface
>         signal tlb_plru_acc    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
>         signal tlb_plru_acc_en : std_ulogic;
>         signal tlb_plru_out    : std_ulogic_vector(TLB_WAY_BITS-1 downto 0);
>     begin
>         tlb_plru : entity work.plru
>             generic map (
>                 BITS => TLB_WAY_BITS
>             )

tlb_plru = Plru(TLB_WAY_BITS)

>             port map (
>                 clk => clk,
>                 rst => rst,
>                 acc => tlb_plru_acc,
>                 acc_en => tlb_plru_acc_en,
>                 lru => tlb_plru_out
>             );

comb += tlb_plru.acc.eq(tlb_plru_acc)
..
comb += tlb_plru_out.eq(tlb_plru.lru)
(In reply to Luke Kenneth Casson Leighton from comment #13)

from wherever import Plru
...
...
tlb_plru = Plru(TLB_WAY_BITS)
cole, only comb += defaults to equal-to-the-reset.

for sync the value is stored in a latch and so if it is not set it will *remain* at the (incorrect) value.

# r.req := d_in;
# r.tlbie := '0';
# r.doall := '0';
# r.tlbld := '0';
# r.mmu_req := '0';
sync += r.req.eq(d_in)
sync += r.req.tlbie.eq(0)
sync += r.req.doall.eq(0)
sync += r.req.tlbd.eq(0)
sync += r.req.mmu_req.eq(0)
index = Signal(log2_int(TLB_SET_BITS), False)
addrbits = Signal(TLB_SET_BITS)

amin = TLB_LG_PGSZ
amax = TLB_LG_PGSZ + TLB_SET_BITS

with m.If(m_in.valid):
    comb += addrbits.eq(m_in.addr[amin : amax])
with m.Else():
    comb += addrbits.eq(d_in.addr[amin : amax])
comb += index.eq(addrbits)

# if r0_stall = '0' then
# If we have any op and the previous op isn't finished,
# then keep the same output for next cycle.
with m.If(~r0_stall):
    sync += tlb_valid_way.eq(dtlb_valid_bits[index])
    sync += tlb_tag_way.eq(dtlb_tags[index])
    sync += tlb_pte_way.eq(dtlb_ptes[index])

previously, you had sync += addrbits and sync += index. what this would do is:

* on the first clock cycle, m_in.valid would be tested
* on the second cycle, a value of addrbits would be valid
* on the third cycle, the value of addrbits would become valid in index
* on the fourth cycle, the test of r0_stall would finally result in tlb_*_way being updated...

... with values that were far, far too late.

if you look at the original dcache.vhdl you will see that index and addrbits were "variables", not signals. this is a sign that needs to be looked out for.
# (ROW_OFF_BITS-1 downto 0 => '0');
comb += ra.eq(Cat(
         Const(0, ROW_OFF_BITS),

you had a Const of *value* ROW_OFF_BITS with length ROW_OFF_BITS
# variable tagset : tlb_way_tags_t;
# variable pteset : tlb_way_ptes_t;
# type tlb_tags_t is array(tlb_index_t) of tlb_way_tags_t;
#  --> Array([Signal(log(way_tags length)) for i in range(number of tlbs)])

a function called TLBWayTags() could i suppose be defined to return that nmigen Array.
cole also needed is cache_ram.vhdl.
comb += wr_addr.eq(r1.store_row)
comb += wr_sel.eq(~0) # all 1s

# if r1.state = RELOAD_WAIT_ACK and
#    wishbone_in.ack = '1' and replace_way = i then
with m.If((r1.state == State.RELOAD_WAIT_ACK)
          & wishbone_in.ack & (replace_way == i)):

you need to make sure that any comparisons are in brackets when included with "&" or "|" bit-wise operators.

the reason is because unlike "and" and "or", "&"/"|" are HIGHER precedence than comparisons.

in python you would normally do "if x & mask == 0b110" and that would of course be taken to mean "bit-wise AND x with mask THEN compare against 0b110".

so it is critically important to do "if (x & mask) == 0b110" and it is ESPECIALLY important to do "value & (something == somethingelse)", otherwise it is interpreted as "(value & something) == somethingelse"
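The precedence trap above can be demonstrated in plain Python with ordinary integers (the values here are made up for illustration); the precedence rules are identical for nmigen Signal expressions, which is why the brackets matter:

```python
# & binds tighter than ==, so without brackets the expression
# is parsed as (v & a) == b -- NOT v & (a == b)
a, b, v = 3, 3, 1

intended = v & (a == b)   # 1 & True -> 1 (condition "v set AND a equals b")
misparsed = v & a == b    # parsed as (v & a) == b -> (1 & 3) == 3 -> False
```

With nmigen the mis-parse is silent — it still elaborates, just with the wrong logic — so the bracketed form must be used everywhere comparisons mix with `&` or `|`.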
r1.forward_sel1 <= (others => '1');

that's not "just set the 1st bit to 1", it's "set *ALL* bits to 1". whitequark informs us that the best way to do this is:

sync += r1.forward_sel1.eq(~0) # all 1s

*definitely* not

sync += r1.forward_sel1.eq(1) # sets only the LSB to 1.
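Why `~0` works can be shown in plain Python by applying the width-truncation that nmigen performs when assigning to a Signal (the function here is an illustrative stand-in, not an nmigen API):

```python
def all_ones(width):
    # ~0 is -1 in Python (infinite ones in two's complement);
    # nmigen truncates it to the Signal's width, yielding all-ones.
    # the mask below makes that truncation explicit.
    return ~0 & ((1 << width) - 1)
```

So `eq(~0)` fills every bit regardless of the Signal's width, whereas `eq(1)` sets only bit 0.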
# when OP_STORE_HIT | OP_STORE_MISS =>
with m.Case(Op.OP_STORE_HIT, Op.OP_STORE_MISS):

cases are not ORed together, they are comma-listed.
# stbs_done := true;
sync += stbs_done.eq(0)

error. should be eq(1)
Thank you for all the corrections Luke, *very* helpful! I actually saw all of these last week but wasn't at a computer and able to respond here on bugzilla. I will reply 'noted.' to each comment that I had seen last week and thus tried to adhere to for, I'd say, the second half of icache.py's translation from icache.vhdl. I plan on going over all of mmu.py, dcache.py, and icache.py with my new understanding after finishing the translation of the last two processes from icache.vhdl.

(In reply to Luke Kenneth Casson Leighton from comment #15)
> cole, only comb += defaults to equal-to-the-reset.
>
> for sync the value is stored in a latch and so if it is not set it will
> *remain* at the (incorrect) value.
>
> # r.req := d_in;
> # r.tlbie := '0';
> # r.doall := '0';
> # r.tlbld := '0';
> # r.mmu_req := '0';
> sync += r.req.eq(d_in)
> sync += r.req.tlbie.eq(0)
> sync += r.req.doall.eq(0)
> sync += r.req.tlbd.eq(0)
> sync += r.req.mmu_req.eq(0)

Noted.

(In reply to Luke Kenneth Casson Leighton from comment #16)
> index = Signal(log2_int(TLB_SET_BITS), False)
> addrbits = Signal(TLB_SET_BITS)
>
> amin = TLB_LG_PGSZ
> amax = TLB_LG_PGSZ + TLB_SET_BITS
>
> with m.If(m_in.valid):
>     comb += addrbits.eq(m_in.addr[amin : amax])
> with m.Else():
>     comb += addrbits.eq(d_in.addr[amin : amax])
> comb += index.eq(addrbits)
>
> # if r0_stall = '0' then
> # If we have any op and the previous op isn't finished,
> # then keep the same output for next cycle.
> with m.If(~r0_stall):
>     sync += tlb_valid_way.eq(dtlb_valid_bits[index])
>     sync += tlb_tag_way.eq(dtlb_tags[index])
>     sync += tlb_pte_way.eq(dtlb_ptes[index])
>
> previously, you had sync += addrbits and sync += index.
>
> what this would do is:
>
> * on the first clock cycle, m_in.valid would be tested
> * on the second cycle, a value of addrbits would be valid
> * on the third cycle, the value of addrbits would become valid in index
> * on the fourth cycle, the test of r0_stall would finally result in
>   tlb_*_way being updated...
>
> ... with values that were far, far too late.
>
> if you look at the original dcache.vhdl you will see that index and addrbits
> were "variables", not signals. this is a sign that needs to be looked out
> for.

Hmmm. Very interesting! This is probably going to be the most difficult one for me to do properly. This was within an "if rising_edge(clk) then" statement, correct? Is it the case that only signals are used with sync, and variables are used with comb, within a sync block?

(In reply to Luke Kenneth Casson Leighton from comment #17)
> # (ROW_OFF_BITS-1 downto 0 => '0');
> comb += ra.eq(Cat(
>          Const(0, ROW_OFF_BITS),
>
> you had a constant ROW_OFF_BITS of length ROW_OFF_BITS

(In reply to Luke Kenneth Casson Leighton from comment #18)
> # variable tagset : tlb_way_tags_t;
> # variable pteset : tlb_way_ptes_t;
> # type tlb_tags_t is array(tlb_index_t) of tlb_way_tags_t;
> #  --> Array([Signal(log(way_tags length)) for i in range(number of tlbs)])
>
> a function called TLBWayTags() could i suppose be defined to return that
> nmigen Array.

Noted. I do this for a lot, if not all, of the types that are "is array(*)" statements.

(In reply to Luke Kenneth Casson Leighton from comment #19)
> cole also needed is cache_ram.vhdl.

(In reply to Luke Kenneth Casson Leighton from comment #20)
> comb += wr_addr.eq(r1.store_row)
> comb += wr_sel.eq(~0) # all 1s
>
> # if r1.state = RELOAD_WAIT_ACK and
> #    wishbone_in.ack = '1' and replace_way = i then
> with m.If((r1.state == State.RELOAD_WAIT_ACK)
>           & wishbone_in.ack & (replace_way == i)):
>
> you need to make sure that any comparisons are in brackets when included
> with "&" or "|" bit-wise operators.
>
> the reason is because unlike "and" and "or", "&/|" are HIGHER precedence
> than comparisons.
>
> in python you would normally do "if x & mask == 0b110" and that would of
> course be taken to mean "bit-wise AND x together THEN compare against 0b110".
>
> so it is critically important to do "if (x & mask) == 0b110" and it is
> ESPECIALLY important to do "value & (something == somethingelse)" otherwise
> it is interpreted as "(value & something) == somethingelse"

Very helpful. Noted. I will have to review the first half of icache.py, as well as mmu and dcache, for this specifically later.

(In reply to Luke Kenneth Casson Leighton from comment #21)
> r1.forward_sel1 <= (others => '1');
>
> that's not "just set the 1st bit to 1" it's "set *ALL* bits to 1".
> whitequark informs us that the best way to do this is:
>
> sync += r1.forward_sel1.eq(~0) # all 1s
>
> *definitely* not
>
> sync += r1.forward_sel1.eq(1) # sets only the LSB to 1.

Ahhhh. *THAT'S* what (others => 0 *or* 1) meant! Thank you, makes so much sense now.

(In reply to Luke Kenneth Casson Leighton from comment #22)
> # when OP_STORE_HIT | OP_STORE_MISS =>
> with m.Case(Op.OP_STORE_HIT, Op.OP_STORE_MISS):
>
> cases are not ORed together, they are comma-listed.

Ah thank you, will note for my full review later today.

(In reply to Luke Kenneth Casson Leighton from comment #23)
> # stbs_done := true;
> sync += stbs_done.eq(0)
>
> error. should be eq(1)

Thank you.
(In reply to Cole Poirier from comment #24)
> > if you look at the original dcache.vhdl you will see that index and addrbits
> > were "variables", not signals. this is a sign that needs to be looked out
> > for.
>
> Hmmm. Very interesting! This is probably going to be the most difficult one
> for me to do properly. This was within an "if rising_edge(clk) then"
> statement correct?

yes. it's.. tricky.

> Is it the case that only signals are used with sync and
> variables are used with comb within a sync block?

kinda, although it's just not that straightforward unfortunately. "variables" do not exist in nmigen as a concept: there are only Signals. plus, the "sync" domain acts differently from "if rising edge" in VHDL, although they are similar.

basically everything in that "rising edge" block has to happen actually at that "tick".

whereas in nmigen, the concept of a separate hidden "next state" is created by sync, and it is UTTERLY SEPARATE AND INACCESSIBLE to the current "tick" because it is, quite literally, to become the "future" state.

i explain this more in comment #16.
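The hidden "next state" idea can be sketched in plain Python (a toy model, not nmigen itself: `tick` stands in for the clock edge, and the dict stands in for the Signals of a sync domain):

```python
def tick(state, updates):
    # sync-domain sketch: assignments accumulate into a hidden
    # "next" state that only becomes visible after the clock tick;
    # the current state is never modified mid-tick
    nxt = dict(state)
    nxt.update(updates)
    return nxt

state = {"addrbits": 0, "index": 0}
state2 = tick(state, {"addrbits": 5})
# within the original tick, state["addrbits"] still reads 0;
# only after the tick does the new value 5 appear (in state2)
```

This is why chaining `sync += index.eq(addrbits)` after `sync += addrbits.eq(...)` costs an extra cycle per hop: each sync assignment only lands in the *next* tick's state.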
(In reply to Luke Kenneth Casson Leighton from comment #25)
> > Hmmm. Very interesting! This is probably going to be the most difficult one
> > for me to do properly. This was within an "if rising_edge(clk) then"
> > statement correct?
>
> yes. it's.. tricky.

Indeed :O

> > Is it the case that only signals are used with sync and
> > variables are used with comb within a sync block?
>
> kinda although it's just not that straightfoward unfortunately.
> "variables" do not exist in nmigen as a concept: there are only Signals.
> plus, the "sync" domain acts differently from "if rising edge" in VHDL,
> although they are similar.
>
> basically everything in that "rising edge" block has to happen actually
> at that "tick".
>
> where in nmigen, the concept of a separate hidden "next state" is created
> by sync and it is UTTERLY SEPARATE AND INACCESSIBLE to the current "tick"
> because it is, quite literally, to become the "future" state.
>
> i explain this more in comment #16.

That's very interesting and more than a little brain-straining. I reviewed the diff of that commit using gitk (very cool tool by the way); it was very instructive, though given I still only partially understand the subtle differences between vhdl 'if rising_edge(clk) then' and nmigen 'sync' domains, I expect it's likely that you may have to correct me on this again :)
right, i've gone over dcache.py about 5 times now, just added a set of calls to string everything together, added cache_ram.py, last one is plru.py and then actually instantiating the module can be done and syntax errors tracked down. for all of this, cole, i'm considering dropping in quite a big bit of the MMU budget, around the EUR 1600 mark, 50:50 because it's a heck of a lot of work.
(In reply to Luke Kenneth Casson Leighton from comment #27)
> right, i've gone over dcache.py about 5 times now, just added a set of calls
> to string everything together, added cache_ram.py, last one is plru.py and
> then actually instantiating the module can be done and syntax errors tracked
> down.

Cool! Thanks for all of your help, I'm learning an amazing amount in quite a short time due to that :)

> for all of this, cole, i'm considering dropping in quite a big bit of the
> MMU budget, around the EUR 1600 mark, 50:50 because it's a heck of a lot of
> work.

Sounds reasonable to me; the translation from vhdl was the easy part, linking it up properly and testing it should be quite the fun process ;)
well, it generates an il file which is absolutely massive. this is likely due to the use of double-nested arrays: bit_select and word_select on an already-indexed item.

the solution there is to go over them all carefully: first store the 1st-level indexed item in a temporary Signal, then do the 2nd-level selection within that temporary.
(In reply to Luke Kenneth Casson Leighton from comment #29)
> well, it generates an il file, which is sbsolutely massive. this is likely
> due to use of double nested arrays: bit_select and word_select on an
> already-indexed item.
>
> the solution there is to go over them all carefully, first storing the 1st
> level indexed item in a temporary Signal followed second by updating the
> secondary level in the temporary.

Interesting. Lots of pmux's then? Can you show me an example of this pattern here?
(In reply to Cole Poirier from comment #30)

> Interesting. Lots of pmux's then?

no, much worse: massive arrays of arrays of pmuxes with hundreds of inputs

> Can you show me an example of this pattern
> here?

vvvvv

> > due to use of double nested arrays: bit_select and word_select on an

^^^^^^
(In reply to Luke Kenneth Casson Leighton from comment #31)
> (In reply to Cole Poirier from comment #30)
>
> > Interesting. Lots of pmux's then?
>
> no, much worse: massive arrays of arrays of pmuxes with hundreds of inputs

Ohhh boy...

> > Can you show me an example of this pattern
> > here?
>
> vvvvv
>
> > > due to use of double nested arrays: bit_select and word_select on an
> ^^^^^^

Sorry, I understood that part; I was insufficiently precise. I meant an example of the pattern to *fix* them, re:

> (In reply to Luke Kenneth Casson Leighton from comment #29)
> > the solution there is to go over them all carefully, first storing the 1st
> > level indexed item in a temporary Signal followed second by updating the
> > secondary level in the temporary.
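The two-step "temporary" pattern from comment #29 can be sketched in plain Python with integer bit-slicing (TAG_WIDTH and the tag values are made-up illustration; in nmigen, step 1 would be `comb += tag_row.eq(cache_tags[index])` into a temporary Signal, and step 2 a `word_select` on that temporary only):

```python
TAG_WIDTH = 8                      # hypothetical width for illustration
tags = [0x11223344, 0xAABBCCDD]    # one packed tag row per cache index

def read_tag(index, way):
    # step 1: index the outer array ONCE into a temporary,
    # so the synthesiser builds a single row mux...
    tag_row = tags[index]
    # step 2: ...then word-select within the temporary only,
    # instead of a mux-of-muxes over the whole 2-D structure
    return (tag_row >> (way * TAG_WIDTH)) & ((1 << TAG_WIDTH) - 1)
```

Selecting `tags[index]` and the way in one combined expression forces a pmux over every (index, way) pair; splitting it into two stages keeps each mux small.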
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD#l672 brackets needed round this line
commit 9e2ae481cb2df9f4da44a1320ac446db550f753c (HEAD -> master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Fri Sep 11 11:26:23 2020 +0100

    add brackets round if & in dcache
currently running with this patch: https://gist.githubusercontent.com/jeanthom/f97f5b928720d4adda9d295e8a5bc078/raw/694274e0aceec993c0fc127e296b1a85b93c1b89/nmigen-display.diff allows for debug info to be printed out
(In reply to Cole Poirier from comment #0)
> ***Must follow up on nmigen SRAM attributes***

a bit of research shows that "ram_style" is an attribute hint passed through to synthesis; further, that it is possible to pass a dictionary "attrs" to nmigen Signals.
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-April/002444.html

i have a suspicion as to what is going on: there is not a problem with dcache.py, instead it is part of a Wishbone WB3 (classic) / WB4 (pipelined) interaction.

WB4 spec, section 5.3:

    5.3 Simple slaves directly supporting pipelined mode

    Even though not initially designed for pipelined mode, many simple
    wishbone slaves directly support this mode. For this to be true the
    following characteristics must be true:

    1. [ACK_O] should be registered
    2. a read or write classic bus cycle should always occupy two clock cycles
    3. actual write to internal register must be done on rising clock edge
       (see figure 5-3 below)

    Master input signal [STALL_I] should be tied low, inactive, statically.
I think this solution, which is commented out in dcache.py, is actually correct:

    comb += self.wb_in.stall.eq(self.wb_out.cyc & ~self.wb_in.ack)

This interfaces a Classic slave with a Pipelined master (Wishbone B.4, section 5.2.1).

The reason it doesn't work, as I see it, is because the SRAM is not handling the Classic cycle correctly. In sram.py, the following change is needed:

         # generate ack (no "pipeline" mode here)
         m.d.sync += self.bus.ack.eq(0)
    -    with m.If(self.bus.cyc & self.bus.stb):
    +    with m.If(self.bus.cyc & self.bus.stb & ~self.bus.ack):
             m.d.sync += self.bus.ack.eq(1)

Otherwise, in a Classic cycle, it will send an extra ack. I will take a closer look at the SRAM test cases, in src/soc/bus/test/test_sram_wishbone.py, to confirm it.

With both of the above changes, the DCache still works, but has a wait cycle in every transfer, due to limitations of the Classic cycle.

On the other hand, we could keep it as is, but then we need to add a zeroed stall signal to the SRAM wishbone interface, since we are effectively in Pipelined mode. This is why the current DCache code works. But other masters that expect a Classic interface from the SRAM may error out due to the extra ack.

Anyway, I think it would be much better if we could have an SRAM with Pipelined mode, to avoid the extra wait states, and convert it to Classic when really needed.
(In reply to Cesar Strauss from comment #38)
> I think this solution, which is commented out in dcache.py, is actually
> correct:
>
> comb += self.wb_in.stall.eq(self.wb_out.cyc & ~self.wb_in.ack)
>
> This interfaces a Classic slave with a Pipelined master (Wishbone B.4 5.2.1)
>
> The reason it doesn't work, as I see it, is because the SRAM is not handling
> the Classical cycle correctly.

ahh

> With both of the above changes, the DCache still works, but has a wait cycle
> in every transfer, due to limitations of the Classical cycle.

that's not a problem right now.

> On the other hand, we could keep it as is, but then we need to add a zeroed
> stall signal to the SRAM wishbone interface, since we are effectively in
> Pipeline mode.

i would like to have some unit tests where the SRAM can randomly set the stall condition.

> Anyway, I think it would be much better if we could have an SRAM with
> Pipelined mode, to avoid the extra wait states, and convert it to Classical
> when really needed.

yes. this can be done by detecting if the wb bus has a stall record attribute:

    if hasattr(bus, "stall"):
        ...
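The hasattr-based detection can be sketched in plain Python, with SimpleNamespace standing in for a nmigen wishbone Record (the function name and the boolean assignments are illustrative only; in the real code the stall line would be driven with `comb += bus.stall.eq(...)`):

```python
from types import SimpleNamespace

def drive_stall(bus):
    # only drive stall when the bus record actually has one
    # (WB4 pipelined); a WB3 classic record has no stall attribute
    if hasattr(bus, "stall"):
        bus.stall = bus.cyc and not bus.ack
        return True
    return False

# stand-ins for a pipelined (WB4) and a classic (WB3) bus record
wb4 = SimpleNamespace(cyc=True, ack=False, stall=False)
wb3 = SimpleNamespace(cyc=True, ack=False)
```

This keeps one code path for both bus flavours: the pipelined logic is simply skipped when the record has no stall field.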
currently, with those ack/stall fixes, the litexsim bios can at least report on-screen startup. the CRC32 however fails. earlier testing last year had a similar issue, so i inserted debug prints which allowed tracking of the CRC calculation. a comparison with and without dcache should help.
i just added an option to the load/store functions to set "nc" (no cache). some unit tests need to be added for this. also we need much larger (wider) memory tests (32 bit addresses), here it would be better to use the wb_get function used elsewhere, which will cope with storing data sets in a dictionary (as a sparse array) rather than a massive flat list which will run out of memory.
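The sparse-array idea can be sketched in plain Python (a hypothetical helper in the spirit of the wb_get approach described above, not the actual soc.git function):

```python
class SparseMem:
    """Sparse memory model keyed by word address, so a 32-bit address
    space doesn't need a massive flat list that runs out of memory."""
    def __init__(self, default=0):
        self.mem = {}          # only addresses actually written are stored
        self.default = default

    def st(self, addr, data):
        self.mem[addr] = data

    def ld(self, addr):
        # unwritten addresses read back as the default value
        return self.mem.get(addr, self.default)

mem = SparseMem()
mem.st(0xFFFF_FFF0, 0xDEADBEEF)   # a high 32-bit address costs one dict entry
```

A flat list covering the same range would need ~4 billion entries; the dict stores only what the test actually touches.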
an assessment on how to remove one clock cycle latency on response from the cache srams http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2021-May/000262.html
donating my amount to jacob
just verifying, the payment is going to tplaten and isn't a bug in the program you used to submit everything, luke?
(In reply to Jacob Lifshay from comment #45) > just verifying, the payment is going to tplaten and isn't a bug in the > program you used to submit everything, luke? correct. no bug.