Bug 397 - design and discuss user-tag flags in wishbone to provide phase 1 / 2 "speculative" memory accesses
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Mac OS
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 383
Reported: 2020-06-21 15:52 BST by Luke Kenneth Casson Leighton
Modified: 2020-07-27 17:22 BST (History)
2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-06-21 15:52:16 BST
(see https://bugs.libre-soc.org/show_bug.cgi?id=393#c17)

yehowshua whilst you are thinking how to answer the question (about how
to fulfil the requirement of providing a memory system - general open question -
that connects to PortInterface) i thought it important to remind you that we
are doing a speculative-capable design.

in an earlier message i described the requirements, but did not receive a response.
i will therefore reiterate them here as the top-level comment of this bug report.

the requirements are - and this is not optional - that memory requests be
subdivided into two phases:

1) checking whether the request *CAN* be completed - WITHOUT EXCEPTIONS - if
   it were to be permitted to proceed

2) allowing the memory request to proceed.

note: it is absolutely guaranteed that there will be more than one such request
at phase (1) outstanding in any given cycle.

if we do not have this two-phase system, where multiple Phase (1) requests
can be outstanding, we will be forced to fall back to *single* LOAD/STORE
operations.

and that means that performance will suck.
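
to make the two phases concrete, here is a minimal nmigen sketch of a
phase-capable port at the signal level.  every name here (addr_ok_o,
go_i, and so on) is an illustrative guess, *not* the actual PortInterface:

    # hypothetical two-phase memory request port (nmigen sketch)
    from nmigen import Elaboratable, Module, Signal

    class TwoPhasePort(Elaboratable):
        def __init__(self, addrwid=48, datawid=64):
            # phase (1): master presents an address and asks
            # "CAN this succeed, without exceptions?"
            self.req_i      = Signal()         # phase (1) request valid
            self.addr_i     = Signal(addrwid)  # requested address
            self.addr_ok_o  = Signal()         # guaranteed to succeed
            self.addr_exc_o = Signal()         # would fault: do NOT proceed
            # phase (2): master commits.  failure past this point is
            # a contract violation
            self.go_i       = Signal()         # proceed with checked access
            self.data_o     = Signal(datawid)  # LD result
            self.done_o     = Signal()         # phase (2) completed

        def elaborate(self, platform):
            m = Module()
            # the actual checking (address-decode, TLB, cache lookup)
            # goes here.  crucially, *several* of these ports may have
            # phase (1) outstanding in the same cycle.
            return m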


now.

given these requirements, can you see - can you understand and appreciate - that
designing a "simple" wishbone-based system is guaranteed not to be useful?

this is *by design*, part of the simple variants of the wishbone protocol.

this is because inherently it is AUTOMATICALLY assumed, by the wishbone protocol,
that a write request - data plus address - will either:

a) complete atomically OR
b) fail atomically and
c) it is CATEGORICALLY IMPOSSIBLE to request anything else.

to reiterate:

there is *nowhere in the protocol* that allows us to communicate phase (1).

i.e. there is nowhere in the wishbone protocol that allows us to say "you need
to TELL us if this write request will either complete atomically or fail atomically
***BUT WITHOUT ACTUALLY PERFORMING THE WRITE***"

therefore, spending time designing L1 caches - especially ones that use the
"simple" wishbone protocol - is *not* what we need.


now, i have been looking at the wishbone spec B4, page 51, illustration 3-11,
and it *might* be possible for us to add a master-side STALL_O signal
(as a TGA - tag-address bit) to achieve Phase (1) / (2) discernment:

(1) CLOCK CYCLE 1 - MASTER presents:
    - ADR_O = A0
    - TGA_O = valid and stall

(2) CLOCK CYCLE 2 - SLAVE presents either:
    - TGD_I - valid  OR
    - raises ERR_I if the address is invalid

once Phase (1) is complete (the 6600 engine knows that the memory request
is *GUARANTEED* to succeed) it can drop the "stall" TGA_O bit and the
memory request *MUST* then succeed.  if the SLAVE then raises ERR_I we
need to *halt* the processor.
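
a rough nmigen sketch of the master side of that exchange.  the 2-bit
TGA encoding (bit 0 = valid, bit 1 = "stall") is an assumption invented
purely for illustration:

    from nmigen import Elaboratable, Module, Signal

    class SpecWBMaster(Elaboratable):
        def __init__(self, awid=30):
            self.adr_o  = Signal(awid)
            self.cyc_o  = Signal()
            self.stb_o  = Signal()
            self.tga_o  = Signal(2)  # bit 0: valid, bit 1: "stall"
            self.ack_i  = Signal()
            self.err_i  = Signal()
            self.start  = Signal()   # begin a phase (1) probe
            self.commit = Signal()   # 6600 engine says "GO": phase (2)
            self.halt_o = Signal()   # ERR_I after commit: halt processor

        def elaborate(self, platform):
            m = Module()
            with m.FSM():
                with m.State("IDLE"):
                    with m.If(self.start):
                        m.d.sync += self.tga_o.eq(0b11)  # valid + stall
                        m.next = "PHASE1"
                with m.State("PHASE1"):
                    m.d.comb += [self.cyc_o.eq(1), self.stb_o.eq(1)]
                    with m.If(self.err_i):
                        m.next = "IDLE"    # address invalid: exception
                    with m.Elif(self.ack_i):
                        m.next = "WAITGO"  # phase (1) guaranteed-ok
                with m.State("WAITGO"):
                    # CYC_O stays high: the bus is locked up here,
                    # which is exactly the problem described below
                    m.d.comb += self.cyc_o.eq(1)
                    with m.If(self.commit):
                        m.d.sync += self.tga_o.eq(0b01)  # drop "stall"
                        m.next = "PHASE2"
                with m.State("PHASE2"):
                    m.d.comb += [self.cyc_o.eq(1), self.stb_o.eq(1)]
                    with m.If(self.err_i):
                        m.d.comb += self.halt_o.eq(1)  # contract violated
                    with m.Elif(self.ack_i):
                        m.next = "IDLE"
            return m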

the problem with this protocol is that it only supports one single "Phase (1)"
request.  the entire Wishbone Bus is dedicated and locked up, dealing with
that one request.

therefore whilst it illustrates the issue, it's impractical (i.e. useless).


i have a sneaking feeling that we are going to have to design something that
allows state information to be communicated:


1) MASTER presents:
    - ADR_O = A0
    - TGA_O = valid, and stall, *and* an "identifier" (the LDST unit ID)

2) SLAVE acknowledges (storing the request in an internal buffer, but
   *also* beginning processing - determining the cascade through L1, L2, TLB
   and MMU and so on)

3) MASTER presents:
    - ADR_O = A1
    - TGA_O with a tag indicating **different** LDST unit

4) SLAVE acknowledges and buffers just as with (2)

5) MASTER presents:
    - ADR_O = A0
    - TGA_O = "request for updated status on progress of determining if Addr is OK"

6) SLAVE acknowledges and confirms, for the request with address A0 and
   this LDST unit ID, that *if* the MASTER were to request that the
   operation take place, it *would* succeed 100%.

7) MASTER presents:
    - ADR_O = A0
    - TGA_O = "request to proceed atomically"

8) SLAVE acknowledges

9) MASTER presents:
    - DAT_O = D0

10) SLAVE acknowledges with "data has been written".
   it also empties the buffer containing the ID.

in all it is quite a complex protocol, and i really cannot see how it can be
avoided.
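
here is one possible TGA_O field layout for that buffered protocol: a
command plus the LDST unit ID, so that several phase (1) requests can be
in flight and individually queried.  the layout is made up, for
illustration only:

    from enum import Enum
    from nmigen.hdl.rec import Record

    class LDSTCmd(Enum):
        PROBE   = 0  # steps 1/3: check this address, do NOT proceed
        QUERY   = 1  # step 5: status of an earlier probe?
        PROCEED = 2  # step 7: go ahead atomically: failure is now fatal

    class LDSTTag(Record):
        def __init__(self, id_wid=4):
            super().__init__([("cmd",     2),        # an LDSTCmd value
                              ("unit_id", id_wid)])  # which LDST unit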


the "Phase 1" parts involve knowledge of the addresses associated with the
peripherals: in the case of Virtual Memory this will be *dynamic* information.

the thing is: we even need this for misaligned as well as atomic operations,
and also for 128-bit atomic writes over 64-bit buses.
Comment 1 Yehowshua 2020-06-21 21:51:05 BST
This makes sense.
I went through the LDST docs on the wiki yesterday too and learned as much.

Wishbone would have to be the very bottom of the stack, after we have some LDST operation that we know *must* be executed.

Clearly, wishbone is inappropriate for L0.
Comment 2 Jacob Lifshay 2020-06-21 21:56:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> the requirements are - and this is not optional - that memory requests be
> subdivided into two phases:
> 
> 1) checking whether the request *CAN* be completed - WITHOUT EXCEPTIONS - if
>    it were to be permitted to proceed
> 
> 2) allowing the memory request to proceed.

There are some peripherals that only error after proceeding past the point where it's impossible to cancel a request, so our memory request state diagram needs to properly handle that, without hard-locking the CPU or creating an imprecise interrupt.

This would involve waiting until non-speculative for non-cacheable memory addresses, and potentially serializing operations. Cacheable memory that is known to be in the cache can be speculatively read and non-speculatively written in parallel, no serialization required (unless using memory fences/atomic ops). Cacheable memory that is in the cache is also known not to cause memory exceptions (assuming MMU translation and checking has already been done, and ignoring ECC cache memory failures).
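
those rules, written out as a purely illustrative python decision
function (the argument names are mine, not anything in the codebase):

    def may_proceed(op, speculative, cacheable, in_cache, fence):
        """can a LD/ST start now?  returns (proceed, serialize)."""
        if not cacheable:
            # may error past the point of no return: wait until
            # non-speculative, and serialize
            return (not speculative, True)
        if fence:
            # fences/atomic ops reinstate serialization
            return (not speculative, True)
        if in_cache:
            # resident line: no memory exception possible (MMU checks
            # already done); reads may even be speculative
            if op == "LD":
                return (True, False)
            return (not speculative, False)  # writes: non-speculative
        # miss: simplification - treat like non-cacheable until the
        # line is resolved
        return (not speculative, True)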
Comment 3 Luke Kenneth Casson Leighton 2020-06-21 22:05:58 BST
(In reply to Yehowshua from comment #1)
> This makes sense.
> I went through the LDST docs on the wiki yesterday too and learned as much.

i put the bit about contracts here:
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/discussion/
 
> Wishbone would have to be the very bottom of the stack after we have some
> LDST operation that we know **must be executed.

*deep breath* - i thought so, too.  unfortunately, the discrepancy between
the two types of contracts - "here's my offer, take it or leave it" and
"here's an offer, please take all the time you need to think about it" -
is so fundamental, so diametrically incompatible, that we can't even
use wishbone - as-is - at the lowest level!

or.. we can... as long as that usage is absolutely guaranteed one hundred
percent NEVER to fail or raise any kind of error.

in other words, you may *only* use "take-it-or-leave-it" contracts (buses
such as Wishbone) for the **TAKE-IT** part, because the actual Contract
of Sale clearly states that things have moved *into* the "TAKE IT" phase.


> Clearly, wishbone is inappropriate for L0.

well... an *augmented* version of wishbone is appropriate (one that
obeys the standard "Contract of Sale" outlined above)

or, we can use standard-Wishbone for "take it" (i.e. if it is guaranteed
that there will be no errors).

we will still actually need to keep that error capability; however, if
such an error does occur (at the low levels), its status is raised to
"catastrophic contract violation" and we halt the processor or fire
a "severe NMI hard-fault" trap condition.
Comment 4 Luke Kenneth Casson Leighton 2020-06-21 22:11:08 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > the requirements are - and this is not optional - that memory requests be
> > subdivided into two phases:
> > 
> > 1) checking whether the request *CAN* be completed - WITHOUT EXCEPTIONS - if
> >    it were to be permitted to proceed
> > 
> > 2) allowing the memory request to proceed.
> 
> There are some peripherals where they only error after proceeding to the
> point where it's impossible to cancel a request, so, our memory request
> state diagram needs to properly handle that, without just hard-locking the
> CPU or creating an inprecise interrupt.

yes.  POWER architecture recognises that these peripherals exist, and puts
them into the "atomic" category.  there's a section on them, somewhere.

this in turn (entirely) holds up all subsequently-issued LD/STs,
preventing them from proceeding beyond the "GO_ADDR" phase (the
computation of the Effective Address).


> This would involve waiting until non-speculative for non-cachable memory
> addresses and potentially serializing operations.

correct.  once the (effectively atomic) LD/ST has proceeded past its
"take-it-or-leave-it" contract, further "speculative" contracts may
proceed in parallel.

(see https://libre-soc.org/3d_gpu/architecture/6600scoreboard/discussion/
for explanation of the contract terminology)
Comment 5 Luke Kenneth Casson Leighton 2020-06-22 00:36:07 BST
i'm reading the wb4 spec, and the table in section 3.1.6 shows the 4 types of user-defined signals permitted to be added.  or, more to the point: if added, they must be "tagged" in the datasheet and must also respect the timing protocol associated with that tag.

however none of these 4 tag types perfectly fit the "shadow" system aka "standard contract of sale".

we may have to do this:

* define a "cycle" tagged signal that indicates that the bus is to follow "shadow" protocol.  this is raised and held for the whole of CYC_O

* at that point, the slave can assume that all operations are implicitly under "shadow" conditions and that it must wait for "success or fail".

* the address will be sent as normal for a read

* however the slave MUST wait for the master to raise a DAT_O tag (despite this being a read) of EITHER success or fail (GO_DIE).

i am inclined to recommend that the slave be required to raise STALL_I at this point, until either success or fail is raised.

btw that fail is synonymous with RST.  it is the same thing.  however i do not know at this point if it is a bit drastic to do a full RST.

alternatively we could simply specify that if cyc is dropped when STALL_I is raised this is equivalent to "GO_DIE".

success on the other hand is simple enough. 

remember - irritatingly - we cannot pass shadow itself through because it does not fit *any* of the 4 tag types.

unless of course we simply define that it is permitted to be.

this would be easier and fit better.

more thought required.
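
in the meantime, a sketch of what the extra "shadow" signals might look
like.  the names and the exact tag assignments are guesses, not a
finished spec:

    from nmigen import Signal

    class ShadowWBSignals:
        def __init__(self):
            # CYC-type tag: held for the whole of CYC_O, declares that
            # this entire cycle follows "shadow" protocol
            self.tgc_shadow_o = Signal()
            # slave stalls until the master announces success or fail
            self.stall_i      = Signal()
            # DAT_O-type tags raised by the master (even on a read):
            self.tgd_go_o     = Signal()  # "success": commit
            self.tgd_die_o    = Signal()  # "GO_DIE": cancel
            # alternative to tgd_die_o: master drops CYC_O while
            # stall_i is raised, which the slave treats as GO_DIE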
Comment 6 Luke Kenneth Casson Leighton 2020-06-22 23:41:31 BST
right.  another thought occurred to me.

1. peripherals have to be done as "take it or leave it" style wishbone access.

2. main memory (DRAM) also falls into this category.  note: that's not *cached* memory, it's *actual* memory (via SDRAM wishbone controller or LiteDRAM etc)

3. it is only the *processor* that needs to perform these speculative style "house contract of sale" requests.

4. therefore we *are* actually free and at liberty to design and use an internal bus architecture, which L0, L1, L2 and TLB and MMU understand, that respects the "house contract of sale" interface, this being an internal protocol.

5. however when interfacing to *peripherals* we must treat them as atomic and can use the take-it-or-leave-it protocol, falling back to single blocking operations and thus safely use wishbone.

6. as far as memory (DRAM) is concerned, as long as *batches* are respected (batches of LD requests that do not overlap with batches of STs), and once we have determined that the addresses of all batches are valid, these LD-only or ST-only batches can be done in any order, at any width.

in addition: given that we are only doing a single core, we have only one access route to memory to worry about.

we are also not going to put VM in... yet.

now, the discerning factor which tells us the difference between memory and peripherals is: the address.

and it is the address that we need to check first at the "house contract Phase 1".

this is incredibly simple:

* is the address in range of real DRAM, yes or no?  if yes, we ASSUME, reasonably, that when it proceeds to Phase 2 it will succeed.

after that point we *CAN* in fact use minerva for accessing DRAM because it is guaranteed to succeed.  errors however are promoted to "catastrophic".

for peripherals, these fall back to atomic blocking operations so we can *still* use minerva LoadStoreInterface however errors are straight exceptions rather than catastrophic.

for peripherals the L1 cache must also be bypassed because you have to actually do the read and actually do the write.  this is slow and is what DMA is for, but hey.
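
for reference, the phase-1 "is it DRAM?" check really is just a range
compare.  the base/size values here are placeholders:

    from nmigen import Module, Signal

    def dram_decoder(m, addr, base=0x40000000, size=0x10000000):
        # comb logic: 1 if addr hits real DRAM (phase 1 passes, phase 2
        # assumed to succeed), 0 for peripherals (fall back to atomic
        # blocking wishbone, where errors are straight exceptions)
        is_dram = Signal()
        m.d.comb += is_dram.eq((addr >= base) & (addr < base + size))
        return is_dram

    m = Module()
    addr = Signal(32)
    is_dram = dram_decoder(m, addr)  # feed into phase-1 decision logic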