Bug 206 - Implement branch prediction
Summary: Implement branch prediction
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: All All
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 22
 
Reported: 2020-03-02 21:09 GMT by Jacob Lifshay
Modified: 2021-05-27 08:53 BST (History)
2 users

See Also:
NLnet milestone: NLnet.2019.02
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 22
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Jacob Lifshay 2020-03-02 21:09:52 GMT
TODO: fill with overview and mention branch prediction, branch target buffers, call-return predictors, loop predictors, etc.

We should err on the side of having a more complex branch predictor due to the huge impact on performance.

To avoid spectre-style data leaks where feasible, we should only send
taken/not-taken data back to the branch predictor once all preceding branches are no longer speculative (tracked by dependency matrix).
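A minimal Python sketch of that deferred-update rule (class and method names are purely illustrative, not from any existing codebase): outcomes are queued per branch and only fed to the predictor once the dependency matrix reports that no preceding branch is still speculative.

```python
class DeferredUpdatePredictor:
    """Predictor table whose training updates are held back until the
    branch (and everything shadowing it) is no longer speculative."""

    def __init__(self):
        self.table = {}    # pc -> last committed taken/not-taken
        self.pending = []  # (pc, taken) awaiting commit, oldest first

    def predict(self, pc):
        # predict from committed state only; speculative outcomes
        # never leak into the tables
        return self.table.get(pc, False)

    def record(self, pc, taken):
        # branch executed, but may still be under a shadow: queue it
        self.pending.append((pc, taken))

    def commit_resolved(self, n):
        # the dependency matrix reports the oldest n branches are now
        # non-speculative: only now may they train the predictor
        for pc, taken in self.pending[:n]:
            self.table[pc] = taken
        del self.pending[:n]
```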

useful info:
https://danluu.com/branch-prediction/
Comment 1 Luke Kenneth Casson Leighton 2020-03-02 21:37:19 GMT
yes good call.  i got branch-decision execution working 6 months ago; it needs a proper decision system.

i really liked the BOOM algorithm, it should be in here:
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-151.pdf

the reason it was good: it adapted very quickly yet still got things
right a lot of the time.
Comment 2 Luke Kenneth Casson Leighton 2020-03-02 21:39:39 GMT
(In reply to Jacob Lifshay from comment #0)

> To avoid spectre-style data leaks where feasible, we should only send
> taken/not-taken data back to the branch predictor once all preceding
> branches are no longer speculative (tracked by dependency matrix).

that's going to be interesting.  all *preceding* branches.  do they
mean branches that are in play whilst *other* branches are still in play?
double-shadowing, in other words?
Comment 3 Jacob Lifshay 2020-03-02 22:01:41 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> (In reply to Jacob Lifshay from comment #0)
> 
> > To avoid spectre-style data leaks where feasible, we should only send
> > taken/not-taken data back to the branch predictor once all preceding
> > branches are no longer speculative (tracked by dependency matrix).
> 
> that's going to be interesting.  all *preceding* branches.  do they
> mean branches that are in play whilst *other* branches are still in play?
> double-shadowing, in other words?

I meant where there could be a branch or trap before the current branch. Once the current branch is known to be executed, only then can it send taken/not-taken info back to the branch predictor. Before then, the branch predictor has to assume that the branch was predicted correctly.

Basically, it all boils down to the branch predictor can't get information from data that was calculated speculatively by the out-of-order engine until it's not speculative anymore.

Alternatively, as is done in BOOM, the branch predictor must revert all visible state to what it was before the first mispredicted branch or trap.

Note that that is different from getting info from instruction fetch (such as: is an instruction a branch or something else), since those are assumed to not change without an icache flush.
Comment 4 Luke Kenneth Casson Leighton 2020-03-02 22:39:07 GMT
(In reply to Jacob Lifshay from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > (In reply to Jacob Lifshay from comment #0)
> > 
> > > To avoid spectre-style data leaks where feasible, we should only send
> > > taken/not-taken data back to the branch predictor once all preceding
> > > branches are no longer speculative (tracked by dependency matrix).
> > 
> > that's going to be interesting.  all *preceding* branches.  do they
> > mean branches that are in play whilst *other* branches are still in play?
> > double-shadowing, in other words?
> 
> I meant where there could be a branch or trap before the current branch.
> Once the current branch is known to be executed, only then can it send
> taken/not-taken info back to the branch predictor.

ok, right.  so this is when the shadow on the current branch is still active. when that shadow is released, the *only* reason it would be dropped is because the result (direction) is then known.

(shadows switch off - prohibit - dangerous side-effect operations such as "write to register" and "write to memory" - without preventing the operation from being *calculated*. thus although execution resources might be wasted by running them then cancelling them, at least no damage can occur)

> Before then, the branch
> predictor has to assume that the branch was predicted correctly.

ok i _think_ things are a little different (simpler) here, because the shadow system is easily accessible (a single bit).  if that bit is still active, there's no need for assumptions: you *know* that the branch prediction has not been decided yet.


> Basically, it all boils down to the branch predictor can't get information
> from data that was calculated speculatively by the out-of-order engine until
> it's not speculative anymore.

yes.  ok.  and that's when that branch-shadow flag is dropped (with a corresponding "success or cancel" flag which tells us if the branch was correctly predicted or if it was wrong).

here's the relevant code:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/scoreboard/shadow.py;h=12f20893b5accb78ac69a8c64aff8b97eb47f05c;hb=8ef8a90c823cd7edb4d85c8099e9a912adbe4ea4#l98

you can see there the inputs on what's expected to occur (branch THEN
rather than ELSE) and, on a later cycle, another two flags
indicating whether the branch went on the THEN or the ELSE.


> Alternatively, as is done in BOOM, the branch predictor must revert all
> visible state to what it was before the first mispredicted branch or trap.

the augmented 6600 doesn't work like that: it's a lot simpler.  okok it *looks* like "reversion" but actually it's much more straightforward: when a shadow is enabled, "writes" are prohibited whilst "calculation" is allowed.

that's really all there is to it.  you can think of it as "reversion" but that's a misleading misnomer, because due to the separation between the "write commit" and "calculation" phases, there's nothing to actually "revert".

no changes hit the register file until that shadow decision says "yes go ahead".

no writes to memory hit the memory subsystem until that shadow decision says "yes go ahead".

and the shadow decision *only* gets made if there is a HUNDRED PERCENT guarantee that the write is allowed to proceed, i.e. that there is no
*need* for "reversion".

so it... completely *avoids* reversion by stacking up the intermediate
data in the Function Unit "result" latches.

[the "register bypass" mechanism allows those results to be used as inputs
to other FUs so that there's no hold-up due to speculative chained computations]


> Note that that is different from getting info from instruction fetch (such
> as: is an instruction a branch or something else), since those are assumed
> to not change without an icache flush.

it's reasonable and normal to assume that the icache is read-only, we're
not going to support self-modifying code.  it's too much :)
Comment 5 Jacob Lifshay 2020-03-02 23:02:14 GMT
(In reply to Luke Kenneth Casson Leighton from comment #4)
> ok i _think_ things are a little different (simpler) here, because the
> shadow system is easily accessible (a single bit).  if that bit is still
> active, there's no need for assumptions: you *know* that the branch
> prediction has not been decided yet.

Some branch prediction algorithms have a global history buffer where every branch is recorded and, in order for the predictor to function properly (stay synchronized), there has to be some value entered into the global history for each branch that's fetched even if those branches are not known to be correct yet.
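A sketch of how such a global history could stay synchronized under speculation, using a per-branch checkpoint that is restored on mispredict (all names illustrative):

```python
class GlobalHistory:
    """Global history shift register with per-branch checkpoints, so a
    mispredict can restore the history to what it was at fetch time."""

    def __init__(self, bits=10):
        self.bits = bits
        self.history = 0
        self.checkpoints = {}  # branch id -> history value at fetch

    def fetch_branch(self, branch_id, predicted_taken):
        # a value must enter the history for every fetched branch,
        # even though the outcome is still speculative
        self.checkpoints[branch_id] = self.history
        self.history = (((self.history << 1) | int(predicted_taken))
                        & ((1 << self.bits) - 1))

    def mispredict(self, branch_id, actual_taken):
        # roll history back to the checkpoint taken at fetch time,
        # then insert the real outcome
        self.history = self.checkpoints[branch_id]
        self.history = (((self.history << 1) | int(actual_taken))
                        & ((1 << self.bits) - 1))
```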

> 
> > Basically, it all boils down to the branch predictor can't get information
> > from data that was calculated speculatively by the out-of-order engine until
> > it's not speculative anymore.
> 
> yes.  ok.  and that's when that branch-shadow flag is dropped (with a
> corresponding "success or cancel" flag which tells us if the branch was
> correctly predicted or if it was wrong).
> 
> here's the relevant code:
> https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/scoreboard/shadow.py;
> h=12f20893b5accb78ac69a8c64aff8b97eb47f05c;
> hb=8ef8a90c823cd7edb4d85c8099e9a912adbe4ea4#l98
> 
> you can see there, the inputs on what's expected to occur (branch THEN
> rather than ELSE) and on a later cycle, there are another two flags
> indicating if the branch went on the THEN or the ELSE).
> 
> 
> > Alternatively, as is done in BOOM, the branch predictor must revert all
> > visible state to what it was before the first mispredicted branch or trap.
> 
> the augmented 6600 doesn't work like that: it's a lot simpler.  okok it
> *looks* like "reversion" but actually it's much more straightforward: when a
> shadow is enabled, "writes" are prohibited whilst "calculation" is allowed.

It is actually reversion since the branch predictor is part of the instruction fetch pipeline -- before the 6600 OoO engine has a chance to affect it. The branch predictor (assuming we don't use a really simplistic one) updates its state every cycle -- even before it knows the fetched instruction words. When there is a mispredicted branch, we need to reset the branch predictor to the state it had when that branch was fetched so we can execute the proper sequence of branch predictor steps.

> 
> it's reasonable and normal to assume that the icache is read-only, we're
> not going to support self-modifying code.  it's too much :)

As long as we do support dynamically loading new code by executing the icache sync instruction -- the Vulkan driver (as well as Linux itself) critically depends on that to be able to execute JIT compiled code.
Comment 6 Luke Kenneth Casson Leighton 2020-03-02 23:30:57 GMT
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > ok i _think_ things are a little different (simpler) here, because the
> > shadow system is easily accessible (a single bit).  if that bit is still
> > active, there's no need for assumptions: you *know* that the branch
> > prediction has not been decided yet.
> 
> Some branch prediction algorithms have a global history buffer where every
> branch is recorded and, in order for the predictor to function properly
> (stay synchronized), there has to be some value entered into the global
> history for each branch that's fetched even if those branches are not known
> to be correct yet.

eurrgh ok, got it.  blegh :)

and that's where the prediction-attacks come into play, hence the article.

> It is actually reversion since the branch predictor is part of the
> instruction fetch pipeline -- before the 6600 OoO engine has a chance to
> affect it. The branch predictor (assuming we don't use a really simplistic
> one) updates its state every cycle -- even before it knows the fetched
> instruction words. When there is a mispredicted branch, we need to reset the
> branch predictor to the state it had when that branch was fetched so we can
> execute the proper sequence of branch predictor steps.

now i'm lost.  and need to understand.

the Branch Function Unit is very simple: it keeps a record of the "alternative"
path Program Counter.  if the branch "fails" then all that happens is that
the shadow "cancel" is called (which destroys all the operations that either
haven't yet completed or were in "waiting to write" mode), the PC will be set
to that "alternative" address, and instruction issue continues from that point.

this gives the *appearance* of reversion, rather than "undoing damaged
register files and memory locations by way of 'reverting' *actual*
register writes etc.", because those writes were *never permitted
to take place in the first place*.

trying to "keep track" of that, in a separate ring-buffer (for example),
where the completion (or otherwise) allows for "reversion" by way of
dropping the top items on the buffer...

mmm... i can't say i'm keen on that idea.

we need help, here.


> > it's reasonable and normal to assume that the icache is read-only, we're
> > not going to support self-modifying code.  it's too much :)
> 
> As long as we do support dynamically loading new code by executing the
> icache sync instruction -- the Vulkan driver (as well as Linux itself)
> critically depends on that to be able to execute JIT compiled code.

good catch.  a quick search online shows what isync is / used for:
https://www.ibm.com/developerworks/systems/articles/powerpc.html
Comment 7 Jacob Lifshay 2020-03-03 01:32:10 GMT
(In reply to Luke Kenneth Casson Leighton from comment #6)
> now i'm lost.  and need to understand.
> 
> <snip>
> 
> this gives the *appearance* of reversion, rather than "undoing damaged
> register files and memory locations by way of quotes reverting quotes
> *actual* register writes etc." because those writes were *never permitted
> to take place in the first place*.

(second time writing this all, my text editor crashed on my phone)

What's reverted is the branch predictor's logical internal state (as well as the fetch PC and branch target buffer), not all the registers and memory locations that are written to.

Oversimplified Diagram:

Branch Predictor/BTB  <----+
   |                       |
   v                       |
Fetch <--------------------+
   |                       |
   v                       |
Decode <-------------------+
   |                       |
   v                       |
Issue Queue <--------------+
   |    ^                  |
   v    |                  |
6600 Scheduler <-----------+
   |     ^   ^             |
   |     |   |             |
   +--> ALU  |             |
   |         |             |
   +--> Load/Store         |
   |                       |
   +--> Branch Execution --+

The branch predictor is logically separate from the branch execution unit -- the branch predictor is part of the fetch pipeline, while the branch execution unit is after the 6600 OoO engine has a chance to schedule and manage the instructions, which can only happen after the fetch pipeline gives the OoO engine the required instructions.

Basically, the 6600 OoO engine can only manage speculative execution of the instructions it's in charge of, where the fetch pipeline is in charge of instructions that haven't yet reached the OoO engine, so the fetch pipeline has to be able to cancel instructions that were erroneously fetched. Part of canceling instructions is reverting the state of the branch predictor such that those erroneously fetched instructions don't show up in the branch predictor's logical state and the correct result of branch execution shows up instead.

Hopefully this all makes more sense.
Comment 8 Luke Kenneth Casson Leighton 2020-03-03 11:34:21 GMT
(In reply to Jacob Lifshay from comment #7)

> (second time writing this all, my text editor crashed on my phone)

bleh. pain when that happens

> What's reverted is the branch predictor's logical internal state (as well as
> the fetch PC and branch target buffer), not all the registers and memory
> locations that are written to.

ok got the terminology. with you.

> Oversimplified Diagram:
> 
> Branch Predictor/BTB  <----+
>    |                       |
>    v                       |
> Fetch <--------------------+
>    |                       |
>    v                       |
> Decode <-------------------+
>    |                       |
>    v                       |
> Issue Queue <--------------+
>    |    ^                  |
>    v    |                  |
> 6600 Scheduler <-----------+
>    |     ^   ^             |
>    |     |   |             |
>    +--> ALU  |             |
>    |         |             |
>    +--> Load/Store         |
>    |                       |
>    +--> Branch Execution --+
> 
> The branch predictor is logically separate from the branch execution unit --
> the branch predictor is part of the fetch pipeline, while the branch
> execution unit is after the 6600 OoO engine has a chance to schedule and
> manage the instructions, which can only happen after the fetch pipeline
> gives the OoO engine the required instructions.
> 
> Basically, the 6600 OoO engine can only manage speculative execution of the
> instructions it's in charge of, where the fetch pipeline is in charge of
> instructions that haven't yet reached the OoO engine, so the fetch pipeline
> has to be able to cancel instructions that were erroneously fetched. Part of
> canceling instructions is reverting the state of the branch predictor such
> that those erroneously fetched instructions don't show up in the branch
> predictor's logical state and the correct result of branch execution shows
> up instead.
> 
> Hopefully this all makes more sense.

it reminds me of the vehicle simulator i did in 1994. SpecManager. it's sold on ebay still! i was amazed to find it there :)

there i created a series of "states" representing the vehicle: position, gear, speed, and from that computed acceleration, fuel economy etc. as a separate struct.

chains of these were calculated, giving a "forward look" into the future.  it was, obviously, possible to run that multiple times, say by saying "downshift" in one chain, and "stay in gear" on another.

after several chains were done (for short distances) the "best" was selected and the rest discarded.

this is the fundamental basis apparently of "game theory" although i only found that out after 18 months work on it.

i suspect we need something similar: a set of tables which are separate from each other: one that is for KNOWN program pathing (path-ing) and one (or more, if we have multiple Branch Units) for currently-unknown (speculative) paths.

on correct selection of a path the statistics collected from that path are merged into the "KNOWN" table(s).

if that path information is actually needed during fetch (even if it is speculative) then a "merge" function can take the data from both tables and make decisions.

hm i am going to reach out to comp.arch to see what people think there.  there is likely a real simple elegant "thing" which decades of collective experience knows of immediately.
Comment 9 Luke Kenneth Casson Leighton 2020-03-03 12:08:26 GMT
https://groups.google.com/forum/?nomobile=true#!topic/comp.arch/1OvTXelf5TE

do pitch in, there, i've crossref'd.  comp.arch people are fun and really knowledgeable.
Comment 10 Luke Kenneth Casson Leighton 2020-03-04 00:26:10 GMT
https://www.youtube.com/watch?v=yk-U6qqeGE0

i love this guy :) he talks so fast, and makes it clear that you use the first few bits of the address as the index into the Branch Lookup SRAM.

the entries are:
* the full address (so that you know that you are matching the correct address, just like a Page Table Entry)
* the last branch address
* the prediction yes/no

so it is actually very similar to Page Tables, which is quite interesting.
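A rough Python model of that page-table-like lookup (field names are made up for illustration): index with the low PC bits, then compare the stored full tag to catch aliasing.

```python
def btb_lookup(btb, pc, index_bits=6):
    """Branch lookup table indexed by the low bits of the PC; the stored
    full address acts as a tag, compared to confirm we matched the
    correct branch, just like a Page Table Entry."""
    index = (pc >> 2) & ((1 << index_bits) - 1)  # word-aligned PCs
    entry = btb[index]
    if entry is not None and entry["tag"] == pc:
        return entry["target"], entry["taken"]
    return None, False  # miss, or an aliased entry for another branch
```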

will keep looking for more stuff
Comment 11 Luke Kenneth Casson Leighton 2020-03-04 00:37:40 GMT
https://www.youtube.com/watch?v=5d2VW0yO-xE

this one is also short, and a little echoey; it explains why 2-bit prediction tables are used rather than 1-bit.

basically you keep track of the last *two* predictions.  i am not sure exactly what the FSM is for, yet, though.
Comment 12 Luke Kenneth Casson Leighton 2020-03-04 15:49:34 GMT
ha, found some code that implements 1-, 2- and 4-bit FSMs, the last being a "correlator".  more on comp.arch
Comment 14 Luke Kenneth Casson Leighton 2020-03-04 16:32:54 GMT
idea: include VL in the branch hash lookup.  different VL gives different prediction entry.
Comment 15 Luke Kenneth Casson Leighton 2020-03-04 20:11:52 GMT
https://m.youtube.com/watch?v=wsllybIHICw
another useful video, this one on the 2-bit FSM.
basically it's, "if match, increment counter, if not-match decrement
counter, limit counter to range 0b11 to 0b00" and that's it.
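That counter rule, as a quick Python sketch (using the common taken/not-taken formulation; function names are illustrative):

```python
def update_2bit(counter, taken):
    """2-bit saturating counter: increment on taken, decrement on
    not-taken, clamped to the range 0b00..0b11."""
    if taken:
        return min(counter + 1, 0b11)
    return max(counter - 1, 0b00)

def predict_2bit(counter):
    # top bit decides: 0b10/0b11 predict taken, 0b00/0b01 not-taken,
    # so a single misprediction in a loop doesn't flip the prediction
    return counter >= 0b10
```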
Comment 18 Luke Kenneth Casson Leighton 2020-03-04 22:58:18 GMT
http://web.engr.oregonstate.edu/~benl/Projects/branch_pred/#l10
Comment 19 Luke Kenneth Casson Leighton 2020-03-06 01:31:47 GMT
https://github.com/patc15/mipscpu/blob/master/branch_history_table.v#L101

is this code (the block at line 101) *undoing* a mispredicted branch?

that means the SRAM has to be dual-ported, and if we do dual-issue it'll be *quad* ported.

eek
Comment 20 Jacob Lifshay 2020-03-06 01:38:54 GMT
(In reply to Luke Kenneth Casson Leighton from comment #19)
> that means the SRAM has to be dual-ported, and if we do dual-issue it'll be
> *quad* ported.
> 
> eek

I didn't look, but all we need to do is steal one of the ports while correcting mispredictions -- those are rare enough that slowing down fetch while fixing up the tables is acceptable.
Comment 21 Jacob Lifshay 2020-03-06 08:29:45 GMT
I created an archive of the comp.arch thread on branch prediction which also contains the discussion around my idea for spectre-proof speculative execution:

http://bugs.libre-riscv.org/show_bug.cgi?id=209#c2
Comment 22 Luke Kenneth Casson Leighton 2020-03-06 09:35:15 GMT
(In reply to Jacob Lifshay from comment #20)
> (In reply to Luke Kenneth Casson Leighton from comment #19)
> > that means the SRAM has to be dual-ported, and if we do dual-issue it'll be
> > *quad* ported.
> > 
> > eek
> 
> I didn't look, but all we need to do is steal one of the ports while
> correcting mispredictions -- those are rare enough that slowing down fetch
> while fixing up the tables is acceptable.

yeah that makes sense.  it still slightly concerns me about multi-issue,
although actually what matters is the number of branches per window.
so if there are short loops (branch, calc, branch, calc) then quad-issue
only needs *dual*-ported RAM, not quad-ported.
Comment 23 Luke Kenneth Casson Leighton 2020-03-06 10:27:39 GMT
https://github.com/patc15/mipscpu/blob/master/branch_history_table.v

this looks pretty straightforward.
Comment 24 Luke Kenneth Casson Leighton 2020-03-07 09:45:59 GMT
https://arxiv.org/abs/2002.03568 no source code though
Comment 25 Luke Kenneth Casson Leighton 2020-03-07 16:00:25 GMT
from ericp, comp.arch:

FYI I stumbled across this masters thesis the other day.
It gives a detailed design for a RISC-V front end with branch
prediction, including wiring diagrams and handshake signals,
with performance simulation results.

Design of the frontend for LEN5,
a RISC-V Out-of-Order processor, 2018
https://webthesis.biblio.polito.it/13198/1/tesi.pdf

and what appears to be the SystemVerilog for the project:

http://webthesis.biblio.polito.it/13198/2/allegati.zip
Comment 26 Luke Kenneth Casson Leighton 2020-03-07 17:39:11 GMT
got it:

    def __init__(self, width):
        self.opcode_in = Signal(width, reset_less=False)
        self.df = DecodeFields(SignalBitRange, [self.opcode_in])
        self.df.create_specs()
        self.x_s = Signal(len(self.df.FormX.S), reset_less=True)
        self.x_sh = Signal(len(self.df.FormX.SH), reset_less=True)
        self.dq_xs_s = Signal(len(self.df.FormDQ.SX_S), reset_less=True)

    def elaborate(self, platform):
        m = Module()
        comb = m.d.comb
        comb += self.x_s.eq(self.df.FormX.S[0])
        comb += self.x_sh.eq(self.df.FormX.SH[0:-1])
        comb += self.dq_xs_s.eq(self.df.FormDQ.SX_S[0:-1])

so the field encodings are picked up from fields.txt, processed in a class
that records the ranges, and creates "pseudo"-accessors which allow the
bits to be indexed 0,1,2,3 rather than having to memorise 20,22,23,24
or refer, repeatedly, to the PDF manual.
Comment 27 Luke Kenneth Casson Leighton 2020-03-12 00:13:46 GMT
from comp.arch TODO get link to preserve attribution

P
Paul A. Clayton
to
2 hours agoDetails
On Wednesday, March 11, 2020 at 4:31:56 PM UTC-4, Stefan Monnier wrote:
> > the difference between a global history table and a local history table:
> > a global table takes some sort of hash (or lru) of the entire PC, whilst the
> > local table will take say the 10 LSBs?
>
> IIRC the different is that a local history table is indexed by (a hash
> of) the PC of the branch instruction, whereas the global history table
> is indexed by (a hash of) the trace of branch decisions taken before
> reaching the branch instruction.
>
> So a 10bit index could be respectively:
> - for an LHT, the 10 LSb of the address of the branch instruction
> - for an GHT, the direction (taken/not taken) of the last 10 branch
>   instructions executed.
>
> Then again, maybe I'm just confused.

Traditional gshare global tables are indexed by an XORing of
branch address bits (typically just the least significant
bits of the branch address) and a taken/not-taken global history
string. More recent advances have hashed larger history strings
and used what is called path history (information from addresses
of branches) rather than just direction (taken/not-taken) history
with more complex hashing of the index.
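The classic gshare index described above, as a one-function Python sketch (parameter names are illustrative):

```python
def gshare_index(pc, global_history, table_bits=10):
    """Classic gshare: XOR the low bits of the branch address with the
    taken/not-taken global history string to index the counter table."""
    mask = (1 << table_bits) - 1
    return ((pc >> 2) ^ global_history) & mask  # word-aligned PCs
```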

TAGE-based predictors use partial tags to reduce aliasing and
(I suspect) accelerate training and use GEometric history
lengths (which provides fast training and the ability to use
very long history strings).

(One idea I would like to test — I doubt it would be helpful —
is using bimode-like storage for a TAGE with different predictions
using different slots. This would effectively require doubling
look-up width, but with overlaid skewed associativity the
problem of capacity bias would be reduced. I did try XORing a
tag bit with the prediction or the hysteresis bit for a per
address prediction and it turned out that constructive/non-
destructive aliasing is more common than I supposed [but the
test also did not filter not-taken branches using a BTB]. I
also suspect that dynamic sizing of TAGE tables for various
lengths could also be beneficial.)

Most local history predictors in current use (excluding, e.g.,
that in the Alpha 210264) only use a saturating counter per
address rather than tracking local history strings and
predicting based on those either with a separate table or a
hardwired algorithm.
Comment 28 Jacob Lifshay 2020-03-12 00:19:41 GMT
(In reply to Luke Kenneth Casson Leighton from comment #27)
> from comp.arch TODO get link to preserve attribution

I manually archived the usenet thread as it was a few days ago, see comment #21
Comment 29 Jacob Lifshay 2020-09-05 00:04:36 BST
change nlnet milestone to match parent bug's milestone
Comment 30 Jacob Lifshay 2021-05-27 03:45:54 BST
I found a very interesting paper on the effects of modern branch predictors on the performance of interpreters:
Branch Prediction and the Performance of Interpreters -
Don’t Trust Folklore
https://hal.inria.fr/hal-01100647/document

They tested a few Intel microarchitectures, as well as the combination of TAGE and ITTAGE (which we very much should implement if we want decent cpu performance).

Important takeaway: we *need* a good indirect-branch predictor (ITTAGE is a good one). They can reduce branch mis-predictions to around 1-2 per thousand instructions in stereotypical interpreter inner loops -- waay better than I expected.
Comment 31 Luke Kenneth Casson Leighton 2021-05-27 08:53:13 BST
one nice thing: SVSTATE can be hashed into the tables in order to
change the branch prediction behaviour, particularly on the last
occurrence of a loop.