TODO: fill with overview and mention branch prediction, branch target buffers, call-return predictors, loop predictors, etc. We should err on the side of having a more complex branch predictor due to the huge impact on performance. To avoid spectre-style data leaks where feasible, we should only send taken/not-taken data back to the branch predictor once all preceding branches are no longer speculative (tracked by dependency matrix). useful info: https://danluu.com/branch-prediction/
yes good call. i got branch-decision execution working (6 months ago); it needs a proper decision system. i really liked the BOOM algorithm, it should be in here: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-151.pdf the reason it was good is that it adapted very quickly yet still got things right a lot of the time.
(In reply to Jacob Lifshay from comment #0) > To avoid spectre-style data leaks where feasible, we should only send > taken/not-taken data back to the branch predictor once all preceding > branches are no longer speculative (tracked by dependency matrix). that's going to be interesting. all *preceding* branches. do they mean branches that are in play whilst *other* branches are still in play? double-shadowing, in other words?
(In reply to Luke Kenneth Casson Leighton from comment #2) > (In reply to Jacob Lifshay from comment #0) > > > To avoid spectre-style data leaks where feasible, we should only send > > taken/not-taken data back to the branch predictor once all preceding > > branches are no longer speculative (tracked by dependency matrix). > > that's going to be interesting. all *preceding* branches. do they > mean branches that are in play whilst *other* branches are still in play? > double-shadowing, in other words? I meant where there could be a branch or trap before the current branch. Once the current branch is known to be executed, only then can it send taken/not-taken info back to the branch predictor. Before then, the branch predictor has to assume that the branch was predicted correctly. Basically, it all boils down to the branch predictor can't get information from data that was calculated speculatively by the out-of-order engine until it's not speculative anymore. Alternatively, as is done in BOOM, the branch predictor must revert all visible state to what it was before the first mispredicted branch or trap. Note that that is different from getting info from instruction fetch (such as: is an instruction a branch or something else), since those are assumed to not change without an icache flush.
(In reply to Jacob Lifshay from comment #3) > (In reply to Luke Kenneth Casson Leighton from comment #2) > > (In reply to Jacob Lifshay from comment #0) > > > > > To avoid spectre-style data leaks where feasible, we should only send > > > taken/not-taken data back to the branch predictor once all preceding > > > branches are no longer speculative (tracked by dependency matrix). > > > > that's going to be interesting. all *preceding* branches. do they > > mean branches that are in play whilst *other* branches are still in play? > > double-shadowing, in other words? > > I meant where there could be a branch or trap before the current branch. > Once the current branch is known to be executed, only then can it send > taken/not-taken info back to the branch predictor. ok, right. so this is when the shadow on the current branch is still active. when that shadow is released, the *only* reason it would be dropped is because the result (direction) is then known. (shadows switch off - prohibit - dangerous side-effect operations such as "write to register" and "write to memory" - without preventing the operation from being *calculated*. thus although execution resources might be wasted by running them then cancelling them, at least no damage can occur) > Before then, the branch > predictor has to assume that the branch was predicted correctly. ok i _think_ things are a little different (simpler) here, because the shadow system is easily accessible (a single bit). if that bit is still active, there's no need for assumptions: you *know* that the branch prediction has not been decided yet. > Basically, it all boils down to the branch predictor can't get information > from data that was calculated speculatively by the out-of-order engine until > it's not speculative anymore. yes. ok. and that's when that branch-shadow flag is dropped (with a corresponding "success or cancel" flag which tells us if the branch was correctly predicted or if it was wrong). 
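the shadow write-gating described above can be sketched in a few lines of python. this is a toy model only, with made-up names -- it is not the actual soc.git shadow.py interface:

```python
# Toy model of shadow write-gating: a result may be *computed* while
# shadowed, but the "write" side-effect is held back until every shadowing
# branch has resolved successfully.  All names here are illustrative.

class ShadowedOp:
    def __init__(self, result):
        self.result = result        # speculatively computed value
        self.shadows = set()        # ids of branches shadowing this op

    def add_shadow(self, branch_id):
        self.shadows.add(branch_id)

    def resolve(self, branch_id, success):
        """A shadowing branch resolved: cancel on mispredict, otherwise
        release the write once no shadows remain."""
        self.shadows.discard(branch_id)
        if not success:
            return "cancel"         # result discarded: no damage done
        return "write" if not self.shadows else "hold"
```

the key property is visible in the return values: "hold" wastes nothing but time, "cancel" never touches the register file, and "write" is only reachable once every shadow bit has been dropped with "success".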
here's the relevant code: https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/scoreboard/shadow.py;h=12f20893b5accb78ac69a8c64aff8b97eb47f05c;hb=8ef8a90c823cd7edb4d85c8099e9a912adbe4ea4#l98 you can see there the inputs on what's expected to occur (branch THEN rather than ELSE), and on a later cycle another two flags indicating whether the branch went on the THEN or the ELSE. > Alternatively, as is done in BOOM, the branch predictor must revert all > visible state to what it was before the first mispredicted branch or trap. the augmented 6600 doesn't work like that: it's a lot simpler. okok it *looks* like "reversion" but actually it's much more straightforward: when a shadow is enabled, "writes" are prohibited whilst "calculation" is allowed. that's really all there is to it. you can think of it as "reversion" but that's a misleading misnomer, because due to the separation between the "write commit" and "calculation" phases, there's nothing to actually "revert". no changes hit the register file until that shadow decision says "yes go ahead". no writes hit memory until that shadow decision says "yes go ahead". and the shadow decision *only* gets made if there is a HUNDRED PERCENT guarantee that the write is allowed to proceed, i.e. that there is no *need* for "reversion". so it... completely *avoids* reversion by stacking up the intermediate data in the Function Unit "result" latches. [the "register bypass" mechanism allows those results to be used as inputs to other FUs so that there's no hold-up due to speculative chained computations] > Note that that is different from getting info from instruction fetch (such > as: is an instruction a branch or something else), since those are assumed > to not change without an icache flush. it's reasonable and normal to assume that the icache is read-only, we're not going to support self-modifying code. it's too much :)
(In reply to Luke Kenneth Casson Leighton from comment #4) > ok i _think_ things are a little different (simpler) here, because the > shadow system is easily accessible (a single bit). if that bit is still > active, there's no need for assumptions: you *know* that the branch > prediction has not been decided yet. Some branch prediction algorithms have a global history buffer where every branch is recorded and, in order for the predictor to function properly (stay synchronized), there has to be some value entered into the global history for each branch that's fetched even if those branches are not known to be correct yet. > > > Basically, it all boils down to the branch predictor can't get information > > from data that was calculated speculatively by the out-of-order engine until > > it's not speculative anymore. > > yes. ok. and that's when that branch-shadow flag is dropped (with a > corresponding "success or cancel" flag which tells us if the branch was > correctly predicted or if it was wrong). > > here's the relevant code: > https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/scoreboard/shadow.py; > h=12f20893b5accb78ac69a8c64aff8b97eb47f05c; > hb=8ef8a90c823cd7edb4d85c8099e9a912adbe4ea4#l98 > > you can see there, the inputs on what's expected to occur (branch THEN > rather than ELSE) and on a later cycle, there are another two flags > indicating if the branch went on the THEN or the ELSE). > > > > Alternatively, as is done in BOOM, the branch predictor must revert all > > visible state to what it was before the first mispredicted branch or trap. > > the augmented 6600 doesn't work like that: it's a lot simpler. okok it > *looks* like "reversion" but actually it's much more straightforward: when a > shadow is enabled, "writes" are prohibited whilst "calculation" is allowed. It is actually reversion since the branch predictor is part of the instruction fetch pipeline -- before the 6600 OoO engine has a chance to affect it. 
The branch predictor (assuming we don't use a really simplistic one) updates its state every cycle -- even before it knows the fetched instruction words. When there is a mispredicted branch, we need to reset the branch predictor to the state it had when that branch was fetched, so we can execute the proper sequence of branch predictor steps. > > it's reasonable and normal to assume that the icache is read-only, we're > not going to support self-modifying code. it's too much :) As long as we do support dynamically loading new code by executing the icache sync instruction -- the Vulkan driver (as well as Linux itself) critically depends on that to be able to execute JIT-compiled code.
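to make the "reset the predictor to the state it had when that branch was fetched" point concrete, here is a toy checkpoint/restore of a speculatively-updated global history register. the names (GHR, tags) are illustrative, not from any existing codebase:

```python
# Sketch of predictor-state reversion: the global history register (GHR)
# is updated speculatively at fetch time with the *predicted* outcome.
# On a mispredict we restore the snapshot taken when that branch was
# fetched, then shift in the *actual* outcome.

GHR_BITS = 10
MASK = (1 << GHR_BITS) - 1

class GlobalHistory:
    def __init__(self):
        self.ghr = 0            # last GHR_BITS taken/not-taken bits
        self.checkpoints = {}   # branch tag -> GHR snapshot at fetch

    def fetch_branch(self, tag, predicted_taken):
        self.checkpoints[tag] = self.ghr          # snapshot before update
        self.ghr = ((self.ghr << 1) | predicted_taken) & MASK

    def resolve(self, tag, actually_taken, mispredicted):
        if mispredicted:
            # revert to the state at fetch, then record the real outcome
            self.ghr = self.checkpoints[tag]
            self.ghr = ((self.ghr << 1) | actually_taken) & MASK
        del self.checkpoints[tag]
```

note that on a correct prediction nothing needs reverting: the speculative update was already right, so the checkpoint is simply discarded.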
(In reply to Jacob Lifshay from comment #5) > (In reply to Luke Kenneth Casson Leighton from comment #4) > > ok i _think_ things are a little different (simpler) here, because the > > shadow system is easily accessible (a single bit). if that bit is still > > active, there's no need for assumptions: you *know* that the branch > > prediction has not been decided yet. > > Some branch prediction algorithms have a global history buffer where every > branch is recorded and, in order for the predictor to function properly > (stay synchronized), there has to be some value entered into the global > history for each branch that's fetched even if those branches are not known > to be correct yet. eurrgh ok, got it. blegh :) and that's where the prediction-attacks come into play, hence the article. > It is actually reversion since the branch predictor is part of the > instruction fetch pipeline -- before the 6600 OoO engine has a chance to > affect it. The branch predictor (assuming we don't use a really simplistic > one) updates its state every cycle -- even before it knows the fetched > instruction words. When there is a mispredicted branch, we need to reset the > branch predictor to the state it had when that branch was fetched so we can > execute the proper sequence of branch predictor steps. now i'm lost. and need to understand. the Branch Function Unit is very simple: it keeps a record of the "alternative" path Program Counter. if the branch "fails" then all that happens is that the shadow "cancel" is called (which destroys all the operations that either haven't yet completed or were in "waiting to write" mode), the PC will be set to that "alternative" address, and instruction issue continues from that point. this gives the *appearance* of reversion, rather than "undoing damaged register files and memory locations by way of quotes reverting quotes *actual* register writes etc." because those writes were *never permitted to take place in the first place*. 
trying to "keep track" of that, in a separate ring-buffer (for example), where the completion (or otherwise) allows for "reversion" by way of dropping the top items on the buffer... mmm... i can't say i'm keen on that idea. we need help, here. > > it's reasonable and normal to assume that the icache is read-only, we're > > not going to support self-modifying code. it's too much :) > > As long as we do support dynamically loading new code by executing the > icache sync instruction -- the Vulkan driver (as well as Linux itself) > critically depends on that to be able to execute JIT compiled code. good catch. a quick search online shows what isync is and what it's used for: https://www.ibm.com/developerworks/systems/articles/powerpc.html
(In reply to Luke Kenneth Casson Leighton from comment #6)
> now i'm lost. and need to understand.
>
> <snip>
>
> this gives the *appearance* of reversion, rather than "undoing damaged
> register files and memory locations by way of quotes reverting quotes
> *actual* register writes etc." because those writes were *never permitted
> to take place in the first place*.

(second time writing this all, my text editor crashed on my phone)

What's reverted is the branch predictor's logical internal state (as well as the fetch PC and branch target buffer), not all the registers and memory locations that are written to.

Oversimplified Diagram:

Branch Predictor/BTB <----+
  |                       |
  v                       |
Fetch <-------------------+
  |                       |
  v                       |
Decode <------------------+
  |                       |
  v                       |
Issue Queue <-------------+
  |  ^                    |
  v  |                    |
6600 Scheduler <----------+
  |  ^  ^                 |
  |  |  |                 |
  +--> ALU                |
  |     |                 |
  +--> Load/Store         |
  |                       |
  +--> Branch Execution --+

The branch predictor is logically separate from the branch execution unit -- the branch predictor is part of the fetch pipeline, while the branch execution unit is after the 6600 OoO engine has a chance to schedule and manage the instructions, which can only happen after the fetch pipeline gives the OoO engine the required instructions.

Basically, the 6600 OoO engine can only manage speculative execution of the instructions it's in charge of, where the fetch pipeline is in charge of instructions that haven't yet reached the OoO engine, so the fetch pipeline has to be able to cancel instructions that were erroneously fetched. Part of canceling instructions is reverting the state of the branch predictor such that those erroneously fetched instructions don't show up in the branch predictor's logical state and the correct result of branch execution shows up instead.

Hopefully this all makes more sense.
(In reply to Jacob Lifshay from comment #7)

> (second time writing this all, my text editor crashed on my phone)

bleh. pain when that happens

> What's reverted is the branch predictor's logical internal state (as well as
> the fetch PC and branch target buffer), not all the registers and memory
> locations that are written to.

ok got the terminology. with you.

> Oversimplified Diagram:
>
> Branch Predictor/BTB <----+
>   |                       |
>   v                       |
> Fetch <-------------------+
>   |                       |
>   v                       |
> Decode <------------------+
>   |                       |
>   v                       |
> Issue Queue <-------------+
>   |  ^                    |
>   v  |                    |
> 6600 Scheduler <----------+
>   |  ^  ^                 |
>   |  |  |                 |
>   +--> ALU                |
>   |     |                 |
>   +--> Load/Store         |
>   |                       |
>   +--> Branch Execution --+
>
> The branch predictor is logically separate from the branch execution unit --
> the branch predictor is part of the fetch pipeline, while the branch
> execution unit is after the 6600 OoO engine has a chance to schedule and
> manage the instructions, which can only happen after the fetch pipeline
> gives the OoO engine the required instructions.
>
> Basically, the 6600 OoO engine can only manage speculative execution of the
> instructions it's in charge of, where the fetch pipeline is in charge of
> instructions that haven't yet reached the OoO engine, so the fetch pipeline
> has to be able to cancel instructions that were erroneously fetched. Part of
> canceling instructions is reverting the state of the branch predictor such
> that those erroneously fetched instructions don't show up in the branch
> predictor's logical state and the correct result of branch execution shows
> up instead.
>
> Hopefully this all makes more sense.

it reminds me of the vehicle simulator i did in 1994. SpecManager. it's sold on ebay still! i was amazed to find it there :) there i created a series of "states" representing the vehicle: position, gear, speed, and from that computed acceleration, fuel economy etc as a separate struct. chains of these were calculated, giving a "forward look" into the future.
it was, obviously, possible to run that multiple times, say by saying "downshift" in one chain, and "stay in gear" on another. after several chains were done (for short distances) the "best" was selected and the rest discarded. this is apparently the fundamental basis of "game theory", although i only found that out after 18 months of work on it. i suspect we need something similar: a set of tables which are separate from each other: one for KNOWN program pathing (path-ing) and one (or more, if we have multiple Branch Units) for currently-unknown (speculative) paths. on correct selection of a path, the statistics collected from that path are merged into the "KNOWN" table(s). if that path information is actually needed during fetch (even if it is speculative) then a "merge" function can take the data from both tables and make decisions. hm, i am going to reach out to comp.arch to see what people think there. there is likely a real simple elegant "thing" which decades of collective experience knows of immediately.
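a toy of the KNOWN-table plus speculative-table idea above, just to illustrate the merge-on-commit concept. this is entirely hypothetical, not from any existing codebase:

```python
# Two-table sketch: speculative branch outcomes go in a side table; on
# correct resolution they merge into the KNOWN table, on mispredict the
# side table is simply discarded.  Counters are 2-bit saturating,
# starting at weakly-not-taken (0b01).  All names are illustrative.

class SplitPredictor:
    def __init__(self):
        self.known = {}   # pc -> 2-bit counter, committed history only
        self.spec = {}    # pc -> 2-bit counter, in-flight branches

    def predict(self, pc):
        # merged view at fetch time: speculative info wins if present
        c = self.spec.get(pc, self.known.get(pc, 0b01))
        return c >= 0b10

    def record_speculative(self, pc, taken):
        c = self.spec.get(pc, self.known.get(pc, 0b01))
        self.spec[pc] = min(c + 1, 0b11) if taken else max(c - 1, 0b00)

    def branch_resolved(self, correct):
        if correct:
            self.known.update(self.spec)   # merge into the KNOWN table
        self.spec.clear()                  # either way, side table drains
```

the nice property is that a mispredict costs nothing to "revert": the speculative table is just thrown away, which mirrors the shadow-cancel behaviour of the main engine.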
https://groups.google.com/forum/?nomobile=true#!topic/comp.arch/1OvTXelf5TE do pitch in, there, i've crossref'd. comp.arch people are fun and really knowledgeable.
https://www.youtube.com/watch?v=yk-U6qqeGE0 i love this guy :) he talks so fast, and makes it clear that you use the first few bits of the address as the index into the Branch Lookup SRAM. the entries are: * the full address (so that you know that you are matching the correct address, just like a Page Table Entry) * the last branch address * the prediction yes/no so it is actually very similar to Page Tables, which is quite interesting. will keep looking for more stuff
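the page-table-like lookup described in the video can be sketched directly. the entry fields and sizes here are illustrative:

```python
# Sketch of a direct-mapped BTB: index with the low bits of the PC,
# store the full PC as a tag (like a Page Table Entry) so a hit is only
# declared on an exact address match, plus the target address and a
# prediction bit.  INDEX_BITS and field names are made up for the sketch.

INDEX_BITS = 6

class BTB:
    def __init__(self):
        self.entries = [None] * (1 << INDEX_BITS)

    def lookup(self, pc):
        e = self.entries[pc & ((1 << INDEX_BITS) - 1)]
        if e is not None and e["tag"] == pc:   # tag match: a real hit
            return e["target"], e["taken"]
        return None                            # miss: just fall through

    def update(self, pc, target, taken):
        self.entries[pc & ((1 << INDEX_BITS) - 1)] = {
            "tag": pc, "target": target, "taken": taken}
```

two PCs that share the same low bits alias to the same slot, which is exactly why the full-address tag is needed: without it the fetch unit would happily redirect to another branch's target.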
https://www.youtube.com/watch?v=5d2VW0yO-xE this one is also short, and a little echoey; it explains why 2-bit prediction tables are used rather than 1-bit. basically you keep track of the last *two* predictions. i am not sure exactly what the FSM is for, yet, though.
ha, found some code that implements 1-, 2- and 4-bit FSMs, the last being a "correlator". more on comp.arch
A survey of techniques for dynamic branch prediction, 2018 https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.4666 https://www.researchgate.net/profile/Sparsh_Mittal/publication/324166804_A_Survey_of_Techniques_for_Dynamic_Branch_Prediction/links/5adfe5b8458515c60f63cf62/A-Survey-of-Techniques-for-Dynamic-Branch-Prediction.pdf
idea: include VL in the branch hash lookup. different VL gives different prediction entry.
https://m.youtube.com/watch?v=wsllybIHICw another useful video, this one on the 2-bit FSM. basically it's: if the branch is taken, increment the counter; if not taken, decrement it; limit the counter to the range 0b00 to 0b11. and that's it.
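the whole FSM fits in two functions. a minimal sketch of the 2-bit saturating counter as described in the video:

```python
# 2-bit saturating counter: counter stays in 0b00..0b11, the MSB is the
# prediction.  The point of 2 bits is hysteresis: a single mispredict
# does not flip a strongly-established prediction.

def predict(counter):
    return counter >= 0b10              # MSB set -> predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 0b11)   # saturate at strongly-taken
    return max(counter - 1, 0b00)       # saturate at strongly-not-taken
```

starting from strongly-taken (0b11), one not-taken outcome drops to 0b10 which still predicts taken; it takes a second miss to change the prediction, which is exactly why 2-bit tables beat 1-bit ones on loop exits.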
https://git.libre-riscv.org/?p=shakti-core.git;a=blob;f=src/core/branchpredictor.bsv;h=d3b21a28a5b4f97392724869b84d1bae8332a2be;hb=0e7c323de747032d2cb01f7d1eecebcc53e42dc2
https://github.com/riscv-boom/riscv-boom/blob/master/src/main/scala/bpu/bpd/gshare/gshare.scala
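the core of gshare (as in the BOOM file above, though the real gshare.scala is considerably more involved) is just the index calculation. a sketch, with an assumed table size:

```python
# Minimal gshare index: XOR the branch PC's low bits with the global
# taken/not-taken history so that the same branch reached along different
# paths lands in different counter slots.  TABLE_BITS is an assumption.

TABLE_BITS = 10

def gshare_index(pc, ghr):
    # drop the low 2 bits of the PC (4-byte-aligned instructions)
    return ((pc >> 2) ^ ghr) & ((1 << TABLE_BITS) - 1)
```

the index then selects one of the 2-bit saturating counters in the prediction table; the XOR is what makes the predictor "global" rather than purely per-address.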
http://web.engr.oregonstate.edu/~benl/Projects/branch_pred/#l10
https://github.com/patc15/mipscpu/blob/master/branch_history_table.v#L101 is this code (the block at line 101) *undoing* a mispredicted branch? that means the SRAM has to be dual-ported, and if we do dual-issue it'll be *quad*-ported. eek
(In reply to Luke Kenneth Casson Leighton from comment #19) > that means the SRAM has to be dual-ported, and if we do dual-issue it'll be > *quad* ported. > > eek I didn't look, but all we need to do is steal one of the ports while correcting mispredictions -- those are rare enough that slowing down fetch while fixing up the tables is acceptable.
I created an archive of the comp.arch thread on branch prediction which also contains the discussion around my idea for spectre-proof speculative execution: http://bugs.libre-riscv.org/show_bug.cgi?id=209#c2
(In reply to Jacob Lifshay from comment #20) > (In reply to Luke Kenneth Casson Leighton from comment #19) > > that means the SRAM has to be dual-ported, and if we do dual-issue it'll be > > *quad* ported. > > > > eek > > I didn't look, but all we need to do is steal one of the ports while > correcting mispredictions -- those are rare enough that slowing down fetch > while fixing up the tables is acceptable. yeah that makes sense. multi-issue still slightly concerns me, although actually what matters is the number of branches per window. so if there are short loops (branch, calc, branch, calc) then quad-issue only needs *dual*-ported RAM, not quad-ported.
https://github.com/patc15/mipscpu/blob/master/branch_history_table.v this looks pretty straightforward.
https://arxiv.org/abs/2002.03568 no source code though
from ericp, comp.arch: FYI I stumbled across this masters thesis the other day. It gives a detailed design for a RISC-V front end with branch prediction, including wiring diagrams and handshake signals, with performance simulation results. Design of the frontend for LEN5, a RISC-V Out-of-Order processor, 2018 https://webthesis.biblio.polito.it/13198/1/tesi.pdf and what appears to be the SystemVerilog for the project: http://webthesis.biblio.polito.it/13198/2/allegati.zip
got it:

    def __init__(self, width):
        self.opcode_in = Signal(width, reset_less=False)
        self.df = DecodeFields(SignalBitRange, [self.opcode_in])
        self.df.create_specs()
        self.x_s = Signal(len(self.df.FormX.S), reset_less=True)
        self.x_sh = Signal(len(self.df.FormX.SH), reset_less=True)
        self.dq_xs_s = Signal(len(self.df.FormDQ.SX_S), reset_less=True)

    def elaborate(self, platform):
        m = Module()
        comb = m.d.comb
        comb += self.x_s.eq(self.df.FormX.S[0])
        comb += self.x_sh.eq(self.df.FormX.SH[0:-1])
        comb += self.dq_xs_s.eq(self.df.FormDQ.SX_S[0:-1])

so the field encodings are picked up from fields.txt, processed in a class that records the ranges, and creates "pseudo"-accessors which allow the bits to be indexed 0,1,2,3 rather than having to memorise 20,22,23,24 or refer, repeatedly, to the PDF manual
from comp.arch TODO get link to preserve attribution

Paul A. Clayton wrote:
On Wednesday, March 11, 2020 at 4:31:56 PM UTC-4, Stefan Monnier wrote:
> > the difference between a global history table and a local history table:
> > a global table takes some sort of hash (or lru) of the entire PC, whilst the
> > local table will take say the 10 LSBs?
>
> IIRC the difference is that a local history table is indexed by (a hash
> of) the PC of the branch instruction, whereas the global history table
> is indexed by (a hash of) the trace of branch decisions taken before
> reaching the branch instruction.
>
> So a 10bit index could be respectively:
> - for an LHT, the 10 LSb of the address of the branch instruction
> - for a GHT, the direction (taken/not taken) of the last 10 branch
>   instructions executed.
>
> Then again, maybe I'm just confused.

Traditional gshare global tables are indexed by an XORing of branch address bits (typically just the least significant bits of the branch address) and a taken/not-taken global history string. More recent advances have hashed larger history strings and used what is called path history (information from addresses of branches) rather than just direction (taken/not-taken) history, with more complex hashing of the index. TAGE-based predictors use partial tags to reduce aliasing and (I suspect) accelerate training, and use GEometric history lengths (which provides fast training and the ability to use very long history strings). (One idea I would like to test — I doubt it would be helpful — is using bimode-like storage for a TAGE with different predictions using different slots. This would effectively require doubling look-up width, but with overlaid skewed associativity the problem of capacity bias would be reduced.
I did try XORing a tag bit with the prediction or the hysteresis bit for a per-address prediction and it turned out that constructive/non-destructive aliasing is more common than I supposed [but the test also did not filter not-taken branches using a BTB]. I also suspect that dynamic sizing of TAGE tables for various lengths could also be beneficial.) Most local history predictors in current use (excluding, e.g., that in the Alpha 21264) only use a saturating counter per address rather than tracking local history strings and predicting based on those, either with a separate table or a hardwired algorithm.
(In reply to Luke Kenneth Casson Leighton from comment #27) > from comp.arch TODO get link to preserve attribution I manually archived the usenet thread as it was a few days ago, see comment #21
change nlnet milestone to match parent bug's milestone
I found a very interesting paper on the effects of modern branch predictors on the performance of interpreters: Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore https://hal.inria.fr/hal-01100647/document They tested a few Intel microarchitectures, as well as the combination of TAGE and ITTAGE (which we very much should implement if we want decent cpu performance). Important takeaway: we *need* a good indirect-branch predictor (ITTAGE is a good one). They can reduce branch mis-predictions to around 1-2 per thousand instructions in stereotypical interpreter inner loops -- waay better than I expected.
one nice thing, SVSTATE can be hashed in to the tables in order to change the branch prediction behaviour, particularly on the last occurrence of a loop.