https://groups.google.com/d/msg/comp.arch/leXwF1j7Z-A/VO99Gd55AwAJ

This is a spectacularly insightful set of optimisations that gives huge improvements in IPC without needing a lot of infrastructure.
from Mitch Alsup ----------------

On Thursday, April 30, 2020 at 11:59:17 AM UTC-5, BGB wrote:
> But, don't know if I am doing well enough at this.
> Did recently encounter some people online who balked at the idea of
> including any VLIW-like aspects in an ISA, having the opinion that VLIW
> is worthless and evil, ...

It would not help their point of view to tell them that microcode is VLIW.

> Apparently, the commercial failure of Itanium has soured some people on
> the possibility that there could be *any* merit in using some
> similar-sounding ideas in small embedded systems; they can't seem to
> help but imagine everything in terms of things trying to compete with
> modern x86 PCs or Xeons...
> The premise that someone might make a VLIW and then hope to have it be
> competitive against lower-end ARM processors or microcontrollers seems
> alien to them.
>
> I guess the question is "does it make sense from a resource-cost POV?",
> e.g., a small VLIW competing in terms of resource cost and performance
> with a 2-wide in-order superscalar or similar.
>
> My guess is that a 2-wide superscalar could be doable on an FPGA, but
> may involve adding another pipeline stage, and probably adding a bit to
> branch latency (since now the PC advance can't be figured out until a
> later stage in the pipeline, such as ID2 or similar).

When one notices that, in general, we have::
a) 1 branch every 5 instructions, and 1 taken branch every 7.1 instructions
b) 1 load every 5 instructions
c) 1 store every 10 instructions

AND one has an instruction buffer where one FETCHes 4 instructions (wide) and DECODEs out of that buffer::
a) branches do not deliver a result to the RF (rather to the IP)
b) stores do not write a result

So one should be able to get about 1.3 IPC from a 3R1W register file (minus latencies and dependencies). 1.4 IPC is about what a 2-wide superscalar IO machine achieves. So, we can get 80% of the gain of a 2-wide over a 1-wide by simply CoIssuing BRs and STs with other instructions!

Now, given that we are FETCHing 4-wide, and we can decode the length of the instruction in 4 gates of delay, we can identify branches in the buffer far in advance of DECODEing the BR, and we have time to FETCH the target so it arrives without delay compared to the BR entering execution.

So, CoIssue buys 30% and lookahead branches buy another 10%; we are within spitting distance of a 2-wide SS machine while paying only a bit more cost than a 1-wide machine--and we have not even added ports to the RF yet!

> Or maybe have the PC advance work conservatively (*) and then use
> additional decoders to deal with the possibility that an operation may
> need to be skipped during ID2?...

That, too, is possible for a 2%-4% gain.

> *: Will normally assume advancing 1 instruction at a time, but then be
> advanced after-the-fact by "how many instructions were actually
> executed". Fetch would then be the current (accurate) PC + 1 or 2
> instructions.
>
> (Dunno; not actually implemented a proper superscalar yet.)

Done in binary, this is medium hard; done in unary it is not.
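
To sanity-check the IPC arithmetic above, here is a quick back-of-the-envelope simulation (my own sketch in Python, not anything from the thread). It draws a random instruction stream from the quoted mix and greedily pairs adjacent instructions whenever at most one of them needs the single RF write port, i.e. branches and stores ride along for free. Read-port pressure, dependencies and latencies are all ignored, so the number it prints is an optimistic ceiling in the same ballpark as the 1.3-1.4 IPC figures above.

    import random

    # Instruction mix from the post: 1 branch per 5 instructions,
    # 1 load per 5, 1 store per 10, the rest plain ALU ops.
    MIX = {"branch": 0.20, "load": 0.20, "store": 0.10, "alu": 0.50}
    WRITES_RF = {"load", "alu"}     # branches and stores produce no RF result

    def random_stream(n, rng):
        kinds, weights = zip(*MIX.items())
        return rng.choices(kinds, weights=weights, k=n)

    def cycles_coissue(stream):
        """Greedy co-issue: two adjacent instructions share a cycle as long
        as at most one of them needs the single RF write port (the 1W of
        3R1W).  Read ports, dependencies and latencies are ignored."""
        cycles = 0
        i = 0
        while i < len(stream):
            can_pair = (i + 1 < len(stream) and
                        not (stream[i] in WRITES_RF and stream[i + 1] in WRITES_RF))
            i += 2 if can_pair else 1
            cycles += 1
        return cycles

    s = random_stream(200_000, random.Random(0))
    print("1-wide baseline IPC : 1.00")
    print("co-issue ceiling IPC: %.2f" % (len(s) / cycles_coissue(s)))

Since two RF-writing instructions can never share the one write port, the hard ceiling for this mix is 10/7, about 1.43 IPC; the greedy pairing lands a little below that, and real dependencies and latencies pull it down toward the ~1.3 the post quotes.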
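
The lookahead-branch idea can be illustrated the same way: because instruction boundaries (and hence opcodes) can be picked out of the 4-wide fetch buffer with only a few gates of delay, the front end can spot a branch and request its target line several cycles before the branch reaches decode or execute. A minimal structural sketch, with invented field names and an invented branch test, purely to show the shape of the scan:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Insn:
        pc: int
        opcode: str                   # e.g. "add", "ld", "st", "br"
        target: Optional[int] = None  # branch target, if visible in the encoding

    def branch_targets(fetch_buffer: List[Insn]) -> List[int]:
        """Scan the 4-wide fetch buffer for branches and return their targets,
        so the I-cache can be asked for those lines now, well before the
        branch itself is decoded."""
        return [i.target for i in fetch_buffer
                if i.opcode == "br" and i.target is not None]

    # Usage: called every cycle on the current fetch buffer contents.
    buf = [Insn(0x100, "add"), Insn(0x104, "ld"),
           Insn(0x108, "br", target=0x200), Insn(0x10C, "st")]
    print([hex(t) for t in branch_targets(buf)])   # ['0x200']

In hardware this is only a handful of gates per buffer slot; the point is simply that the branch is visible early enough for the target to arrive without a bubble, which is where the extra ~10% in the post comes from.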