523 – demo program needed showing register dependencies

Status:	RESOLVED INVALID

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Documentation
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:

Depends on:
Blocks:

Reported:	2020-10-27 21:35 GMT by Luke Kenneth Casson Leighton
Modified:	2020-11-13 14:34 GMT
CC List:	4 users

See Also:	352
NLnet milestone:	NLNet.2019.10.043.Wishbone
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Description Luke Kenneth Casson Leighton 2020-10-27 21:35:33 GMT

a (small! python) program is needed which generates the 6600 table(s) here:
https://libre-soc.org/3d_gpu/architecture/compared_to_register_renaming/

this to be implemented not as an actual simulator but as tracking dependency
hazards between registers, assuming (and tracking) multi-cycle completion,
and outputting the progression as a markdown table.

PowerISA bctr, LDU, STU, ADD and MUL to be included as instructions, to be put
in as a straight list of instructions as opposed to actually trying to
"execute" them as would normally be done in a simulator.

to include - as options:

* ability to change number of read and write ports
* op forwarding to be identified separately
* to treat CTR as being in a completely separate regfile as INT regs
* to be able to specify the completion length of the arithmetic part of
  operations
* to be able to specify the "pre" phases which default to fetch, decode
  but can be extended to e.g. fetch, decode1, decode2
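The requirements above can be sketched roughly as follows. This is a minimal illustration, not the program that was requested or written: the names `Insn`, `schedule` and `to_markdown`, the stage labels `stall`/`exec`/`write`, and the latencies in the example are all assumptions made for this sketch. It only tracks read-after-write hazards with multi-cycle completion; read/write port limits, operand forwarding and write-after-write hazards are not modelled, and a result is assumed to be readable the cycle after it is written.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str          # assembly text, used for display only (never executed)
    writes: tuple      # names of registers written (hypothetical naming)
    reads: tuple       # names of registers read
    latency: int = 1   # cycles spent in the arithmetic part (assumed values)


def schedule(insns, pre=("fetch", "decode")):
    """Return one {cycle: stage} dict per instruction, in program order."""
    ready = {}  # register name -> first cycle its newest value can be read
    rows = []
    for issue_cycle, insn in enumerate(insns):  # one instruction enters per cycle
        row = {}
        cycle = issue_cycle
        for stage in pre:                       # configurable "pre" phases
            row[cycle] = stage
            cycle += 1
        # RAW hazard: wait until every source register has been produced
        start = max([cycle] + [ready.get(r, 0) for r in insn.reads])
        for c in range(cycle, start):
            row[c] = "stall"
        for c in range(start, start + insn.latency):
            row[c] = "exec"
        done = start + insn.latency
        row[done] = "write"
        for r in insn.writes:
            ready[r] = done + 1                 # no forwarding in this sketch
        rows.append(row)
    return rows


def to_markdown(insns, rows):
    """Render the schedule as a markdown table, one column per cycle."""
    ncycles = max(max(row) for row in rows) + 1
    lines = ["|Insn|" + "|".join(str(c) for c in range(ncycles)) + "|",
             "|-|" + "-|" * ncycles]
    for insn, row in zip(insns, rows):
        cells = [row.get(c, "") for c in range(ncycles)]
        lines.append("|" + insn.text + "|" + "|".join(cells) + "|")
    return "\n".join(lines)


# A straight list of instructions; CTR could simply be another register name.
program = [
    Insn("ldu r9, 8(r3)", writes=("r9", "r3"), reads=("r3",), latency=2),
    Insn("add r9, r9, r4", writes=("r9",), reads=("r9", "r4"), latency=1),
    Insn("mul r5, r9, r9", writes=("r5",), reads=("r9",), latency=3),
]
rows = schedule(program)
print(to_markdown(program, rows))
```

Changing `latency` or passing `pre=("fetch", "decode1", "decode2")` regenerates the table, which is the kind of quick iteration the description asks for.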

Comment 1 Jacob Lifshay 2020-10-28 03:16:38 GMT

sync budget

Comment 2 Jacob Lifshay 2020-10-28 04:25:23 GMT

What do you think of writing it in Rust? That would have several benefits:

+ We can pretty easily build an interactive Web version using WebAssembly and Rust's tooling for the web. No server-side stuff would be needed, other than file hosting. I'm thinking type command line arguments into a text box and it dynamically regenerates an html table, nothing too fancy.

+ As a future extension, it could be easily integrated with power-instruction-analyzer to make a full-featured cycle-accurate (not just instruction accurate) simulator, where Rust's speed would be a huge advantage over Python.

+ It would still support command-line markdown table generation

Comment 3 Luke Kenneth Casson Leighton 2020-10-28 12:46:20 GMT

(In reply to Jacob Lifshay from comment #2)
> What do you think of writing it in Rust? That would have several benefits:
> 
> + We can pretty easily build an interactive Web version using WebAssembly
> and Rust's tooling for the web. No server-side stuff would be needed, other
> than file hosting. I'm thinking type command line arguments into a text box
> and it dynamically regenerates an html table, nothing too fancy.

the capabilities are compelling, i was also wondering if it would be beneficial
to do an interactive version.

> + As a future extension, it could be easily integrated with
> power-instruction-analyzer to make a full-featured cycle-accurate (not just
> instruction accurate) simulator, where Rust's speed would be a huge
> advantage over Python.

here's the thing jacob: as we have a low-to-medium priority short-medium requirement (increasing to high over the long-term) to be an "educational resource", rust's complex installation, learning curve and low adoption rate compared to a whopping 30% world-wide adoption and "single-binary-download" for python, it puts itself outside of the running.

imagine that LibreSOC became a study project for a secondary school, or that someone aged 15 with a raspberry pi wanted to learn how it worked.  just the installation process for rust is *so advanced* compared to python's simple installation that they'd literally get nothing done before giving up entirely.

(to give you some idea, there: my brother dan helped his daughter, from age 15, to complete programming assignments by using "online embedded web" versions of python on her chromebook)

python is just "step 1 click-and-download step 2 install step 3 get source step 4 run script" (and for the online web versions it's even simpler than that)

rust is "step 1 click-and-download step 2 install step 3 wait for it to get and build dependencies step 4 get source step 5 compile source step 6 wait for it to get build dependencies step 7 run binary"


back to the technical requirements: cycle-accurate simulators do not take into account the latency or overlap inside of pipelines (which is the object of the exercise, here).  the assumption made by a cycle-accurate simulator is that the instruction "will at least complete in exactly one cycle, transparently and independently of the underlying micro-architecture".

i.e. a cycle-accurate simulator models enough so that binaries have the exact same behaviour and guarantees of hardware *but* that at the end of single-stepping the entire "machine state" - including memory including caches and VM PTEs, is 100% up-to-date (not compromised as it would be by some JIT emulators for example)

consequently, modelling the instruction stages and associated latency is not something that ever goes into a cycle-accurate simulator because doing so massively slows it down and unnecessarily complicates the design.

bottom line is that the code for this exercise (even in rust) would be inappropriate for use as a "true" simulator, which is why i mentioned that it should not actually perform the actual execution of the instructions (or store any state needed to do so).

its sole focus is, then, to generate the tables done by hand.

Comment 4 Jacob Lifshay 2020-10-28 16:43:16 GMT

(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > What do you think of writing it in Rust? That would have several benefits:
> > 
> > + We can pretty easily build an interactive Web version using WebAssembly
> > and Rust's tooling for the web. No server-side stuff would be needed, other
> > than file hosting. I'm thinking type command line arguments into a text box
> > and it dynamically regenerates an html table, nothing too fancy.
> 
> the capabilities are compelling, i was also wondering if it would be
> beneficial
> to do an interactive version.

I was thinking we'd just host the interactive version on our website, making it extremely easy to use.

I was also thinking we'd have a few other CPU models included for comparison purposes, such as the standard register-renamed model, and a simple in-order pipeline.

> > + As a future extension, it could be easily integrated with
> > power-instruction-analyzer to make a full-featured cycle-accurate (not just
> > instruction accurate) simulator, where Rust's speed would be a huge
> > advantage over Python.
> 
> here's the thing jacob: as we have a low-to-medium priority short-medium
> requirement (increasing to high over the long-term) to be an "educational
> resource", rust's complex installation, learning curve and low adoption rate
> compared to a whopping 30% world-wide adoption and "single-binary-download"
> for python, it puts itself outside of the running.
> 
> imagine that LibreSOC became a study project for a secondary school, or that
> someone aged 15 with a raspberry pi wanted to learn how it worked.  just the
> installation process for rust is *so advanced* compared to python's simple
> installation that they'd literally get nothing done before giving up
> entirely.

assuming it's a recent version of Raspbian, "sudo apt install cargo" should work.

> (to give you some idea, there: my brother dan helped his daughter, from age
> 15, to complete programming assignments by using "online embedded web"
> versions of python on her chromebook)

There's an official online version of Rust: https://play.rust-lang.org/
It has the top 100 most popular crates and their dependencies installed.

> python is just "step 1 click-and-download step 2 install step 3 get source
> step 4 run script" (and for the online web versions it's even simpler than
> that)

Well, my housemate spent several hours trying to figure out how to install it on Windows, I had to go help him get pip working.

> rust is "step 1 click-and-download step 2 install step 3 wait for it to get
> and build dependencies step 4 get source step 5 compile source step 6 wait
> for it to get build dependencies step 7 run binary"

Rust is (assuming Windows) step 1 download installer, step 2 run installer following prompts, step 3 download source code, step 4 open command prompt, step 5 run "cargo run", step 6 read output.

If you're on Linux it's even easier (assuming you have binutils already): step 1 run installer following prompts, step 2 download source, step 3 run "cargo run", step 4 read output.

Or, if you like, just run "sudo apt install cargo" (also automatically installs binutils), download source, run "cargo run".

> 
> back to the technical requirements: cycle-accurate simulators do not take
> into account the latency or overlap inside of pipelines (which is the object
> of the exercise, here).  the assumption made by a cycle-accurate simulator
> is that the instruction "will at least complete in exactly one cycle,
> transparently and independently of the underlying micro-architecture".
> 
> i.e. a cycle-accurate simulator models enough so that binaries have the
> exact same behaviour and guarantees of hardware *but* that at the end of
> single-stepping the entire "machine state" - including memory including
> caches and VM PTEs, is 100% up-to-date (not compromised as it would be by
> some JIT emulators for example)
> 
> consequently, modelling the instruction stages and associated latency is not
> something that ever goes into a cycle-accurate simulator because doing so
> massively slows it down and unnecessarily complicates the design.

Well, I think we're using the same term to mean different things.

> bottom line is that the code for this exercise (even in rust) would be
> inappropriate for use as a "true" simulator, which is why i mentioned that
> it should not actually perform the actual execution of the instructions (or
> store any state needed to do so).

I beg to differ: having something that can quickly calculate exactly how fast a certain program (potentially a very large program, on the scale of billions of instructions executed) would run on our processor is hugely valuable. It allows us to do benchmarking to determine which algorithms work best, and to easily determine the performance implications of changing different aspects of our design without having to run a slow gate-level simulation. It also allows testing without the need to actually implement a change in nmigen, since most features are much harder to write in nmigen than in Rust.

Comment 5 Jacob Lifshay 2020-10-29 07:07:43 GMT

started writing a rust program:
https://salsa.debian.org/Kazan-team/power-cpu-sim

Chromium is currently crashing with SIGSEGV when I try to debug the compiled WebAssembly.

I'll work on this more later. If we decide that we don't want a rust program, at least I will have actually tried Rust and WebAssembly.

Comment 6 Luke Kenneth Casson Leighton 2020-10-29 09:09:00 GMT

explanation:

cycle-accurate simulator is defined by convention (initially, i believe, by the academic community) as one where the state is fully up-to-date after even a single clock cycle, one-for-one with the instruction that was executed in that clock cycle.

if the simulator takes 6 cycles to complete an instruction such that after issue and one cycle the state is checked and it has *not* been updated to reflect the changes anticipated by that just-issued instruction then this is NOT a cycle-accurate simulator, it is a HARDWARE (microarchitecturally specific) accurate simulator.

running nmigen pyrtl is such a hardware accurate simulator, and speedups are possible by using cxxsim (eventually) or verilator (most conveniently via litex) right now.

to create a cycle accurate simulator (running linux kernels like spike does) we will need a full software simulation of both a XICS interrupt controller *and* the simulation of PTEs, TLBs and a full RADIX MMU (if we don't have the MMU we have to make heavily significant patches to the ppc64 linux kernel.  that paul's 25+ years experienced team avoided doing that and added an MMU to microwatt instead should give you an indicator there)

again, note that even with an MMU the state needs to be updated immediately in the cycle that the instruction is issued, *not* 5 cycles later.  main memory reads, normally taking 100 cycles again, are *not* simulated, they are, again, assumed to complete immediately.  L1 and L2 caches are *not* simulated in a cycle accurate simulator because these are fully transparent as far as any given individual instruction is concerned.
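The distinction drawn in this comment can be illustrated with a toy sketch. The class names `ImmediateModel` and `LatencyModel` and the 6-step latency are assumptions chosen for illustration; the terminology follows this comment's usage, which comment 4 notes may differ from how others use the same term.

```python
class ImmediateModel:
    """Every issued instruction updates the visible state in the same step."""

    def __init__(self):
        self.regs = {}

    def step(self, dest=None, value=None):
        if dest is not None:
            self.regs[dest] = value


class LatencyModel:
    """A result only becomes visible `latency` steps after it was issued."""

    def __init__(self, latency):
        self.latency = latency
        self.cycle = 0
        self.regs = {}
        self.in_flight = []  # (cycle at which the result lands, dest, value)

    def step(self, dest=None, value=None):
        if dest is not None:
            self.in_flight.append((self.cycle + self.latency, dest, value))
        self.cycle += 1
        landed = [f for f in self.in_flight if f[0] <= self.cycle]
        self.in_flight = [f for f in self.in_flight if f[0] > self.cycle]
        for _, d, v in landed:
            self.regs[d] = v


fast = ImmediateModel()
slow = LatencyModel(latency=6)
fast.step("r9", 42)
slow.step("r9", 42)
# After one step only the immediate model shows the new register value.
after_one = ("r9" in fast.regs, "r9" in slow.regs)
for _ in range(5):
    slow.step()
# Six steps after issue the latency model has caught up.
after_six = "r9" in slow.regs
print(after_one, after_six)
```

In this comment's terms, the first class behaves like a "cycle-accurate" simulator and the second like a "hardware-accurate" one.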

for this exercise we have a very simple and immediate-focus set of requirements

what we need here is a very quick way to be able to continue the conversation we started about Write-after-Write hazards in an on-the-fly iterative fashion.

Comment 7 Jacob Lifshay 2020-10-30 07:09:44 GMT

As a (hopefully workable) compromise, I added support for C/C++ to the code I wrote so we can have the simulator written in C++ and all the code for interfacing with the Web using WebAssembly, formatting the output, and other stuff will be written in Rust.

I added a readme containing build instructions for the web:
https://salsa.debian.org/Kazan-team/power-cpu-sim/-/blob/master/README.md

To build for non-web, you still need to follow the steps up to installing a matching version of clang.

Then, you can just run `cargo build` and the native executable is generated in `target/debug/power-cpu-sim`, or you can just run `cargo run -- <power-cpu-sim's args>`.

All C/C++ files in the c-src directory are automatically detected and built, I told cargo to track them as dependencies.

One issue with C++ is, since we want to support the web without tons of extra code, a lot of the C/C++ standard libraries aren't available. I passed the `-ffreestanding -nostdlibinc` args to clang to ensure we don't accidentally depend on a bunch of code that won't work on the web, since I'm assuming Luke will probably just test the non-web version when working on code.

We should have memory allocation support, but not a whole lot more.

Comment 8 Luke Kenneth Casson Leighton 2020-10-30 11:45:37 GMT

jacob: here's the thing.  the code that you've written makes you the sole exclusive developer.

you and i need to both be able to edit this code, very quickly and try out different experiments, add and try out new features, to check different timings.

simulation results - actual computations, actual values in actual register files - are outside of the scope of this bugreport.

this bugreport is about timing.  it is not to be part of an actual simulation: it's about timing (one of the other options to add is to generate a gtkwave file rather than a mdwn file).

it is intended to be an extremely quick way for you and i to be able to continue the discussion without an 8-hour delay on manually rewriting a hand-crafted table.

if it involves rust it introduces a minimum *two month* delay where i am forced to learn rust, on top of everything else that i'm doing.

c++ is completely outside of the scope of the requirements for this very simple project.

sorry to have to be honest about it: you're writing code that's not needed, right now, and it does not meet the very simple very straightforward and immediate requirements.

it may be *in future* that we use the code that you're writing, however as it is not what is needed or asked for, how can i justify authorising a donation to you, under this bugreport, from NLnet?

Comment 9 Luke Kenneth Casson Leighton 2020-10-30 12:00:40 GMT

btw jacob: i do understand what you're doing.  you want to make it interesting, more challenging, and satisfy an elegance that you feel would be better.

i used to do exactly the same thing, age 25 :)

then i began to realise that other people can't follow or keep up with what i was doing, and the work that i had done was consequently wasted.

we need to work together on removing the barriers to collaboration rather than adding them.

Comment 10 Jacob Lifshay 2020-11-04 06:43:25 GMT

I got a demo version of the WebAssembly program to work, for now it's available at:
https://ftp.libre-soc.org/power-cpu-sim/

You can have fun typing in all the different command line options and seeing the produced output.

I'll work on producing some useful tables next.

Comment 11 Luke Kenneth Casson Leighton 2020-11-04 11:49:53 GMT

jacob.

it takes over a minute to load.

as you add code that startup time is going to get worse and worse.

whilst it is nice to know that web assembly exists and can be done,
and is an interesting exercise and learning experience, it's not useful.

this is getting further down a rabbit hole and further and further
away from a useful tool that aids and assists in the goal of discussing
and trying out different strategies that you and i can talk about.

i'm happy for you that you're learning web assembly however please can
you consider it a "your own time activity" and move the discussion of
web assembly experimentation to libre-soc-misc, so that it does not
distract others (Jean-Paul, Staf in particular) during the critical
lead-up to Dec 2nd?

if retrospectively it turns out to be useful for a Libre-SOC-related
purpose we can come back to it and make sure that you receive a donation
for this learning and research experience.

Comment 12 Jacob Lifshay 2020-11-04 16:25:50 GMT

(In reply to Luke Kenneth Casson Leighton from comment #11)
> jacob.
> 
> it takes over a minute to load.

Probably because it includes a whole markdown renderer/formatter, it was compiled in debug mode producing a 30MB wasm file, the server doesn't serve the file using mime type application/wasm when it should (preventing your web browser from compiling the code as it downloads it, forcing it to download all of it then compile it), and the server doesn't use gzip compression when it could.

I uploaded the release-mode version which is 3MB uncompressed and 1MB gzip compressed. It can probably shrink further if I enable LTO and optimize for size.

(When uploading it over sftp it was going at 30kB/s and took like a minute for the 3MB file, not sure if it's on my end or not...)

Also, it takes <1s for it to load on my phone once the web browser has the wasm file in its cache. <10s if not cached.

> as you add code that startup time is going to get worse and worse.

It probably won't change by a whole lot since most of the code size will be used by all the support libraries that are included (markdown parser/formatter, power-instruction-analyzer, etc.)

> 
> whilst it is nice to know that web assembly exists and can be done,
> and is an interesting exercise and learning experience, it's not useful.

It can easily be compiled natively by running `cargo build` (assuming the dependencies are installed, mostly clang and libc++). It runs at full speed on the Linux command line, no web browser necessary.

I'll add build instructions to the readme.

> this is getting further down a rabbit hole and further and further
> away from a useful tool that aids and assists in the goal of discussing
> and trying out different strategies that you and i can talk about.

Now that I got the wasm stuff working, next I'll be writing the microarchitectural simulation table generator in C++ (probably today). You should be able to work on that part, if you like, since I know you know C++.

> i'm happy for you that you're learning web assembly however please can
> you consider it a "your own time activity" and move the discussion of
> web assembly experimentation to libre-soc-misc, so that it does not
> distract others (Jean-Paul, Staf in particular) during the critical
> lead-up to Dec 2nd?

tried to move to libre-soc-misc, bugzilla didn't accept the CC email address.

Comment 13 Luke Kenneth Casson Leighton 2020-11-04 17:24:22 GMT

(In reply to Jacob Lifshay from comment #12)

> It can easily be compiled natively by running `cargo build`

jacob: it's rust.  to reiterate, because i did not receive any indication
that you've read the comments above: *i cannot take the time to learn rust*
which means *you become the sole exclusive bottleneck for critical research
in this area*.

if i stop absolutely everything that i am doing and take the 2 to 6 months
needed to become proficient at rust *it jeopardises the timeline for the
entire project*.

there are only two of us full-time at the moment.  therefore *we cannot use
rust for this critical research*.

i'm sorry to have to emphasise this: i have repeated this multiple times
throughout this bugreport and multiple times before that: i know you like rust, 
and you want to use it wherever you can, however the other people in the
group *cannot use it*.

once you have completed the work that you are doing - in rust - we have
to *throw it away* and *entirely start again* in a programming language that
is common to everyone.

and it would be completely unfair to request a payment from NLnet for duplicated
work, so the rust code - for which you are the exclusive developer - has to
take second priority.

regarding libre-soc-misc: i mean "move the posting of discussion of
rust and web-assembly off of this bugreport and use direct posting
on libre-soc-misc"

if there is anything that you do not understand or agree with please do say
so and let's discuss it ok?  is there anything above or in any of the comments
that is logically unsound or un-reason-able, that you can see?  or, are there
any alternatives that allow collaborative development on the requirements
in a rapid prototyping fashion on a fast turnaround timescale?

Comment 14 Cole Poirier 2020-11-04 17:32:01 GMT

(In reply to Luke Kenneth Casson Leighton from comment #13)
>
> once you have completed the work that you are doing - in rust - we have
> to *throw it away* and *entirely start again* in a programming language that
> is common to everyone.

I think that Jacob is writing it in C++ specifically so that you can participate. I'm not sure if you saw his comment about that. I think perhaps he's using the rust build system because it's much easier to compile c++ into WASM using that. Correct me if I'm wrong Jacob.

Comment 15 Luke Kenneth Casson Leighton 2020-11-04 18:31:05 GMT

(In reply to Cole Poirier from comment #14)
> (In reply to Luke Kenneth Casson Leighton from comment #13)
> >
> > once you have completed the work that you are doing - in rust - we have
> > to *throw it away* and *entirely start again* in a programming language that
> > is common to everyone.
> 
> I think that Jacob is writing it in C++ specifically so that you can
> participate. I'm not sure if you saw his comment about that. I think perhaps
> he's using the rust build system because it's much easier to compile c++
> into WASM using that. Correct me if I'm wrong Jacob.

well... proposing that *before* going ahead with the actual work is where i
am getting... annoyed (i'm not, because this is a valuable learning exercise.
the annoyance, if any, comes in that this is becoming a major distraction,
where the work could have been done and completed in the time it's taken
to experiment with wasm).

i also wanted to be able to generate gtkwave files and for that we can bring
in Cesar's excellent work (to save time), and haven't been able to get round
to doing so.

we need to save time, here.  not on the speed of execution, the *development*
time.

Comment 16 Cole Poirier 2020-11-04 18:54:04 GMT

(In reply to Luke Kenneth Casson Leighton from comment #15)
>
> well... proposing that *before* going ahead with the actual work is where i
> am getting... annoyed (i'm not, because this is a valuable learning exercise.
> the annoyance, if any, comes in that this is becoming a major distraction,
> where the work could have been done and completed in the time it's taken
> to experiment with wasm).
> 
> i also wanted to be able to generate gtkwave files and for that we can bring
> in Cesar's excellent work (to save time), and haven't been able to get round
> to doing so.
> 
> we need to save time, here.  not on the speed of execution, the *development*
> time.

Yes, I understand.

Comment 17 Luke Kenneth Casson Leighton 2020-11-11 10:32:33 GMT

the purpose of this bugreport was to aid and assist in evaluating register allocation strategies, saving considerable time in the hand-writing (4+ hours) of markdown tables through writing extremely rapid prototype and throw-away one-off python code.

the idea to develop near-gate-level simulations far beyond cycle-accurate simulations is a hundred to a thousand times longer development time than the expected scope of this bugreport, and cannot be considered "documentation"

consequently i am closing it and returning the budget back to the parent "Documentation" bugreport.

Comment 18 Luke Kenneth Casson Leighton 2020-11-11 10:33:38 GMT

Comment 19 Jacob Lifshay 2020-11-12 08:37:02 GMT

(In reply to Luke Kenneth Casson Leighton from comment #17)
> the purpose of this bugreport was to aid and assist in evaluating register
> allocation strategies, saving considerable time in the hand-writing (4+
> hours) of markdown tables through writing extremely rapid prototype and
> throw-away one-off python code.

Ok, I had missed that the whole point was to have throw-away code. I think you are greatly underestimating the complexity of code required to generate those markdown tables currently on the wiki, I'd estimate something around 2-5k lines of python would have been required. It currently has 2.2k lines of C++ (counting files in c-src but not c-src/c++), and it implements the simulator, fetch pipeline, register renamer (I started with that model since I understand it better), and 5 instructions (addi, bdnz, ldu, mtctr, and std).

> the idea to develop near-gate-level simulations far beyond cycle-accurate
> simulations is a hundred to a thousand times longer development time than
> the expected scope of this bugreport, and cannot be considered
> "documentation"

I agree that a simulator would probably not be documentation, however, the additional code required beyond the code needed to generate those markdown tables is not very much, I'd estimate 1.5-2x as much code (if implementing just the same instructions). Therefore, I decided to write the code in such a way that it is feasible to extend later.

> consequently i am closing it and returning the budget back to the parent
> "Documentation" bugreport.

Fine by me, we can figure that out later, if we decide that the code is useful for documentation, or other stuff.

Some demo output (cleaned up slightly):

|Cycle|0|1|2|3|4|5|6|7|8|
|-|-|-|-|-|-|-|-|-|-|
|0x100: mtctr r4||Fetch|Renamed: mtctr h2 \<- h1|||||||
|0x104: ldu r9, 8(r3)||Fetch|Renamed: ldu h3, 8(h0 -> h4)|||||||
|0x108: addi r9 \<- r9, 100||Fetch|Renamed: addi h5 \<- h3, 100|||||||
|0x10c: std r9, 0(r3)||Fetch|Renamed: std h5, 0(h4)|||||||
|0x110: bdnz .L2|||Fetch|Renamed: bdnz h6 \<- h2, .L2||||||
|0x104: ldu r9, 8(r3)||||Fetch|Renamed: ldu h7, 8(h4 -> h8)|||||
|0x108: addi r9 \<- r9, 100||||Fetch|Renamed: addi h9 \<- h7, 100|||||
|0x10c: std r9, 0(r3)||||Fetch|Renamed: std h9, 0(h8)|||||
|0x110: bdnz .L2||||Fetch|Renamed: bdnz h10 \<- h6, .L2|||||
|0x104: ldu r9, 8(r3)|||||Fetch|Renamed: ldu h11, 8(h8 -> h12)||||
|0x108: addi r9 \<- r9, 100|||||Fetch|Renamed: addi h13 \<- h11, 100||||
|0x10c: std r9, 0(r3)|||||Fetch|Renamed: std h13, 0(h12)||||
|0x110: bdnz .L2|||||Fetch|Renamed: bdnz h14 \<- h10, .L2||||
|0x104: ldu r9, 8(r3)||||||Fetch|Renamed: ldu h15, 8(h12 -> h0)|||
|0x108: addi r9 \<- r9, 100||||||Fetch|Renamed: addi h3 \<- h15, 100|||
|0x10c: std r9, 0(r3)||||||Fetch|Renamed: std h3, 0(h0)|||
|0x110: bdnz .L2||||||Fetch|Renamed: bdnz h5 \<- h14, .L2|||
|0x104: ldu r9, 8(r3)|||||||Fetch|Renamed: ldu h2, 8(h0 -> h4)||
|0x108: addi r9 \<- r9, 100|||||||Fetch|Renamed: addi h7 \<- h2, 100||
|0x10c: std r9, 0(r3)|||||||Fetch|Renamed: std h7, 0(h4)||
|0x110: bdnz .L2|||||||Fetch|Renamed: bdnz h9 \<- h5, .L2||
|0x104: ldu r9, 8(r3)||||||||Fetch|Renamed: ldu h6, 8(h4 -> h8)|
|0x108: addi r9 \<- r9, 100||||||||Fetch|Renamed: addi h11 \<- h6, 100|
|0x10c: std r9, 0(r3)||||||||Fetch|Renamed: std h11, 0(h8)|
|0x110: bdnz .L2||||||||Fetch|Renamed: bdnz h13 \<- h9, .L2|

Comment 20 Luke Kenneth Casson Leighton 2020-11-12 12:19:01 GMT

(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #17)
> > the purpose of this bugreport was to aid and assist in evaluating register
> > allocation strategies, saving considerable time in the hand-writing (4+
> > hours) of markdown tables through writing extremely rapid prototype and
> > throw-away one-off python code.
> 
> Ok, I had missed that the whole point was to have throw-away code. 

the purpose is to help speed up the discussion surrounding reg-renaming.
something that takes 10 days to write has in no way sped up the
discussion or allowed us to assess and test alternative strategies on a
one-day cycle.

the moment that code-reuse and additional purposes (such as full 
hardware-level pipeline simulation) are involved, the complexity of the
resultant code jumps massively, and as such can no longer be considered
useful as "documentation".


> I think
> you are greatly underestimating the complexity of code required to generate
> those markdown tables currently on the wiki, I'd estimate something around
> 2-5k lines of python would have been required. 

to do a full hardware-level pipeline simulation?  most likely yes.

to do something that only tracks dependencies and does not even store,
process, or compute the operands?  i would be extremely surprised. i was
expecting to literally use a dictionary for the rename table, a list
for pipelines using push and pop.  reformat the instruction "assembly"
so that it takes 2-3 lines to "process" and so on, but is still
human-readable.
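The dictionary-and-list approach described here might look roughly like the sketch below. It is an illustration under stated assumptions, not code from this bug: `rename`, `run_pipeline`, the `(op, writes, reads)` tuple format, the 16-entry hardware register pool and the initial mapping of r3 and r4 are all hypothetical. Hardware registers are never returned to the free list in this sketch, so it only works for short instruction lists.

```python
from collections import deque


def rename(insns, initial, num_hw=16):
    """Map architectural registers to hardware registers using a plain dict."""
    table = dict(initial)  # architectural register name -> hardware register
    free = deque(h for h in range(num_hw) if h not in table.values())
    out = []
    for op, writes, reads in insns:
        srcs = [table[r] for r in reads]  # look up sources before renaming
        dsts = []
        for r in writes:
            h = free.popleft()            # take the next free hardware register
            table[r] = h                  # later readers see the new mapping
            dsts.append(h)
        out.append((op, dsts, srcs))
    return out


def run_pipeline(insns, depth=3):
    """Push each instruction into a fixed-depth list; pop it `depth` cycles later."""
    pipe = deque([None] * depth)
    retired = []
    feed = list(insns) + [None] * depth   # extra empty cycles drain the pipeline
    for cycle, insn in enumerate(feed):
        pipe.appendleft(insn)             # push into the first stage
        done = pipe.pop()                 # pop from the last stage
        if done is not None:
            retired.append((cycle, done))
    return retired


# Each instruction is (mnemonic, registers written, registers read).
loop = [
    ("mtctr", ("CTR",), ("r4",)),
    ("ldu", ("r9", "r3"), ("r3",)),
    ("addi", ("r9",), ("r9",)),
    ("std", (), ("r9", "r3")),
    ("bdnz", ("CTR",), ("CTR",)),
]
renamed = rename(loop, initial={"r3": 0, "r4": 1})
for op, dsts, srcs in renamed:
    print(op, ["h%d" % h for h in dsts], "<-", ["h%d" % h for h in srcs])
print(run_pipeline([op for op, _, _ in renamed]))
```

With the assumed initial mapping, this first loop iteration comes out with the same hardware register numbers as the first five rows of the demo output in comment 19. That is a property of this sketch only and says nothing about how the C++ code is structured.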

... but this is literally the first time that estimates of time taken
(based on LoC) are being discussed!

this should have been discussed *around comment 3*.

we could then have gone, "hmmm that's not worth it let's just carry on
doing the hand-drawn tables" or "hmm are we sure about that, let's maybe
take the risk".

and made that decision *together*


> It currently has 2.2k lines
> of C++ (counting files in c-src but not c-src/c++), and it implements the
> simulator, fetch pipeline, register renamer (I started with that model since
> I understand it better), and 5 instructions (addi, bdnz, ldu, mtctr, and
> std).
> 
> > the idea to develop near-gate-level simulations far beyond cycle-accurate
> > simulations is a hundred to a thousand times longer development time than
> > the expected scope of this bugreport, and cannot be considered
> > "documentation"
> 
> I agree that a simulator would probably not be documentation, however, the
> additional code required beyond the code needed to generate those markdown
> tables is not very much, I'd estimate 1.5-2x as much code (if implementing
> just the same instructions). Therefore, I decided to write the code in such
> a way that it is feasible to extend later.

which you're only just now mentioning, which means it wasn't discussed
or agreed.

it may be a good idea... but it needed to be discussed.

> > consequently i am closing it and returning the budget back to the parent
> > "Documentation" bugreport.
> 
> Fine by me, we can figure that out later, if we decide that the code is
> useful for documentation, or other stuff.

at 2,500 lines and in c++ it puts it far beyond the reach of even an
above-average 15 to 17-year-old student owning a raspberry pi, and many
Computer Science undergraduates would struggle as well.

anything under the "Documentation" budget *has to itself actually be
documentation*.

again - you went ahead *without discussing it*, and that's really the
key message here.

Comment 21 Luke Kenneth Casson Leighton 2020-11-12 12:26:28 GMT

so.  for future, what would save time and keep us on target would be to
go through the iterative evaluation process, keeping in touch and discussing
the approach to take, before going ahead.

now, i don't mind - at all - "i am going to do this because i want to and
because i will enjoy it" - but i would expect you to *say* that, because
then it can be taken into consideration and everyone knows what's going on
and what to expect.

Comment 22 Jacob Lifshay 2020-11-13 05:32:03 GMT

(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Jacob Lifshay from comment #19)
> > (In reply to Luke Kenneth Casson Leighton from comment #17)
> > > the purpose of this bugreport was to aid and assist in evaluating register
> > > allocation strategies, saving considerable time in the hand-writing (4+
> > > hours) of markdown tables through writing extremely rapid prototype and
> > > throw-away one-off python code.
> > 
> > Ok, I had missed that the whole point was to have throw-away code. 
> 
> the purpose is to help speed up the discussion surrounding reg-renaming.
> something that takes 10 days to write has in no way sped up the
> discussion or allowed us to assess and test alternative strategies on a
> one-day cycle.

Once it is more complete, it should be possible to rapidly modify it to test different algorithms.

> the moment that code-reuse and additional purposes (such as full 
> hardware-level pipeline simulation) is involved, the complexity of the
> resultant code jumps massively, and as such can no longer be considered
> useful as "documentation".

In my opinion it is still effective documentation because it can be used interactively in a web browser (assuming the web server can transmit faster than the 2Mbit/s it's currently stuck at and/or enables support for gzip compression). Once more parts of the algorithms can be adjusted via command line parameters, it should be quite useful. This is similar to using circuitjs to document circuits -- people look at the UI rather than read the source code.

> ... but this is literally the first time that estimates of time taken
> (based on LoC) are being discussed!
> 
> this should have been discussed *around comment 3*.

Yeah, I get that now.

Comment 23 Cesar Strauss 2020-11-13 10:47:38 GMT

I suggest also comparing hazard avoidance performance in a Tomasulo-based architecture.

By the Tomasulo transformation, I guess, it should be equivalent to the Augmented 6600 Scoreboard (https://libre-soc.org/3d_gpu/architecture/tomasulo_transformation/)

I found interesting examples of multi-issue Tomasulo architectures at https://www.brown.edu/Departments/Engineering/Courses/En164/Tomasulo_10.pdf
(I just added it to the Resource page).

It has some execution traces showing hazard avoidance. It also shows how register renaming occurs implicitly in the reservation stations (in the second example, it happens in the reorder buffer instead).

Comment 24 Luke Kenneth Casson Leighton 2020-11-13 13:04:05 GMT

(In reply to Cesar Strauss from comment #23)
> I suggest also comparing hazard avoidance performance in a Tomasulo-based
> architecture.

ah good idea.

> By the Tomasulo transformation, I guess, it should be equivalent to the
> Augmented 6600 Scoreboard
> (https://libre-soc.org/3d_gpu/architecture/tomasulo_transformation/)

yes except i realised that this page needs updating to include WaW renaming.  making it way more complex. sigh.


> I found interesting examples of multi-issue Tomasulo architectures at
> https://www.brown.edu/Departments/Engineering/Courses/En164/Tomasulo_10.pdf
> (I just added it to the Resource page).

thank you, it is very useful.


> It has some execution traces showing hazard avoidance. It also shows how
> register renaming occurs implicitly in the reservation stations (in the
> second example, it happens in the reorder buffer instead).


"The Reorder Buffer: The ROB is a small multi-ported SRAM that holds the results of completed computation from the execution units until it is safe to commit them to the architectural register file."

this isn't quite accurate: it's a multi-ported CAM which is disastrously expensive: multiple banks (one per port) of XOR gates per row, all of which can fire in every cycle.
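
To put a rough number on that cost, here is a back-of-envelope sketch. It assumes the simplest CAM structure, in which every search port compares its tag against every row with one XOR gate per tag bit; the function name `cam_xor_gates` and the sizes used are purely illustrative and do not describe any particular design.

```python
def cam_xor_gates(rows, ports, tag_bits):
    """XOR gates needed if each port compares its tag against every row,
    bit by bit (assumed structure: one bank of comparators per port)."""
    return rows * ports * tag_bits

# Illustrative sizes only: 32 rows, 4 search ports, 6-bit tags.
print(cam_xor_gates(rows=32, ports=4, tag_bits=6))  # 768
```

The count grows with the product of rows and ports, and every one of those comparators can switch in every cycle.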

also the multi-issue buses mentioned? contention on the CDBs? lotta fun, very costly.  mind you the same cost exists in scoreboards, it's just that the OpFwd Bus is separated from the reg read/write Buses, whereas with the CDB(s) they are rolled into one ("Common").

it is still a good idea.

Comment 25 Luke Kenneth Casson Leighton 2020-11-13 13:15:23 GMT

(In reply to Jacob Lifshay from comment #22)

> Once it is more complete, it should be possible to rapidly modify it to test
> different algorithms.

implication being that the complexity goes up as a result (adding "maintenance burden" to a list of tasks that we should never have had in the first place).

rather than being able to switch mental context to a different algorithm we become burdened with maintenance of the older algorithms, some of which we may have abandoned.

whereas short and compact *separate* throwaway programs are so short that duplication is perfectly acceptable; alterations do not affect the copy; if we decide to revisit an old abandoned algorithm we continue where it left off, confident that it has not been destroyed by newer ones.

code-reuse techniques are appropriate when there is an intention to maintain the code.  we have no intention of maintaining this code or using it beyond the immediate exploratory purpose.  its sole purpose is to capture an algorithm (saving time by not having to hand-draw 4 hour tables) and, once we have the insights we need, to be forgotten about.

Comment 26 Luke Kenneth Casson Leighton 2020-11-13 14:34:23 GMT

and - and this is the kicker - once an evaluated algorithm has been chosen (using multiple throwaway 1/2 day prototypes), *then* we decide whether to implement it in more depth (taking 10+ days) *if that is even needed*.

basically it's all about time-saving.