Bug 92 - Implement in order instruction retire refcounting
Summary: Implement in order instruction retire refcounting
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: Other Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 81
 
Reported: 2019-06-05 03:02 BST by Luke Kenneth Casson Leighton
Modified: 2022-06-16 14:53 BST
CC List: 1 user

See Also:
NLnet milestone: NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:



Description Luke Kenneth Casson Leighton 2019-06-05 03:02:46 BST
Shadows need to be cast across instructions to preserve instruction order. However, this is quite involved and needs two refcounts: one for registers, to protect in-order writes to the regfile, and another for the MemRefs. If CSR Dependency Management is used, one may be needed there too.

Detailed writeup here

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001660.html
Comment 1 Luke Kenneth Casson Leighton 2019-06-06 06:46:27 BST
The proposed idea has a problem: the counter compares are effectively CAM compares, requiring multiple AND gates to check that the high bits are nonzero.

Doing this for every single FU is just too expensive.
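To illustrate why the cost scales with the number of FUs, here is a minimal Python model. It assumes (this is not spelled out above) that each FU is tagged at issue time with a counter value and may commit only when that tag equals a global retire counter. commit_enables and comparator_xnor_gates are hypothetical names used only for illustration.

```python
from typing import List


def commit_enables(fu_tag: List[int], fu_done: List[bool],
                   retire_count: int) -> List[bool]:
    # One equality comparator per FU: this is the CAM-style compare.
    # An FU may commit only if it has finished AND its tag matches.
    return [done and tag == retire_count
            for tag, done in zip(fu_tag, fu_done)]


def comparator_xnor_gates(num_fus: int, counter_width: int) -> int:
    # Each equality comparator needs one XNOR per counter bit (plus an
    # AND tree to combine them), so cost grows as num_fus * counter_width.
    return num_fus * counter_width
```

For example, a hypothetical 16 FUs with a 5-bit counter would need 80 XNOR gates before the per-FU AND trees are even counted.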

An alternative idea is to preallocate, at instruction issue time, the register write port to which the instruction will commit its result.

This would be done by having a separate set of shadows for the write ports.

This unfortunately has the side effect of reducing parallelism: instructions may complete well before one another, and allowing them to commit via another port would clearly permit forward progress that is otherwise missed.

Hopefully operand forwarding will mitigate this somewhat.

There must be better alternatives.

Perhaps, even if the allocations are not to ports per se, they could at least identify "candidates for selection to commit to any port".

Or perhaps it is not a problem after all. Instruction order is to be preserved: if one shadow bank is not ready, allowing another to commit would mean committing out of order.

Needs more thought.
Comment 2 Luke Kenneth Casson Leighton 2019-06-06 07:03:33 BST
Documenting an alternative idea: inter-bank shadows and intra-bank shadows.

The intra-bank shadows are the stripes; the inter-bank shadow is dropped only when the last-numbered intra-bank shadow is dropped.
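One possible reading of this is sketched below in Python, under the assumption (not stated above) that intra-bank shadows can only be dropped in stripe order, so the last-numbered one dropping implies the whole bank has cleared. drop_intra_shadows and inter_bank_shadow_held are hypothetical helpers.

```python
from typing import List


def drop_intra_shadows(shadows: List[bool], done: List[bool]) -> List[bool]:
    # shadows[i] is True while stripe i in this bank is still shadowed.
    # A stripe's shadow drops only when its instruction is done AND all
    # lower-numbered stripes have already dropped theirs.
    result = list(shadows)
    for i in range(len(result)):
        if result[i] and not done[i]:
            break  # stripe i still held: later stripes must wait
        result[i] = False
    return result


def inter_bank_shadow_held(shadows: List[bool]) -> bool:
    # The inter-bank shadow is released only once the last-numbered
    # intra-bank shadow has been dropped.
    return shadows[-1]
```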
Comment 3 Luke Kenneth Casson Leighton 2019-06-06 22:16:26 BST
Another possible algorithm:
* round-robin insertion into shadow stripes
* round-robin examination of shadow stripe retirement.
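The two round-robin pointers can be modelled roughly as follows. This is a hypothetical Python sketch (the ShadowStripes class and its methods are illustrative, not existing code), assuming one instruction per stripe and that issue stalls when the next stripe is still occupied.

```python
class ShadowStripes:
    """Hypothetical model: one in-flight instruction per shadow stripe."""

    def __init__(self, num_stripes: int):
        self.n = num_stripes
        self.pending = [None] * num_stripes  # instruction held in each stripe
        self.insert_idx = 0                  # next stripe to insert into
        self.retire_idx = 0                  # next stripe to examine

    def insert(self, insn) -> bool:
        # Round-robin insertion: issue always fills the next stripe in turn.
        if self.pending[self.insert_idx] is not None:
            return False                     # stripe still occupied: stall issue
        self.pending[self.insert_idx] = insn
        self.insert_idx = (self.insert_idx + 1) % self.n
        return True

    def retire(self, done: set) -> list:
        # Round-robin examination: retire stripes in order starting at
        # retire_idx, stopping at the first empty or unfinished stripe.
        retired = []
        while True:
            insn = self.pending[self.retire_idx]
            if insn is None or insn not in done:
                break
            retired.append(insn)
            self.pending[self.retire_idx] = None
            self.retire_idx = (self.retire_idx + 1) % self.n
        return retired
```

Because both pointers only ever advance one stripe at a time, stripes are retired in exactly the order in which they were filled.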

Because the selection is round-robin, the retirement index of the shadow stripe that has no shadows left may correspond directly to the write port to the regfile.

This in turn would mean that, where previously a recursive multi-priority picker would have been necessary, simpler straight unary pickers might suffice.
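A "straight unary picker" is taken here, as an assumption, to mean a single-output priority picker that selects the lowest-numbered requester as a one-hot vector. The Python sketch below assumes each stripe presents a bit vector of candidates for its own write port; unary_pick and pick_per_port are hypothetical names.

```python
from typing import List


def unary_pick(requests: int) -> int:
    # Single-output priority picker: isolate the lowest set bit, giving a
    # one-hot selection of the lowest-numbered requester (0 if none).
    return requests & -requests


def pick_per_port(stripe_requests: List[int]) -> List[int]:
    # With stripe i feeding write port i directly, each port needs only
    # its own single-output picker instead of a recursive multi-picker.
    return [unary_pick(r) for r in stripe_requests]
```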

Maybe.

The round-robin retirement selector will need to be capable of recognising multiple simultaneous retirements.
Comment 4 Luke Kenneth Casson Leighton 2019-06-06 22:31:11 BST
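def retire_count(retiring, idx):
    # rotate so that bitsmod[0] is the stripe at round-robin index idx
    bitsmod = [retiring[(i + idx) % 4] for i in range(4)]
    if bitsmod[0] and not bitsmod[1]:
        idx_inc = 1
    elif bitsmod[0] and bitsmod[1] and not bitsmod[2]:
        idx_inc = 2
    elif bitsmod[0] and bitsmod[1] and bitsmod[2] and not bitsmod[3]:
        idx_inc = 3
    elif all(bitsmod):
        idx_inc = 4  # to be taken modulo 4 when advancing idx
    else:
        idx_inc = 0  # stripe at idx not ready: nothing commits
    return idx_inc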

A nonzero test on idx_inc determines whether the commit goes ahead.

The actual increment of the round-robin index is idx_inc mod 4.

In this way we know that, out of 4 possible commits, the order will be maintained, and we can detect up to 4 simultaneous commits in one cycle.
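The 4-way if/elif chain in Comment 4 is equivalent to counting how many consecutive stripes, starting at idx, are retiring. A hedged Python generalisation to any number of stripes is sketched below; round_robin_retire is a hypothetical name, and in hardware this would still be a fixed chain of gates rather than a loop.

```python
from typing import List, Tuple


def round_robin_retire(retiring: List[bool], idx: int) -> Tuple[int, int]:
    # Count consecutive retiring stripes starting at round-robin index idx.
    # That count is idx_inc: nonzero means the commit goes ahead.
    n = len(retiring)
    idx_inc = 0
    while idx_inc < n and retiring[(idx + idx_inc) % n]:
        idx_inc += 1
    # The round-robin index advances by idx_inc modulo the stripe count.
    return idx_inc, (idx + idx_inc) % n
```

For 4 stripes this returns the same idx_inc as Comment 4's chain, along with the updated index.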

The multi-priority picker may still be needed, because we still do not know which FU is writable. Unless... the shadow stripes are only presenting one per port?

This might actually work!
Comment 6 Luke Kenneth Casson Leighton 2019-06-19 13:08:24 BST
An augmentation of this concept, for when the number of instructions issued does not equal the number of write ports:

The round-robin index inserted into each stripe can be taken modulo the number of write ports.

With 6 stripes and 4 write ports, if 2 instructions are issued, the target write port numbers to use are 0 into stripe 0 and 1 into stripe 1.

On the next clock, 4 are issued: 2 into stripe 2, 3 into stripe 3, 0 into stripe 4, 1 into stripe 5.

And so on.
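The worked example can be reproduced with a short Python sketch, assuming both the stripe index and the write port number advance by one per issued instruction, wrapping at the number of stripes and the number of write ports respectively. assign_ports is a hypothetical helper.

```python
from typing import List, Tuple


def assign_ports(issue_counts: List[int], num_stripes: int,
                 num_write_ports: int) -> List[List[Tuple[int, int]]]:
    # Returns, for each clock, the (write_port, stripe) pairs assigned.
    stripe = port = 0
    schedule = []
    for count in issue_counts:
        this_clock = []
        for _ in range(count):
            this_clock.append((port, stripe))
            port = (port + 1) % num_write_ports   # modulo write ports
            stripe = (stripe + 1) % num_stripes   # modulo stripes
        schedule.append(this_clock)
    return schedule
```

assign_ports([2, 4], 6, 4) gives [(0, 0), (1, 1)] on the first clock and [(2, 2), (3, 3), (0, 4), (1, 5)] on the second, matching the text. A further clock of 2 would place port 2 into stripe 0 and port 3 into stripe 1.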