92 – Implement in order instruction retire refcounting

Status:	CONFIRMED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Luke Kenneth Casson Leighton

URL:

Depends on:
Blocks:	81

Reported:	2019-06-05 03:02 BST by Luke Kenneth Casson Leighton
Modified:	2023-02-13 10:31 GMT
CC List:	1 user

See Also:	737
NLnet milestone:	NLnet.2019.02.012
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0

Description Luke Kenneth Casson Leighton 2019-06-05 03:02:46 BST

Shadows need to be cast across instructions to preserve instruction order. The details are intricate: two refcounts are needed, one for registers, to protect in-order writes to the regfile, and another for the MemRefs. If CSR Dependency Management is used, a further one may be needed there too.

Detailed writeup here

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001660.html

Comment 1 Luke Kenneth Casson Leighton 2019-06-06 06:46:27 BST

The proposed idea has a problem: the counter compares are effectively CAM compares, needing multiple AND gates to check that the high bits are nonzero.

Repeated for every single FU, this is just too expensive.

An alternative idea is to preallocate, at instruction issue time, the register write port to which the instruction will commit its result.

This would be done by having a separate set of shadows for the write ports.

Unfortunately this has the side effect of reducing parallelism: instructions may complete well before each other, and allowing a completed instruction to commit via another port would clearly give forward progress that is otherwise missed.

Hopefully operand forwarding will mitigate this somewhat.

There must be better alternatives.

Perhaps the allocations need not be to ports per se, but could at least identify "candidates for selection to commit to any port".

Or, perhaps, it is not a problem after all. Instruction order is to be preserved. If one shadow bank is not ready, allowing another to commit would mean getting out of order.

Needs more thought.

Comment 2 Luke Kenneth Casson Leighton 2019-06-06 07:03:33 BST

Documenting an alternative idea: inter-bank shadows and intra-bank shadows.

Intra-bank shadows are stripes; an inter-bank shadow is dropped only when the last-numbered intra-bank shadow is dropped.

Comment 3 Luke Kenneth Casson Leighton 2019-06-06 22:16:26 BST

Another possible algorithm:
* round robin insertion into shadow stripes
* round robin examination of shadow stripe retirement.

Because retirement is round robin, the retirement index (which shadow stripe has no shadows left) may correspond directly to the write port on the regfile.

This in turn would mean that where previously it would have been necessary to have a recursive multi priority picker, it might be possible to have simpler straight unary pickers.

Maybe.

The round robin retirement selector will need to be capable of recognising multiple simultaneous retirements.

Comment 4 Luke Kenneth Casson Leighton 2019-06-06 22:31:11 BST

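# retiring and idx are assumed to be defined already:
# retiring holds one bit per shadow stripe, idx is the round robin index.
bitsmod = [retiring[(i + idx) % 4] for i in range(4)]

idx_inc = 0  # default: the stripe at idx is not retiring, so nothing commits
if bitsmod[0] and not bitsmod[1]:
    idx_inc = 1
elif bitsmod[0] and bitsmod[1] and not bitsmod[2]:
    idx_inc = 2
elif bitsmod[0] and bitsmod[1] and bitsmod[2] and not bitsmod[3]:
    idx_inc = 3
elif all(bitsmod):
    idx_inc = 4  # to be taken modulo 4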

A nonzero test on idx_inc determines whether the commit goes ahead.

The actual increment on the round robin index is idx_inc mod 4.

In this way we know that out of 4 possible commits the order will be maintained, plus we can detect up to 4 simultaneous commits in one cycle.
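
A minimal software sketch of the selection logic above, generalised to any number of stripes. The names retire_count and step are hypothetical and do not come from the Libre-SOC source. It assumes retiring is a list of booleans, one per shadow stripe, and that a true entry means the stripe is ready to retire. Real hardware would do this combinatorially rather than with a loop.

```python
def retire_count(retiring, idx):
    """Count consecutive retiring stripes, starting at round robin index idx.

    Counting stops at the first stripe that is not retiring, which is
    what keeps the commits in issue order.
    """
    n = len(retiring)
    count = 0
    for i in range(n):
        if not retiring[(idx + i) % n]:
            break
        count += 1
    return count


def step(retiring, idx):
    """Return (commit, idx_inc, next_idx) for one clock cycle."""
    n = len(retiring)
    idx_inc = retire_count(retiring, idx)
    commit = idx_inc != 0            # nonzero test: does any commit go ahead?
    next_idx = (idx + idx_inc) % n   # a full sweep of n wraps back to idx
    return commit, idx_inc, next_idx


# Stripes 2, 3 and 0 are retiring, stripe 1 is not; the index is at 2.
print(step([True, False, True, True], 2))  # (True, 3, 1)
```

Note that when the stripe at idx is not retiring, nothing commits even if later stripes are ready, which is the in-order guarantee described above.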

The multi priority picker may still be needed because we still do not know which FU is writable. Unless... the shadow stripes are only presenting one per port?

This might actually work!

Comment 5 Luke Kenneth Casson Leighton 2019-06-06 23:08:34 BST

http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001705.html

Comment 6 Luke Kenneth Casson Leighton 2019-06-19 13:08:24 BST

An augmentation of this concept, for when the number of instructions issued does not equal the number of write ports.

The round robin index to be inserted into each stripe can be modulo the number of write ports.

Example with 6 stripes and 4 write ports: if 2 instructions are issued, the target write port numbers to use are 0 into stripe 0, 1 into stripe 1.

On the next clock, 4 issued: 2 into stripe 2, 3 into stripe 3, 0 into stripe 4, 1 into stripe 5.

And so on.
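
A small sketch of the allocation described in this comment, reproducing the 6 stripe, 4 write port example. The class StripeAllocator and its method issue are hypothetical names for illustration only. It assumes two independent round robin counters: one for the stripe, wrapping at the number of stripes, and one for the target write port, wrapping at the number of write ports. The comment does not say how the port number continues once the stripes wrap, so that part is an assumption.

```python
class StripeAllocator:
    """Hypothetical model of round robin stripe and write port allocation."""

    def __init__(self, n_stripes, n_write_ports):
        self.n_stripes = n_stripes
        self.n_write_ports = n_write_ports
        self.stripe_idx = 0  # next shadow stripe to insert into
        self.port_idx = 0    # next target write port number (assumed separate counter)

    def issue(self, n_issued):
        """Allocate n_issued instructions; return a list of (stripe, port)."""
        allocated = []
        for _ in range(n_issued):
            allocated.append((self.stripe_idx, self.port_idx))
            self.stripe_idx = (self.stripe_idx + 1) % self.n_stripes
            self.port_idx = (self.port_idx + 1) % self.n_write_ports
        return allocated


alloc = StripeAllocator(n_stripes=6, n_write_ports=4)
print(alloc.issue(2))  # [(0, 0), (1, 1)]
print(alloc.issue(4))  # [(2, 2), (3, 3), (4, 0), (5, 1)]
```

The two calls correspond to the two clock cycles in the example: the first pair is (stripe 0, port 0) and (stripe 1, port 1), and the next four continue from stripe 2 with the port number wrapping after 3.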