Bug 16 - decide L1 and L2 cache sizes
Summary: decide L1 and L2 cache sizes
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2018-03-15 13:45 GMT by Luke Kenneth Casson Leighton
Modified: 2020-08-19 00:54 BST
CC: 2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2018-03-15 13:45:04 GMT
needs a detailed analysis to decide cache sizes.
also need to calculate the die area utilised.
Comment 1 Madhu 2018-03-16 07:35:30 GMT
Default config now is 32 KB D and 32 KB I cache. This will work unless a microcontroller-class device is needed. For heavier multi-tasking a larger I cache can be considered, but that is unlikely for our target segment. Associativity also to be decided.

L2 of 256-512 KB should be sufficient for a single core but will be low for a quad or octa core; may need 2-4 MB. Also shared vs dedicated L2 has to be decided; for the proposed workloads, shared L2 should be sufficient.
Comment 2 Luke Kenneth Casson Leighton 2018-03-17 11:11:37 GMT
thanks madhu.  sorry i forgot to add details for this one, m-class,
so yes 8 cores (more if some are dedicated to 3D and/or Video).
which reminds me, the memory bandwidth must cover two 1080p60 framebuffers
(355 mbytes/sec each @ 24bpp) and also enough for the MPEG / MP4 / H.264/5
etc. video decoding.
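
as a quick sanity check on that bandwidth figure, here is a rough
back-of-envelope calculation (a sketch in python; the 1920x1080 geometry
is just assumed from the 1080p60 figure above):

    # rough framebuffer bandwidth check (sketch only)
    width, height, fps, bytes_per_pixel = 1920, 1080, 60, 3   # 1080p60 at 24bpp
    per_fb = width * height * fps * bytes_per_pixel           # bytes/sec for one framebuffer
    print(per_fb / 2**20)      # ~356 MiB/s, i.e. the ~355 mbytes/sec quoted above
    print(2 * per_fb / 2**20)  # two framebuffers: ~712 MiB/s before any video-decode traffic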

the vga_lcd opencores video engine has a built-in FIFO buffer to
cover irregularities in memory read/write accesses; i understand,
though, that the peripherals will still need to go through the
(shared) L2 cache.

2-4 MB is a bit... big, isn't it? (understatement: it's eeennormous,
from what i understand of cache sizes)

i'm just looking up, for example, the i7 cache sizes: L1 64k dedicated,
L2 256k dedicated, L3 4-24 MB, and that's definitely enormous.

ARM Cortex A57 is L1 48k dedicated instruction 3-way associative,
32k dedicated data 2-way associative, L2 512k - 2MB 16-way set associative,
but only up to 4 cores there.

so if ARM's strategy is anything to go by, what you propose sounds
perfectly sane for an 8-core+ SMP SoC.
Comment 3 Madhu 2018-03-17 11:42:29 GMT
Most quad to octa core designs are in the shared 1-4 MB range; lower values tend to be used in networking scenarios where locality of reference is low. The preferable config is probably 2-core clusters with 512K shared between the two cores. Can drop that to 256KB if size is an issue. Makes cache coherency more painful but routing, I think, will be easier.
Comment 4 Luke Kenneth Casson Leighton 2018-03-17 12:23:26 GMT
(In reply to Madhu from comment #3)

> Most quad to octa core designs are in the shared 1-4 MB range; lower values
> tend to be used in networking scenarios where locality of reference is low.

ahh ok.  so we need higher.

> The preferable
> config is probably 2-core clusters with 512K shared between the two cores.
> Can drop that to 256KB if size is an issue. Makes cache coherency more
> painful but routing, I think, will be easier.

ok cool.
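
to put rough numbers on that (a sketch only: 8 cores as per comment #2,
32k+32k L1 per core as per comment #1, 2-core clusters with 512k of
shared L2 as proposed above):

    # rough tally of on-chip cache SRAM for the proposed config (sketch only)
    cores = 8
    l1_per_core_kib = 32 + 32                      # 32 KB I + 32 KB D per core
    cluster_size = 2                               # 2-core clusters sharing one L2
    l2_per_cluster_kib = 512                       # 512 KB per cluster (256 KB fallback)
    clusters = cores // cluster_size
    total_l1_kib = cores * l1_per_core_kib         # 512 KiB of L1 in total
    total_l2_kib = clusters * l2_per_cluster_kib   # 2048 KiB = 2 MiB of L2 in total
    print(total_l1_kib, total_l2_kib)              # -> 512 2048

so the full 8-core chip still ends up around the 2 MB mark of L2 overall,
just split into per-cluster slices (1 MB if the 256 KB fallback is used).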
Comment 5 Luke Kenneth Casson Leighton 2018-03-21 09:46:02 GMT
http://people.csail.mit.edu/beckmann/publications/tech.../grain_size_tr_feb_2010.pdf

i saw this and thought it might be useful: the author argues that DRAM
cells for the L2 cache win out every time.
Comment 6 Luke Kenneth Casson Leighton 2018-03-27 08:35:08 BST
https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture

the cache arrangement there is quite neat: very simple, and it needs a
manual-intervention assembly instruction to retain synchronisation.

if splitting down into independent L2 caches (one per core pair), how
would those be synchronised? 4 buses which communicate address writes
to each of the other L2 caches, to detect when write clashes occur?
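
as a strawman only (not a proposal of the actual bus design), the usual
answer to that is write-invalidate snooping: each L2 broadcasts the
address of every write, and the other L2s drop any matching line they
hold. illustrative python sketch, with made-up names:

    # toy write-invalidate snoop between per-cluster L2 caches (illustrative only)
    class L2Cache:
        def __init__(self, name):
            self.name = name
            self.lines = {}                       # addr -> data held by this cluster's L2

        def local_write(self, addr, data, peers):
            self.lines[addr] = data               # update our own copy
            for peer in peers:                    # broadcast the address on the snoop bus
                if peer is not self:
                    peer.snoop_invalidate(addr)

        def snoop_invalidate(self, addr):
            self.lines.pop(addr, None)            # drop any stale copy of that line

    caches = [L2Cache(f"cluster{i}") for i in range(4)]  # 8 cores -> 4 two-core clusters
    caches[0].lines[0x1000] = "old"
    caches[1].local_write(0x1000, "new", caches)         # cluster 1 writes the line
    print(0x1000 in caches[0].lines)                     # -> False: cluster 0's copy invalidated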

btw, interestingly, in speaking with alan cox some time ago he mentioned
that linux SMP kernels really do not have much in the way of clashes.
hardware spinlocks might be useful; those are the only real places where
SMP contention actually occurs in the linux kernel.