Needs a detailed analysis to decide cache sizes. Also need to calculate the die area utilised.
Default config now is 32 KB D-cache and 32 KB I-cache. This will work unless a microcontroller-class device is needed. For heavier multi-tasking a larger I-cache could be considered, but that is unlikely for our target segment. Associativity also needs to be decided. An L2 of 256-512 KB should be sufficient for a single core but will be low for a quad or octa core; those may need 2-4 MB. Shared vs. dedicated L2 also has to be decided; for the proposed workloads, a shared L2 should be sufficient.
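For comparing the options above, here is a rough back-of-the-envelope sketch of total cache SRAM and area. The bit-cell area and tag/peripheral overhead factor are assumptions for illustration only; real figures depend entirely on the process node and memory compiler.

```python
# Rough cache SRAM size / area estimate for the configurations discussed above.
# BITCELL_UM2 and OVERHEAD are illustrative assumptions, not process data.

BITCELL_UM2 = 0.12   # assumed high-density 6T SRAM bit-cell area (um^2), illustrative
OVERHEAD    = 1.25   # assumed factor for tags, valid/dirty bits and peripheery logic

def cache_area_mm2(size_bytes):
    bits = size_bytes * 8 * OVERHEAD
    return bits * BITCELL_UM2 / 1e6       # um^2 -> mm^2

KB, MB = 1024, 1024 * 1024
configs = {
    "L1 32K I + 32K D per core, x8 cores": 8 * (32*KB + 32*KB),
    "L2 512K per 2-core cluster, x4":      4 * 512*KB,
    "L2 2 MB shared":                      2 * MB,
    "L2 4 MB shared":                      4 * MB,
}
for name, size in configs.items():
    print(f"{name:38s} {size/KB:6.0f} KB  ~{cache_area_mm2(size):5.2f} mm^2")
```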
Thanks Madhu. Sorry, I forgot to add details for this one: M-class, so yes, 8 cores (more if some are dedicated to 3D and/or video). Which reminds me, the memory bandwidth must cover two 1080p60 framebuffers (~355 MB/sec each at 24bpp) plus enough for MPEG / MP4 / H.264/5 etc. video decoding. The vga_lcd OpenCores video engine has a built-in FIFO buffer to cover irregularities in memory read/write accesses; my understanding though is that the peripherals will still need to go through the (shared) L2 cache. 2-4 MB is a bit... big, isn't it? (Understatement: it's enormous, from what I understand of cache sizes.) Looking up for example the i7 cache sizes: L1 64 KB dedicated, L2 256 KB dedicated, L3 4-24 MB, and that last one is definitely enormous. ARM Cortex-A57 is: L1 48 KB dedicated instruction, 3-way associative; 32 KB dedicated data, 2-way associative; L2 512 KB - 2 MB, 16-way set associative, but only up to 4 cores there. So if ARM's strategy is anything to go by, what you propose sounds perfectly sane for an 8-core+ SMP SoC.
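Quick sanity check on the framebuffer figure quoted above (the display parameters match the numbers in the comment; the video-decode allowance is just a guessed placeholder, not a measured requirement):

```python
# Scan-out bandwidth for two 1080p60 framebuffers at 24bpp.
WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL = 1920, 1080, 60, 3   # 24bpp = 3 bytes/pixel

per_fb = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS            # bytes per second
print(f"one 1080p60 framebuffer : {per_fb / 2**20:6.1f} MiB/s")        # ~356 MiB/s
print(f"two framebuffers        : {2 * per_fb / 2**20:6.1f} MiB/s")

# Assumed (illustrative) extra allowance for H.264/H.265 reference-frame traffic;
# the real figure depends on the codec, bitrate and content.
DECODE_HEADROOM = 500 * 2**20
print(f"with decode headroom    : {(2 * per_fb + DECODE_HEADROOM) / 2**20:6.1f} MiB/s")
```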
Most quad- to octa-core designs are in the shared 1-4 MB range; lower values tend to be used in networking scenarios where locality of reference is low. The preferable config is probably 2-core clusters with 512 KB shared between the two cores. That can drop to 256 KB if size is an issue. It makes cache coherency more painful, but routing, I think, will be easier.
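For reference, the totals implied by the clustered layout above (straightforward arithmetic, no assumptions beyond the sizes already quoted):

```python
# Total shared L2 for an 8-core part built as 2-core clusters.
CORES, CORES_PER_CLUSTER = 8, 2
clusters = CORES // CORES_PER_CLUSTER
for per_cluster_kb in (512, 256):
    total_kb = clusters * per_cluster_kb
    print(f"{clusters} clusters x {per_cluster_kb} KB = {total_kb} KB total L2 ({total_kb/1024:.1f} MB)")
```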
(In reply to Madhu from comment #3)
> Most quad- to octa-core designs are in the shared 1-4 MB range; lower values
> tend to be used in networking scenarios where locality of reference is low.

Ah, ok. So we need higher.

> The preferable config is probably 2-core clusters with 512 KB shared between
> the two cores. That can drop to 256 KB if size is an issue. It makes cache
> coherency more painful, but routing, I think, will be easier.

OK, cool.
http://people.csail.mit.edu/beckmann/publications/tech.../grain_size_tr_feb_2010.pdf — I saw this and thought it might be useful: he says that DRAM cells for the L2 cache win out every time.
https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture — the cache arrangement there is quite neat: very simple, and it needs a manual-intervention assembly instruction to retain synchronisation.

If splitting down into independent L2 caches (one per core pair), how would those be synchronised? Four buses which communicate write addresses to each of the other L2 caches, to detect when write clashes occur? (A toy model of that idea is sketched below.)

BTW, interestingly, in speaking with Alan Cox some time ago he mentioned that Linux SMP kernels actually do not have much in the way of clashes. Hardware spinlocks might be useful: those are really the only places where SMP synchronisation is used in the Linux kernel.
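To make the question concrete, here is a toy software model (hypothetical names, illustrative only, not any existing RTL or protocol spec) of the broadcast-invalidate idea: each per-cluster L2 puts write addresses on a shared snoop bus, and every other L2 invalidates a matching line so its next read misses and refetches the up-to-date data.

```python
# Toy model of write-invalidate snooping between independent per-cluster L2 caches.

class L2Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                      # line address -> cached data

    def read(self, addr, memory):
        if addr not in self.lines:           # miss: refetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, data, memory, snoop_bus):
        self.lines[addr] = data
        memory[addr] = data                  # write-through, for simplicity
        snoop_bus.broadcast_invalidate(self, addr)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)           # drop any stale copy

class SnoopBus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast_invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.snoop_invalidate(addr)

memory = {0x1000: 0}
clusters = [L2Cache(f"L2-{i}") for i in range(4)]   # 4 x 2-core clusters
bus = SnoopBus(clusters)

clusters[0].read(0x1000, memory)             # cluster 0 caches the old value
clusters[1].write(0x1000, 42, memory, bus)   # cluster 1 writes -> others invalidated
print(clusters[0].read(0x1000, memory))      # refetches from memory, prints 42
```

The cost of this scheme is that every write generates snoop traffic to all the other clusters, which is exactly the coherency pain mentioned in comment #3; a directory-based scheme would cut the broadcast traffic at the cost of extra state and complexity.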