Bug 376 - Assess 40/45 nm 2022 target and interfaces
Summary: Assess 40/45 nm 2022 target and interfaces
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Milestones
Version: unspecified
Hardware: PC Mac OS
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks:
 
Reported: 2020-06-12 14:30 BST by Luke Kenneth Casson Leighton
Modified: 2020-06-15 18:46 BST
CC: 6 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Attachments

Description Luke Kenneth Casson Leighton 2020-06-12 14:30:24 BST
we need to assess the interfaces and capabilities for the 2022 target ASIC
https://libre-soc.org/45nm_Fall2022/
Comment 1 Luke Kenneth Casson Leighton 2020-06-12 14:31:43 BST
from discussion at http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-June/008021.html

action points (edit and add as appropriate):

* to make a decision as to what top level of visual performance is needed (maximum resolution, maximum framerate)
* calculate the framebuffer data transfer rate for that.
* to decide some top level of GFLOPs and MTriangles/sec (and other waffly figures for the GPU side)
* estimate the memory bandwidth for the GPU side
* decide if we want to develop a hardware compression algorithm between memory and framebuffer (an augmentation of Richard Herveille's RGBTTL framebuffer RTL)
* find out the power consumption in 45nm for these SERDES
* talk to Rudi to see if he can do a SERDES PHY for us
* talk to SymbioticEDA likewise
* talk to Dmitri from LIP6.fr to see what he can do, whether a stable 50 GHz clock is achievable in 45nm (i understand from Dmitri that for a given target clock - 25 gbit/sec in this case - you need the PLL to do double the clockrate)
Comment 2 Yehowshua 2020-06-12 14:52:49 BST
> 4k is 1200 megabytes per second: 37.5% *percent* of the total memory
> bandwidth consumed by a 4k 60fps framebuffer!  and if people wanted
> 120 fps instead, that's now a staggering 75% of a single OMI interface
> dedicated to the framebuffer!

Raptor was saying that they were really looking for 1080p @60fps.
So we shouldn't be worried about 4k for now.
Comment 3 Luke Kenneth Casson Leighton 2020-06-12 16:14:05 BST
(In reply to Yehowshua from comment #2)
> > 4k is 1200 megabytes per second: 37.5% *percent* of the total memory
> > bandwidth consumed by a 4k 60fps framebuffer!  and if people wanted
> > 120 fps instead, that's now a staggering 75% of a single OMI interface
> > dedicated to the framebuffer!
> 
> Raptor was saying that they were really looking for 1080p @60fps.
> So we shouldn't be worried about 4k for now.

ok great.  so that's only (!) 500 mbytes / sec.  just to check:
>>> 1920*1080*4
8294400
>>> 1920*1080*4*60
497664000
>>> 1920*1080*4*60/1e6
497.664

500*8 = 4 gigabits / sec.  so any PCIe (or other SERDES) would need to be
a minimum 4 GHz just to transfer the memory required to keep the framebuffer
100% occupied.
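
as a reusable sanity-check, a small sketch of the same arithmetic (assuming
4 bytes per pixel and no compression; the helper name is illustrative only):

    def framebuffer_bw(width, height, fps, bytes_per_pixel=4):
        """bytes/sec needed to continuously scan out one framebuffer"""
        return width * height * bytes_per_pixel * fps

    print(framebuffer_bw(1920, 1080, 60) / 1e6)      # ~497.7 mbytes/sec
    print(framebuffer_bw(1920, 1080, 60) * 8 / 1e9)  # ~4.0 gbit/sec raw

same numbers as above: ~500 mbytes/sec, ~4 gbit/sec before any protocol overhead.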

i am inclined to seriously recommend at least two separate OMI interfaces
(with the option to use them in parallel), one to be connected to a video
DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
with the option to say "i don't care about that, just use one or both for
general shared memory"

in SBC circumstances, where power is a concern and speed is not, a single
lane would be used, to a single OMI chip.

in SBC circumstances where power is not a concern and speed is, but there
is a requirement for higher-performance general-purpose CPU/GPU workloads
without demanding video, the two lanes would be dedicated to talk in parallel
to a single dual-ported OMI chip

*or*

the two lanes would be connected to *striped* memory (interleave odd/even
banks using low-bits of the address), across two separate OMI DRAMs

in the "video card" scenario, the two lanes would again connect to separate
OMI DRAMs, *however*, the framebuffer would *specifically* be mapped to
*only* one of those OMI DRAMs, and the OS memory specifically mapped to
the *other* OMI DRAM.

supporting this would need some quite complex dynamic Wishbone Bus Arbiters,
with an awful lot of MUXes on the address and data lines.  great care will
be needed to ensure that latency is not introduced.
Comment 4 Cole Poirier 2020-06-12 16:27:20 BST
(In reply to Luke Kenneth Casson Leighton from comment #3) 
> i am inclined to seriously recommend at least two separate OMI interfaces
> (with the option to use them in parallel), one to be connected to a video
> DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
> with the option to say "i don't care about that, just use one or both for
> general shared memory"

I think this is an especially good idea, given the performance loss APUs usually suffer from using system RAM. It also makes sense to have the faster, wider memory bus for the graphics data, so that we can have a less power-hungry 'general' data bus (as in no 3D graphics, nor anything requiring MASSIVE data paths and therefore lots of power-hungry SERDES). Is this correct?

> in SBC circumstances, where power is a concern and speed is not, a single
> lane would be used, to a single OMI chip.
[snip]
> in the "video card" scenario, the two lanes would again connect to separate
> OMI DRAMs, *however*, the framebuffer would *specifically* be mapped to
> *only* one of those OMI DRAMs, and the OS memory specifically mapped to
> the *other* OMI DRAM.

I think this reconfiguration is key to expanding the market segments the chip is suitable for use in.

> supporting this would need some quite complex dynamic Wishbone Bus Arbiters,
> with an awful lot of MUXes on the address and data lines.  great care will
> be needed to ensure that latency is not introduced.

Should be quite an interesting and very complex challenge, hopefully for some new recruits and not just you, Luke.
Comment 5 Luke Kenneth Casson Leighton 2020-06-12 17:17:26 BST

> On Friday, June 12, 2020, Cole Poirier <colepoirier@gmail.com> wrote:
> On Jun 12 2020, at 7:06 am, Hendrik Boom <hendrik@topoi.pooq.com> wrote:
> > It's just possible that we may not *need* 4k at 120 fps.
> > Certainly there are many potential applications for our chip that
> > don't need that kind of video. 

> I concur. If we are planning on selling 100 million units, shouldn't we
> have several different levels of IO/GPU capability, each targeting
> different power requirements? For example, a phone or a tablet doesn't
> need 4k 120Hz video, right?

correct.  this is what the original quad core was intended for.  RGBTTL connecting directly to an 800x600 low cost LCD, or, via a TI SN75LVDS83b, to a 1024x600 or 1280x800 LCD.  or a Solomon SSD2828 to do MIPI.

the GPU and framebuffer requirements for such portable devices are far lower.  you can even get away with only a 30fps update speed, halving the framebuffer power and GPU requirements.
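
for scale, reusing the arithmetic from comment #3 (4 bytes/pixel assumed, as there):

    >>> 1280*800*4*30/1e6    # largest of the LCDs above, at 30fps
    122.88
    >>> 800*600*4*30/1e6     # the low-cost 800x600 panel
    57.6

i.e. roughly a quarter or less of the ~500 mbytes/sec needed for 1080p60.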

thus, only 1x 32 bit DDR 800 mhz RAM interface would be perfectly sufficient... for *that* scenario (hence the $4 target price), and it would only need around a 350mW power budget (the DRAMs themselves, that is.  the SoC DRAM drivers would i *think* be around an additional 150mW. have to check).


the target being discussed here, by virtue of having interfaces that at full speed consume an estimated 8 watts, is far outside a tablet/smartphone 100-million+ volume power budget.

it is a radically different market: the GPU / Graphics Card market, basically.

*if correct*, with the (unconfirmed, anticipated) higher power demands, a plastic package is in no way going to be adequate.

it will have to be ceramic, with a metal top.  i'd also suggest a minimum 25 mm square, to help with thermal dissipation.

the actual pincount, due to the SERDES, might be quite reasonable (except there's a lot of them): maybe 400 to 500 or so.

we will need people who know exactly what they are doing, here, who have done this type of high power high speed ASIC before.

Rudi is the person who springs to mind, not just for the technical capability and 25+ years' experience: he also likes what we are doing.

we need some numbers.
Comment 6 Cole Poirier 2020-06-12 17:29:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> correct.  this is what the original quad core was intended for.
[snip]
> thus, only 1x 32 bit DDR 800 mhz RAM interface would be perfectly
> sufficient... for *that* scenario (hence the $4 target price), and it would
> only need around a 350mW power budget (the DRAMs themselves, that is.  the
> SoC DRAM drivers would i *think* be around an additional 150mW. have to
> check).

Thanks for this clarification, helps a lot for understanding our product's (or now products' ?) market segmentation.

> the target being discussed here, by virtue of having interfaces that at full
> speed consume an estimated 8 watts, these are far outside a
> tablet/smartphone 100 million+ volumes power budget.
> 
> it is basically a radically different market: GPU Graphics Card market,
> basically.

Oh, that also makes a lot more sense. Kanban will definitely help me keep track of and make sense of these related but separate 'tracks'.
> 
> *if correct*, with the (unconfirmed, anticipated) higher power demands, a
> plastic package is in no way going to be adequate.
> 
> it will have to be ceramic, with a metal top.  i'd also suggest a minimum 25
> mm square, to help with thermal dissipation.
> 
> the actual pincount, due to the SERDES, might be quite reasonable (except
> there's a lot of them): maybe 400 to 500 or so.
> 
> we will need people who know exactly what they are doing, here, who have
> done this type of high power high speed ASIC before.
> 
> Rudi is the person who springs to mind, not just from the technical
> capability and 25+ years experience, he also likes what we are doing.
> 
> we need some numbers.

Wow, that is radically different, and very, very exciting! I'm gladdened to hear of Rudi's experience as well as his affinity towards our project. Is it possible that he may know some other people who would be interested in either assisting him or working on another part of the project (receiving donations for all of their completed tasks, of course)? Or, if we are able to find some more investors, we could pay them for new project requirements that are not covered under the scope of our existing NLnet 2018 and 2019 grants.
Comment 7 Jacob Lifshay 2020-06-12 18:26:45 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> i am inclined to seriously recommend at least two separate OMI interfaces
> (with the option to use them in parallel), one to be connected to a video
> DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
> with the option to say "i don't care about that, just use one or both for
> general shared memory"

I strongly disagree with the idea of partitioning memory interfaces between GPU and CPU tasks, since the processors are designed to do both kinds of tasks well and splitting the memory interfaces up is basically saying: "all the CPU tasks can't use as much memory bandwidth as GPU tasks just because we say so". Also, I envision people building new varieties of tasks that are a hybrid between CPU and GPU tasks, so it will be harder to differentiate them because of things that fall in the grey areas.
Comment 8 Cole Poirier 2020-06-12 18:34:40 BST
(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > i am inclined to seriously recommend at least two separate OMI interfaces
> > (with the option to use them in parallel), one to be connected to a video
> > DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
> > with the option to say "i don't care about that, just use one or both for
> > general shared memory"
> 
> I strongly disagree with the idea of partitioning memory interfaces between
> GPU and CPU tasks, since the processors are designed to do both kinds of
> tasks well and splitting the memory interfaces up is basically saying: "all
> the CPU tasks can't use as much memory bandwidth as GPU tasks just because
> we say so". Also, I envision people building new varieties of tasks that are
> a hybrid between CPU and GPU tasks, so it will be harder to differentiate
> them because of things that fall in the grey areas.

I hadn't considered this. I think that this should be carefully discussed and considered with someone who is an expert on data and on hardware interconnects/buses. Rudi is one such person, right?

I think the idea of new varieties of hybrid cpu-gpu algorithms is a very interesting and compelling one. I'd be very interested in what ideas other experts like those on comp.arch would have.
Comment 9 Jacob Lifshay 2020-06-12 18:44:01 BST
Additional note: Gigabit ethernet requires 4 tx and 4 rx serdes assuming we have the PHY integrated on-chip. All 4 twisted pairs are operated in both directions simultaneously.

See section in https://en.wikipedia.org/wiki/Gigabit_Ethernet#1000BASE-T

If we're integrating our own PHYs, it'd be nice to support 2.5G, 5G, and/or 10G as well.
Comment 10 Cole Poirier 2020-06-12 19:00:48 BST
(In reply to Jacob Lifshay from comment #9)
> Additional note: Gigabit ethernet requires 4 tx and 4 rx serdes assuming we
> have the PHY integrated on-chip. All 4 twisted pairs are operated in both
> directions simultaneously.
> 
> See section in https://en.wikipedia.org/wiki/Gigabit_Ethernet#1000BASE-T
> 
> If we're integrating our own PHYs, it'd be nice to support 2.5G, 5G, and/or
> 10G as well.

Our own PHYs as in having Rudi and SymbioticEDA design them, or actually doing them ourselves? ... because from the discussion of this a few months ago it seems like designing our own PHYs would be a 5+ year undertaking.

The higher speed ethernet interfaces will be very important for the higher power consumption higher-end market segments. Good idea.
Comment 11 Luke Kenneth Casson Leighton 2020-06-12 19:01:09 BST
(In reply to Jacob Lifshay from comment #7)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > i am inclined to seriously recommend at least two separate OMI interfaces
> > (with the option to use them in parallel), one to be connected to a video
> > DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
> > with the option to say "i don't care about that, just use one or both for
> > general shared memory"
> 
> I strongly disagree with the idea of partitioning memory interfaces between
> GPU and CPU tasks,

ah that is a misunderstanding, to clarify:

> > arallel), one to be connected to a video
> > DRAM chip,

this for the video framebuffer and the video framebuffer only.

> > DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,

this is for the OS, which in our case, because it is a hybrid CPU-VPU-GPU, automatically and inherently includes CPU tasks, GPU tasks and VPU tasks.
Comment 12 Jacob Lifshay 2020-06-12 19:08:35 BST
Assuming we're building a higher than 2W version, I think we should double the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit, and add more cores to 8 or 16 or more cores. The div and transcendental pipelines probably wouldn't need to be expanded, since they would still have enough throughput. Also, it might be worthwhile to add one more core and disable one at manufacture time as a way to increase yield.

If possible, I think we should implement a cache-coherent memory interface over the PCIe bus so we can just have the host's linux kernel schedule processes as usual between the host POWER9 processor and the GPU, acting like the GPU is just more processor cores to schedule. That will allow nice things like using the GPU as extra CPU cores to help accelerate compile or other tasks and being able to use Kazan without needing lots of in-kernel GPU queues and stuff.
Comment 13 Jacob Lifshay 2020-06-12 19:13:22 BST
(In reply to Luke Kenneth Casson Leighton from comment #11)
> (In reply to Jacob Lifshay from comment #7)
> > (In reply to Luke Kenneth Casson Leighton from comment #3)
> > > i am inclined to seriously recommend at least two separate OMI interfaces
> > > (with the option to use them in parallel), one to be connected to a video
> > > DRAM chip, the other to be connected to a "OS / GPU" memory DRAM chip,
> > > with the option to say "i don't care about that, just use one or both for
> > > general shared memory"
> > 
> > I strongly disagree with the idea of partitioning memory interfaces between
> > GPU and CPU tasks,
> 
> ah that is a misunderstanding, to clarify:
> 
> > > arallel), one to be connected to a video
> > > DRAM chip,
> 
> this for the video framebuffer and the video framebuffer only.

That doesn't really work, since the GPU will need lots of memory bandwidth into the framebuffer since that's where it will be rendering to, potentially drawing over the same pixels several dozen times. To support that, the memory bandwidth of both the framebuffer and everything else needs to be spread across all available memory interfaces.
Comment 14 Luke Kenneth Casson Leighton 2020-06-12 22:51:37 BST
(In reply to Jacob Lifshay from comment #13)

> > this for the video framebuffer and the video framebuffer only.
> 
> That doesn't really work, since the GPU will need lots of memory bandwidth
> into the framebuffer since that's where it will be rendering to, potentially
> drawing over the same pixels several dozen times. To support that, the
> memory bandwidth of both the framebuffer and everything else needs to be
> spread across all available memory interfaces.

ok let's think it through, internally.  would we have:

* two separate memory interfaces, each dedicated to different address ranges
* one L2 (L3?) cache, through which *both* memory interfaces have to go, first
  (note: the CPU/GPU as a Wishbone Slave, the RGBTTL HDL as a Master)

or would we have:

* two separate memory interfaces, each dedicated to different address ranges
* one L2 (L3?) cache, through which the CPU/GPU would read/write
  (note: as a Wishbone Slave)
* the RGBTTL Wishbone Master would *bypass* the L2 cache and go directly
  out to memory

if the L2 cache was write-through, then effectively i believe those are
the exact same thing?  the advantage of the 2nd being that it cannot
interfere with or be distracted by any page faults or misses in the L2 cache.

(the RGBTTL interface *has* to - without fail - be able to feed its
SRAM buffer and meet its scanline timing - without fail.  hence why
it is a Wishbone Master)
Comment 15 Luke Kenneth Casson Leighton 2020-06-12 22:52:32 BST
Yehowshua writes, "Some notes from my conversation with Michael":

>> from the ecp5 note on the serdes
>>   - "3.2 Gbps operation with low 85
>>  mW power per channel”
>
> Ok
> hmm
> channel as in 1rx or 1tx?
>
>> 1 tx/rx pair
>
>
> so 8 pairs at 3.2Gbps is about a watt
> Can't imagine for 25Gbps
>
>> Oh, and ecp5 is a 40nm chip
>> so it should be comparable to ours
>
>
> that’s what I’m afraid of
>
>
>> they might be targeting a different subprocess
>> than us though. I think most fabs offer the
>> options of like "low power", "high performance”,
>>  or "high density"
>
>> and I suspect ours will need to be less flexible than the ecp5's so that might save us some power
Comment 16 Luke Kenneth Casson Leighton 2020-06-12 23:00:26 BST
(In reply to Luke Kenneth Casson Leighton from comment #15)

> > so 8 pairs at 3.2Gbps is about a watt
> > Can't imagine for 25Gbps

if we assume it's linear: 8x.  8 watts.

that's if it's linear.  (if it's a square law, it's 64 watts.
i don't believe that to be the case).

chances are that it's linear, because the power consumption is
proportional to the amount of change that the signal has to be
dragged up and down.

consequently, we multiply the current draw by 8.

if however the speed of pullup/pulldown has to be *faster* than
it is carried out at 3.2ghz, then we're in trouble.  it might
not be a square-law exactly but it would be close.
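
making the scaling assumption explicit (a sketch only; real numbers have to come
from the PHY vendor, and the 85mW figure is per tx/rx pair, from the ECP5 note
quoted in comment #15):

    >>> watts_at_3g2 = 8 * 0.085       # 8 pairs x 85mW = ~0.7W, "about a watt"
    >>> scale = 25.0 / 3.2             # ~7.8x the bitrate, call it 8x
    >>> watts_at_3g2 * scale           # linear in bitrate
    5.3125
    >>> watts_at_3g2 * scale**2        # square-law worst case
    41.50390625

so the linear assumption puts us somewhere around 5 to 8 watts for the SERDES alone.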


> > they might be targeting a different subprocess
> > than us though. I think most fabs offer the
> > options of like "low power", "high performance”,
> >  or "high density"

this will be a dedicated analog block, extremely specialist,
with a full custom layout.

only when you have a digital layout do you have the option
of telling the yosys "synth" command:

* "please can you optimise for lower latency" (i don't care about power)

or

* "please can you optimise for gate reduction" (i don't care about latency)

these give you big and LITTLE respectively... but it's *only* something
you can do for digital layout.
Comment 17 Luke Kenneth Casson Leighton 2020-06-12 23:05:51 BST
(In reply to Jacob Lifshay from comment #12)
> Assuming we're building a higher than 2W version, I think we should double
> the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit
> and add more cores to 8 or 16 or more cores.

at 2.0 ghz this would start to put out some serious numbers :)
i'll do a walk-through separately so we are eyes-open on the layout and
wire count.

> The div and transcendental
> pipelines wouldn't probably need to be expanded since they would probably
> still have enough throughput.

the number of Function Units front-ends (aka "Reservation Stations") is
something we have to keep an eye on.  above 30 FUs is pretty serious
size territory for the Dependency Matrices.

> Also, it might be worthwhile to add one more
> core and disable one at manufacture time as a way to increase yield.

that's a good idea.  have to bear in mind that current leakage occurs
regardless of whether the silicon is in use or not.

> If possible, I think we should implement a cache-coherent memory interface
> over the PCIe bus so we can just have the host's linux kernel schedule
> processes as usual between the host POWER9 processor and the GPU, acting
> like the GPU is just more processor cores to schedule. That will allow nice
> things like using the GPU as extra CPU cores to help accelerate compile or
> other tasks and being able to use Kazan without needing lots of in-kernel
> GPU queues and stuff.

oo interesting.
Comment 18 Jacob Lifshay 2020-06-12 23:19:51 BST
(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Jacob Lifshay from comment #13)
> 
> > > this for the video framebuffer and the video framebuffer only.
> > 
> > That doesn't really work, since the GPU will need lots of memory bandwidth
> > into the framebuffer since that's where it will be rendering to, potentially
> > drawing over the same pixels several dozen times. To support that, the
> > memory bandwidth of both the framebuffer and everything else needs to be
> > spread across all available memory interfaces.
> 
> ok let's think it through, internally.  would we have:
> 
> * two separate memory interfaces, each dedicated to different address ranges
> * one L2 (L3?) cache, through which *both* memory interfaces have to go,
> first
>   (note: the CPU/GPU as a Wishbone Slave, the RGBTTL HDL as a Master)

I think what would work the best is for the RGBTTL HDL and every core to be an (extended) wishbone master to the L2 (L3?) cache, where the cache logic is the arbiter and is designed to give the RGBTTL HDL highest priority and everything else round-robin (or other) priority. Saving power on scan-out when the data is already in cache seems like a good idea; also, the memory interfaces would be the tightest bottleneck, so why require more accesses to go through them when that can be avoided?
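
as a selection-logic sketch of that policy (plain python, not HDL; the function
and signal names are made up purely for illustration):

    # master 0 is the RGBTTL scan-out engine and always wins;
    # the remaining masters (the cores) share round-robin.
    def select_master(requests, last_core):
        """requests: list of bools, index 0 = RGBTTL, 1.. = cores.
           last_core: index of the core most recently granted (start at 1)."""
        if requests[0]:
            return 0, last_core                      # scan-out: absolute priority
        n = len(requests)
        for i in range(1, n):
            candidate = (last_core - 1 + i) % (n - 1) + 1   # rotate through cores
            if requests[candidate]:
                return candidate, candidate
        return None, last_core                       # nobody requesting

whether the cores should really be masters at all is a separate question (see
comment #22).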

The memory interfaces would be organized into one larger super-interface where each memory interface would be responsible for odd or even cache-block-sized address blocks. The idea is that accessing something laid out contiguously in physical address space would approximate balancing evenly across both memory interfaces.
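
the odd/even split, as a one-liner sketch (the 128-byte block size is an
assumption for illustration, not something decided here):

    CACHE_BLOCK = 128    # bytes (assumed)

    def memory_interface(phys_addr):
        """0 = first memory interface, 1 = second: alternate per cache block"""
        return (phys_addr // CACHE_BLOCK) & 1

contiguous physical buffers then naturally alternate between the two interfaces.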
Comment 19 Jacob Lifshay 2020-06-12 23:32:48 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> (In reply to Jacob Lifshay from comment #12)
> > Assuming we're building a higher than 2W version, I think we should double
> > the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit
> > and add more cores to 8 or 16 or more cores.
> 
> at 2.0 ghz this would start to put out some serious numbers :)

I got 1TFLOP/s for 16 fma/clock/core with 16 cores, that's more than half the PS4's performance.
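
checking that (counting an FMA as 2 FLOPs, and using the 2.0 ghz from comment #17):

    >>> 16 * 2 * 16 * 2.0e9 / 1e12   # fma/clock/core * flops/fma * cores * clock
    1.024

i.e. just over 1 TFLOP/s.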

> > Also, it might be worthwhile to add one more
> > core and disable one at manufacture time as a way to increase yield.
> 
> that's a good idea.  have to bear in mind that current leakage occurs
> regardless of whether the silicon is in use or not.

If we set up each core as a different power domain (which will also help with idle power), the disabled core could be powered down. We would probably want each core to be its own clock domain.

We should probably also widen the instruction decoders to decode 3 or 4 32-bit instructions per cycle.
Comment 20 Luke Kenneth Casson Leighton 2020-06-12 23:55:18 BST
(In reply to Jacob Lifshay from comment #12)
> Assuming we're building a higher than 2W version, I think we should double
> the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit
> and add more cores to 8 or 16 or more cores.

ok.  so expanding the number of cores is easier (less hassle) than expanding
the number of Function Units (which directly correlates with the size of the
Dependency Matrices).  so there is that in its favour.

the problem is that if you significantly increase the number of cores, SMP
coherency meets a point of diminishing returns unless you start doing very
advanced L1/L2 cache design and interconnect.  so we had better take that
into consideration and budget and plan for it, appropriately.

OpenPITON, for example, uses a full "L1, L1.5, L2" write-through cache
strategy, which, although it scales massively, might not provide good enough
localised performance.


basically if you double to 8x 32-bit per core, that means 4x 64-bit LD/ST
operations per clock cycle, because we would have 64-bit 2xFP32 SIMD units
(dynamic partitioning if we can do it, otherwise literally just 2x FP32
units).

the L0CacheBuffer will take care of amalgamating those into 128-bit cache-line 
requests (cache line width to be evaluated, below).

4x 64-bit LD/ST operations, we will have out-of-order requests for at least
one pair of LDs and STs "in flight", simultaneously, therefore we need to
be able to hold *both* in the Reservation Stations, and preferably the
next set of LDs (at least), as well.

that's *twelve* LD/ST Computation Units.

that's *just* the LD/ST Computation Units - it does not include the
FP Computation Units as well.  4x 64-bit SIMD operations being processed,
let us assume those are FMACs, and assume a 4-stage pipeline, we would
need say 8 Reservation Stations because we want 4 in operation and another
4 "in-flight" to be held (so that we do not get a stall after processing
the previous ST).

- 4x LD                        in RS, being processed
-   4x 64-bit SIMD FP32        in RS, waiting
-     4x ST                    in RS, waiting
-       4x LD                  in RS, waiting
-          4x 64-bit SIMD FP32 in RS, waiting
-             4x ST            stall (only 12 LD/ST RS's)

the next clock cycle, after the 1st LDs are complete:

- 4x LD                        done
-   4x 64-bit SIMD FP32        in RS, waiting
-     4x ST                    in RS, waiting
-       4x LD                  in RS, waiting
-          4x 64-bit SIMD FP32 in RS, waiting
-             4x ST            now in RS, waiting (1st 4 LD/ST RS's were free)

so that's 20 Reservation Stations: 12 for LD/ST, 8 for FP.

let's go back to the 12x LD/ST operations.  each would have 2x PortInterfaces:
one for aligned, one for misaligned.  the L0CacheBuffer would therefore have
12 rows, 2 side-by-side odd/even addresses, receiving 24x PortInterfaces in
total.

each PortInterface would be around 160 wires wide (64 data, 64 addr, control):
that's 3840 wires into the L0CacheBuffer.

this is one hell of a lot of wires going into one small piece of silicon.
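
as a quick tally (using the 160-wire estimate above: 64 data + 64 addr + control):

    >>> 12 * 2          # 12 LD/ST units x (aligned + misaligned) PortInterfaces
    24
    >>> 12 * 2 * 160    # wires into the L0CacheBuffer
    3840
    >>> 48 * 2 * 160    # and at 48 LD/ST RSes (see comment #21)
    15360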

now let's see if 2x 128-bit L1 Caches are ok.

* 4x 64-bit requests will result in 2x 128-bit cache line requests.
* with 2x 128-bit Caches (one odd, one even), we are *just* ok

this as long as those requests can be made in a single cycle.  this
does however mean that we need 4x 64-bit Wishbone Buses down to the
L2 Cache.

it also means that, because of the pass-through nature of GPU workloads
(data in, process, data out), we might need to sustain those 4x 64-bit
pathways *right* the way through to memory.

actually... no.  this might not be enough.  or, if it is, it's barely
enough.

we need the in-flight requests to be in-flight because they constitute
"advance notice" to the L1 and L2 caches.  with the LDs and STs being
matched pairs, we *might* need just the one more LD set (20 LD/ST
RSes) and 1 more FP set (12 FP RSes) so as to be able to do this:

- 4x LD                              in RS, being processed
-   4x 64-bit SIMD FP32              in RS, waiting
-     4x ST                          in RS, waiting
-       4x LD                        in RS, waiting
-         4x 64-bit SIMD FP32        in RS, waiting
-           4x ST                    in RS, waiting
-             4x LD                  in RS, waiting
-               4x 64-bit SIMD FP32  in RS, waiting
-                 4x ST              stall (only 20 LD/ST RS's)

this would give us *advance notice* of the *next* set of 4x LD/ST,
which would do:

* Effective Address Computation (no problem)
* then the next phase push through to the L0CacheBuffer and
* pass through the requests to the L1 Cache and TLB and
* initiate a L2 Cache and L2 TLB lookup

whilst the request to memory might take a while to complete, at least
it would be outstanding, the reservation of the L1 Cache Line and
the L2 cache line would be made.

and for that to work, i think that 2x 128-bit cache lines isn't going
to cut it: we'd need 4.

so i think the L0CacheBuffer would need to do 2-bit striping:

* bits 4 and 5 == 0b00  Bank 0, L1 Cache number #1
* bits 4 and 5 == 0b01  Bank 1, L1 Cache number #2
* bits 4 and 5 == 0b10  Bank 2, L1 Cache number #3
* bits 4 and 5 == 0b11  Bank 3, L1 Cache number #4
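
expressed as a sketch (bits [3:0] are the byte offset within a 128-bit / 16-byte
line, bits [5:4] pick the bank):

    def l1_bank(addr):
        """2-bit striping across four 128-bit-wide L1 caches"""
        return (addr >> 4) & 0b11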

this is basically monstrous.

for 16x 32-bit FP, i would not recommend going to 3-bit striping (8 L1
caches), i'd recommend expanding 4x L1 caches to 256-bit wide cache lines, 
instead.

we would also need 48 LD/ST Reservation Stations, and 20 FP RSes.

that's starting to get a little scary.  we could conceivably split
the Regfiles into odd-even numbering so that this number could be
halved.

whilst there would still be 48 LD/ST RSes and 20 FP RSes in total,
there would be 2 separate (square) Dependency Matrices @ (24+20) = 44 wide.

honestly i would not recommend trying it, not without a much larger
budget.

even 8x FP32 is a little scary.  those 40 RSes have to be joined by
Branch FUs, Condition Register FUs, predication FUs and so on.

which is why i said, it's probably much simpler to double the number
of SMP cores, instead.
Comment 21 Luke Kenneth Casson Leighton 2020-06-13 00:02:06 BST
(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #17)
> > (In reply to Jacob Lifshay from comment #12)
> > > Assuming we're building a higher than 2W version, I think we should double
> > > the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit
> > > and add more cores to 8 or 16 or more cores.
> > 
> > at 2.0 ghz this would start to put out some serious numbers :)
> 
> I got 1TFLOP/s for 16 fma/clock/core with 16 cores, that's more than half
> the PS4's performance.

those are deeply impressive numbers :)  the cross-over post explains however
that the number of RSes - and the data cache line widths etc. - are starting
to get rather scary.

48 LD/ST Reservation Stations would mean 96 160-bit PortInterfaces,
which is 15,360 wires coming into a single block!

we would be much better off going to 128-bit SIMD, which would have the
unfortunate side-effect of cutting scalar performance by another factor
of 2 in speed if there were register usage that could not be scheduled
correctly.


> > > Also, it might be worthwhile to add one more
> > > core and disable one at manufacture time as a way to increase yield.
> > 
> > that's a good idea.  have to bear in mind that current leakage occurs
> > regardless of whether the silicon is in use or not.
> 
> If we set up each core as a different power domain (which will also help
> with idle power), the disabled core could be powered-down.

(you still get current leakage even when powered down, is my point)


> We should probably also widen the instruction decoders to decode 3 or 4
> 32-bit instructions per cycle.

yes.  this will almost certainly be necessary, because we would have far more
instructions to keep the monstrous number of RSes "fed".
Comment 22 Luke Kenneth Casson Leighton 2020-06-13 00:11:31 BST
(In reply to Jacob Lifshay from comment #18)
> (In reply to Luke Kenneth Casson Leighton from comment #14)
> > (In reply to Jacob Lifshay from comment #13)
> > 
> > > > this for the video framebuffer and the video framebuffer only.
> > > 
> > > That doesn't really work, since the GPU will need lots of memory bandwidth
> > > into the framebuffer since that's where it will be rendering to, potentially
> > > drawing over the same pixels several dozen times. To support that, the
> > > memory bandwidth of both the framebuffer and everything else needs to be
> > > spread across all available memory interfaces.
> > 
> > ok let's think it through, internally.  would we have:
> > 
> > * two separate memory interfaces, each dedicated to different address ranges
> > * one L2 (L3?) cache, through which *both* memory interfaces have to go,
> > first
> >   (note: the CPU/GPU as a Wishbone Slave, the RGBTTL HDL as a Master)
> 
> I think what would work the best is for the RGBTTL HDL and every core to be
> a (extended) wishbone master to the L2 (L3?) cache,

no: really.  the cores *have* to take a back seat.  it seems odd that
they have such a highly important role yet are actually "slaves"; however,
the fact is that in a Shared Memory Architecture, I/O absolutely
cannot be given anything but absolute top priority.

this is just normal practice: I/O *has* to have guaranteed timing:
interrupts cannot go unserviced, buffers are extremely small, and
consequently I/O has to have absolute top priority.

look at the Shakti E-Class HDL, you'll see that the SMP Cores are given
AXI4 "slave" status on the internal bus architecture.


> where the cache logic is
> the arbiter and is designed to give the RGBTTL HDL highest priority and
> everything else round robin (or other) priority. Saving power on scan-out
> when the data is already in cache seems like a good idea, also, the memory
> interfaces would be the tightest bottleneck, why require more accesses to go
> through them when that can be avoided?

yes.  ideally (following the scenario through) the RGB/TTL HDL Master would 
have a completely separate memory bus entirely dedicated to it, reflecting
the fact that it has a completely separate DRAM chip.

however this only starts to be justified if there are say two or three
4k LCDs connected up, where the bandwidth of one 3200 Mbyte/sec interface
would pretty much be soaked up entirely by framebuffers.
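
putting numbers on that, with the same 4-bytes-per-pixel assumption as comment #3:

    >>> 3840*2160*4*60/1e6      # one 4k 60fps framebuffer, mbytes/sec
    1990.656
    >>> 2*3840*2160*4*60/1e6    # two of them
    3981.312

so two uncompressed 4k60 scan-outs already exceed a single 3200 Mbyte/sec
interface on their own.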


> The memory interfaces would be organized into one larger super-interface
> where each memory interface would be responsible for odd or even
> cache-block-sized address blocks. The idea is that accessing something laid
> out contiguously in physical address space would approximate balancing
> evenly across both memory interfaces.

yes.  if going the shared route, this seems eminently sensible.
Comment 23 Jacob Lifshay 2020-06-13 00:14:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #21)
> (In reply to Jacob Lifshay from comment #19)
> > > > Also, it might be worthwhile to add one more
> > > > core and disable one at manufacture time as a way to increase yield.
> > > 
> > > that's a good idea.  have to bear in mind that current leakage occurs
> > > regardless of whether the silicon is in use or not.
> > 
> > If we set up each core as a different power domain (which will also help
> > with idle power), the disabled core could be powered-down.
> 
> (you still get current leakage even when powered down, is my point)

I meant something like having a mosfet in the power line to the whole core, so the entire power supply could be turned off. Assuming that mosfet wasn't garbage, the power usage for the whole core could be reduced to the microwatt level.

> > We should probably also widen the instruction decoders to decode 3 or 4
> > 32-bit instructions per cycle.
> 
> yes.  this will almost certainly be necessary, because we would have far more
> instructions to keep the monstrous number of RSes "fed".

We would also want to build a better branch predictor (TAGE?) since we would have a higher power/area budget.

Both of those changes would potentially also drastically improve scalar performance, approaching that of some modern desktop processors at equivalent clock speeds.
Comment 24 Luke Kenneth Casson Leighton 2020-06-13 01:20:59 BST
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #21)
> > (In reply to Jacob Lifshay from comment #19)
> > > > > Also, it might be worthwhile to add one more
> > > > > core and disable one at manufacture time as a way to increase yield.
> > > > 
> > > > that's a good idea.  have to bear in mind that current leakage occurs
> > > > regardless of whether the silicon is in use or not.
> > > 
> > > If we set up each core as a different power domain (which will also help
> > > with idle power), the disabled core could be powered-down.
> > 
> > (you still get current leakage even when powered down, is my point)
> 
> I meant something like having a mosfet in the power line to the whole core,
> so the entire power supply could be turned off. Assuming that mosfet wasn't
> garbage, the power usage for the whole core could be reduced to the
> microwatt level.

apologies: you're still not getting it.  even the *existence* of the gates, even when fully powered down, with zero power connected in any way, shape or form, *still* causes not insignificant current leakage.

this shows up particularly badly in the design of GSM ASICs.  thousands of correlators are required to get an initial lock within the "seconds" expected by users... and then they are "powered down" exactly as you suggest.

unfortunately even their very existence, even powered down, causes current leakage, adversely affecting power consumption and draining battery life.

the solution had to involve "A-GPS", where the LAT/LONG (or even just one of those) was obtained from the nearest celltower.  this reduced the correlator search space to the point where a decent DSP could handle the job instead.


> > > We should probably also widen the instruction decoders to decode 3 or 4
> > > 32-bit instructions per cycle.
> > 
> > yes.  this will almost certainly be necessary, because we would have far more
> > instructions to keep the monstrous number of RSes "fed".
> 
> We would also want to build a better branch predictor (TAGE?) since we would
> have a higher power/area budget.

yes, agreed.  cancelling 40 Reservation Stations on a regular basis is not an amusing prospect.

> Both of those changes would potentially also drastically improve scalar
> performance, approaching that of some modern desktop processors at
> equivalent clock speeds.

one very nice thing about SimpleV is that the VL (Vector Length) can be thrown into the branch tag XOR/hash/thing.

this would give a different prediction at the end of a loop, right just when you need different decision-making about which way a branch is likely to go.
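
a hypothetical sketch of that (the table size and shift amounts are invented
purely for illustration):

    BHT_BITS = 12    # 4096-entry branch history table (made-up size)

    def predictor_index(pc, global_hist, vl):
        """fold VL into the index so that the end-of-loop (different-VL)
           iteration gets its own predictor entry"""
        return ((pc >> 2) ^ global_hist ^ (vl << 4)) & ((1 << BHT_BITS) - 1)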
Comment 25 Luke Kenneth Casson Leighton 2020-06-13 04:37:32 BST
i took a look at the OpenCAPI PDF on the proposed interfaces.  IBM put OpenCAPI 2 on top of 8x PCIe 4.0.

the later slides, blacked out, show that POWER7 was in 40nm.

later, by the time they move to 25 GBit/s, it is for POWER9, which is in 14nm.

this, in conjunction with the power requirements, basically tells us that there is an unachievable mismatch between 25.6 Gbit/s SERDES and a 40/45nm target.

if we were instead to target 14nm or below, we could achieve a 25 GBit/s SERDES PHY.

this however would require that, as our first major production chip, we seek a minimum of USD 20 million funding.

if however we stick to PCIe @ 3200 mhz, this *is* achievable in 40/45nm.  with sufficient lanes it becomes possible to reach the bandwidth required, and to not expect investors to take a huge risk on an unproven team.
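
rough lane-count arithmetic (raw bitrate only, encoding/protocol overhead ignored):

    >>> fb_gbit = 1920*1080*4*60*8/1e9    # 1080p60 framebuffer, from comment #3
    >>> fb_gbit
    3.981312
    >>> fb_gbit / 3.2                     # fraction of one 3.2 gbit/s lane
    1.24416

so two 3200 mhz lanes cover the framebuffer alone, before any OS/GPU traffic or
encoding overhead.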


alternatively, we stick with plain DDR3/4 because it is also achievable, but expensive and time-consuming (SymbioticEDA: 8 to 12 months full-time work on a PHY, USD 600,000).

alternatively, we use multiple HyperRAM interfaces and overclock the protocol, connecting to a suitable FPGA that either has a DDR3/4 interface *or* suitable SERDES sufficient to do OpenCAPI.  the ECP5G springs to mind although its DRAM interface is only capable of 200mhz (400mhz DDR).

the advantage of a 3200 mhz SERDES is that it is useful for a large number of things that rely on a PCIe PHY.  OpenCAPI, HMC, and PCIe itself.
Comment 26 Luke Kenneth Casson Leighton 2020-06-14 18:27:34 BST
(In reply to Jacob Lifshay from comment #23)

> I meant something like having a mosfet in the power line to the whole core,
> so the entire power supply could be turned off. Assuming that mosfet wasn't
> garbage, the power usage for the whole core could be reduced to the
> microwatt level.

yep, as you can see from the list reply, i've caught up now.  with zero
power being applied between the VDD and VSS plane, the expectation would
be that there would be no current to actually leak.

this is the point at which my knowledge is lacking, and we'd have to ask
someone.  *nominally*, my understanding is that even powered down there
is still leakage, although because i did not ask that specific question
of the people who advised me, i can't confirm it.
Comment 27 Staf Verhaegen 2020-06-15 17:10:13 BST
> this is the point at which my knowledge is lacking, and we'd have to ask
> someone.  *nominally*, my understanding is that even powered down there
> is still leakage, although because i did not ask that specific question
> of the people who advised me, i can't confirm it.

No, if powered down there is nothing to leak...
What will remain is the leakage of the power gate itself, which should be negligible as that's the actual function of the power gate.
Comment 28 Luke Kenneth Casson Leighton 2020-06-15 18:46:16 BST
(In reply to Staf Verhaegen from comment #27)
> > this is the point at which my knowledge is lacking, and we'd have to ask
> > someone.  *nominally*, my understanding is that even powered down there
> > is still leakage, although because i did not ask that specific question
> > of the people who advised me, i can't confirm it.
> 
> No, if powered down there is nothing to leak...
> What will remain is the leakage of the power gate itself, which should be
> negligible as that's the actual function of the power gate.

ah brilliant, thank you for your input, here, staf.

l.