Currently, Verilator performance does not scale with the number of threads, because of its internal queue model and its heavy use of mutexes to lock the queue. As a result, simulation does not take advantage of CPUs with many cores.

After some initial profiling, I have found that most of the CPU time is spent in the internal queue:

    41.71%  microwatt-verilator  [.] VlMTaskVertex::waitUntilUpstreamDone
    32.97%  microwatt-verilator  [.] VlWorkerThread::dequeWork
     8.72%  microwatt-verilator  [.] VlMTaskVertex::signalUpstreamDone

Together these three functions account for about 83% of CPU time, all of it spent on synchronization between threads rather than on simulation. This is a huge waste of CPU time and definitely something that can be fixed.

I believe that replacing the internal queue with a lockless thread-safe queue will increase performance by at least an order of magnitude. I have done this in the past in very demanding realtime applications, and performance improved many times over.

The plan is to also submit this work upstream so that the Verilator project as a whole benefits. I believe that a budget of 7,000 to 10,000 EUR would suffice for this kind of work. It goes without saying that the work will be heavily tested before submission.
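
To make "lockless thread-safe queue" concrete, here is a minimal C++ sketch of one possible design: a bounded multi-producer/multi-consumer ring buffer that uses per-cell sequence numbers (a well-known approach usually attributed to Dmitry Vyukov) instead of a mutex. It is an illustration only, not Verilator code and not necessarily the design that would be submitted. The names BoundedMpmcQueue, tryPush, tryPop and demoSum are hypothetical, and the capacity is assumed to be a power of two.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical bounded MPMC queue (illustration only, not Verilator's API).
// Each cell carries a sequence number that tells producers and consumers
// whether it is their turn, so no mutex is taken on push or pop.
template <typename T>
class BoundedMpmcQueue {
public:
    explicit BoundedMpmcQueue(std::size_t capacity)
        : m_cells(new Cell[capacity]), m_mask(capacity - 1) {
        // Assumption of this sketch: capacity is a power of two and >= 2.
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
        for (std::size_t i = 0; i < capacity; ++i)
            m_cells[i].seq.store(i, std::memory_order_relaxed);
    }

    // Returns false if the queue is full.
    bool tryPush(const T& value) {
        std::size_t pos = m_tail.load(std::memory_order_relaxed);
        for (;;) {
            Cell& cell = m_cells[pos & m_mask];
            const std::size_t seq = cell.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos);
            if (diff == 0) {  // cell is free for this ticket: try to claim it
                if (m_tail.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    cell.value = value;
                    cell.seq.store(pos + 1, std::memory_order_release);  // publish
                    return true;
                }
            } else if (diff < 0) {
                return false;  // cell still holds an unconsumed item: full
            } else {
                pos = m_tail.load(std::memory_order_relaxed);  // lost a race, retry
            }
        }
    }

    // Returns false if the queue is empty.
    bool tryPop(T& out) {
        std::size_t pos = m_head.load(std::memory_order_relaxed);
        for (;;) {
            Cell& cell = m_cells[pos & m_mask];
            const std::size_t seq = cell.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos + 1);
            if (diff == 0) {  // cell holds a published item for this ticket
                if (m_head.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    out = cell.value;
                    // Mark the cell free for the producer one lap ahead.
                    cell.seq.store(pos + m_mask + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // nothing published here yet: empty
            } else {
                pos = m_head.load(std::memory_order_relaxed);  // lost a race, retry
            }
        }
    }

private:
    struct Cell {
        std::atomic<std::size_t> seq;
        T value;
    };
    std::unique_ptr<Cell[]> m_cells;
    const std::size_t m_mask;
    std::atomic<std::size_t> m_tail{0};
    std::atomic<std::size_t> m_head{0};
};

// Demo: each producer pushes 1..itemsPerProducer; consumers sum what they pop.
inline std::uint64_t demoSum(int producers, int consumers, int itemsPerProducer) {
    BoundedMpmcQueue<int> queue(1024);
    std::atomic<std::uint64_t> sum{0};
    std::atomic<int> remaining{producers * itemsPerProducer};
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p)
        threads.emplace_back([&] {
            for (int i = 1; i <= itemsPerProducer; ++i)
                while (!queue.tryPush(i)) std::this_thread::yield();
        });
    for (int c = 0; c < consumers; ++c)
        threads.emplace_back([&] {
            int v = 0;
            while (remaining.load(std::memory_order_acquire) > 0) {
                if (queue.tryPop(v)) {
                    sum.fetch_add(static_cast<std::uint64_t>(v), std::memory_order_relaxed);
                    remaining.fetch_sub(1, std::memory_order_acq_rel);
                } else {
                    std::this_thread::yield();
                }
            }
        });
    for (auto& t : threads) t.join();
    return sum.load();
}
```

Calling demoSum(4, 4, 1000) runs four producers and four consumers against one queue without any mutex. The sketch only shows the queue mechanics; how such a queue would map onto Verilator's task scheduling and dependency tracking is exactly what the proposed work would have to determine and benchmark.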