Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
Posted in: CPUs, AMD, Zen 3, X570, Ryzen 5000, Ryzen 9 5950X, SMT, Multi-Threading
CPU Performance
For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
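As a rough illustration of how such a differential could be computed, here is a minimal sketch. The function name and the raw scores are hypothetical, and the article does not state how each individual test is normalized. The assumption is that a score where higher is better divides the SMT-on result by the SMT-off result, while a run-time where lower is better inverts the ratio.

```python
def smt_relative_percent(smt_off: float, smt_on: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT-on performance as a percentage of the SMT-off baseline.

    Hypothetical helper: assumes throughput-style scores by default and
    inverts the ratio for time-style results (lower is better).
    """
    if smt_off <= 0 or smt_on <= 0:
        raise ValueError("scores must be positive")
    ratio = smt_on / smt_off if higher_is_better else smt_off / smt_on
    return 100.0 * ratio


# Made-up raw numbers for illustration only, not taken from the article.
print(round(smt_relative_percent(8000.0, 9488.0), 1))   # score-based test
print(round(smt_relative_percent(120.0, 96.0, higher_is_better=False), 1))  # time-based test
```

With this convention, 100% means no change, values above 100% mean SMT helped, and values below 100% mean SMT hurt.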
Here are the single threaded results.
Single Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
y-Cruncher | 100% | 99.5% |
Dwarf Fortress | 100% | 99.9% |
Dolphin 5.0 | 100% | 99.1% |
CineBench R20 | 100% | 99.7% |
Web Tests | 100% | 99.1% |
GeekBench (4+5) | 100% | 100.8% |
SPEC2006 | 100% | 101.2% |
SPEC2017 | 100% | 99.2% |
Interestingly enough, our single threaded performance was within roughly a single percentage point across the stack (the largest deviation being SPEC2006 at +1.2%). Given that disabling SMT should arguably give each thread more resources, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources, even in SMT mode, is quite good.
The multithreaded tests are a bit more diverse:
Multi-Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
Agisoft Photoscan | 100% | 98.2% |
3D Particle Movement | 100% | 165.7% |
3DPM with AVX2 | 100% | 177.5% |
y-Cruncher | 100% | 94.5% |
NAMD AVX2 | 100% | 106.6% |
AIBench | 100% | 88.2% |
Blender | 100% | 125.1% |
Corona | 100% | 145.5% |
POV-Ray | 100% | 115.4% |
V-Ray | 100% | 126.0% |
CineBench R20 | 100% | 118.6% |
HandBrake 4K HEVC | 100% | 107.9% |
7-Zip Combined | 100% | 133.9% |
AES Crypto | 100% | 104.9% |
WinRAR | 100% | 111.9% |
GeekBench (4+5) | 100% | 109.3% |
Here we have a number of different factors affecting the results.
Starting with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound in some parts and compute-bound in others, so the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a synthetic math benchmark, and AIBench is still an early-beta set of AI workloads for Windows, so it is quite far away from real world use cases.
Most of the rest of the benchmarks show a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that stop both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, and so on. Each benchmark is likely different.
The outliers are 3DPM, 3DPM with AVX2, and Corona. All three gain 45% or more, with both 3DPM variants gaining over 65%. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
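The grouping described above can be sketched in a few lines. This is only an illustration: the percentages are copied from the multi-threaded table, but the cut-offs (below 100% as a loss, 145% and above as an outlier) and all names are assumptions loosely drawn from the text, not a method the article specifies.

```python
# SMT-on results from the multi-threaded table, as % of the SMT-off baseline.
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}

# Assumed cut-offs (not from the article): below 100% is a loss,
# 145% and above is an outlier, everything in between is a typical gain.
LOSS_BELOW = 100.0
OUTLIER_FROM = 145.0


def bucket_results(results):
    """Split benchmark names into loss / gain / outlier groups, in table order."""
    buckets = {"loss": [], "gain": [], "outlier": []}
    for name, pct in results.items():
        if pct < LOSS_BELOW:
            buckets["loss"].append(name)
        elif pct >= OUTLIER_FROM:
            buckets["outlier"].append(name)
        else:
            buckets["gain"].append(name)
    return buckets


for label, names in bucket_results(MT_RESULTS).items():
    print(f"{label}: {', '.join(names)}")
```

Note that a plain threshold at 100% also places Agisoft Photoscan (98.2%) in the loss group, which the text does not count as statistically worse. Run-to-run variance is not modeled in this sketch.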
Overall
In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.
In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.
For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.
Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
Most real-world workloads see a modest uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and on how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
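For readers who want to derive an average from the multi-threaded table themselves, the following is a minimal sketch. It assumes every row is weighted equally, and the function name is hypothetical. The article does not say which tests or which kind of mean feed its quoted average, so this is not necessarily how that figure was produced.

```python
from statistics import geometric_mean, mean

# SMT-on percentages from the multi-threaded table, in table order.
MT_PERCENT = [98.2, 165.7, 177.5, 94.5, 106.6, 88.2, 125.1, 145.5,
              115.4, 126.0, 118.6, 107.9, 133.9, 104.9, 111.9, 109.3]


def average_uplift(percentages):
    """Return (arithmetic, geometric) mean uplift in points above the 100% baseline."""
    return mean(percentages) - 100.0, geometric_mean(percentages) - 100.0


arith, geo = average_uplift(MT_PERCENT)
print(f"arithmetic mean uplift: {arith:+.1f}%")
print(f"geometric mean uplift:  {geo:+.1f}%")
```

Taken over all 16 rows with equal weight, the arithmetic mean lands near +20.6%, a little under the quoted 22%, so the article's figure presumably reflects a different selection or weighting of tests. The geometric mean is included because it is a common choice for averaging ratios and is less swayed by the 3DPM outliers.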
126 Comments
Holliday75 - Thursday, December 3, 2020 - link
As usage for modern users changes I wonder how this could be better tested/visualized.
I am not looking at a 5900x to run any advanced tools. I am looking to game, run multiple browsers with a few dozen tabs open, stream, download, run Plex (transcoding), security tools, VPN, and the million other applications a normal user would have running at any given point in time. While no two users will have the same workload at any given time, how could we quantify SMT versus no SMT for the average user?
In the not too distant future we could be seeing the average PC running 32 cores. I am talking your run of the mill office machine from Dell that costs $800. Or will we? Is there a point where it does not matter anymore?
realbabilu - Thursday, December 3, 2020 - link
Simple. For the average user, a 4 core 8th gen U series has more cores than the generation before. It has more strength, but it rarely gets 100 percent CPU utilization for the things a normal user is doing.
To get 8 threads or 4 cores working at 100 percent you need killer applications programmed by someone who knows how to extract every bit of juice from the processor, knows how to program multithreaded code, or uses an optimized math kernel library or optimized compiler switches, like FEM, rendering, or applied science math.
Other than those app, maybe you could expense it to gpu for gaming.
schujj07 - Thursday, December 3, 2020 - link
Or you just have multiple tabs open. I regularly hit 100% usage on my work i5-6400 with 4c/4t having 10-12 tabs open. It gets quite annoying as on a normal day I might need up to double that open at any given time. That means that 20 tabs would peg a 4c/8t CPU pretty easily.
evilpaul666 - Friday, December 4, 2020 - link
You need an ad blocker unless those tabs are all very busy doing something. I mean, it sounds like they're mining Monero for somebody else, I mean what they're *supposed* to be doing for you.
schujj07 - Friday, December 4, 2020 - link
I use an ad blocker and nothing is being mined. However, ads are an example of things that will destroy your performance in web browsing quite quickly and suck up a lot of CPU cycles. While right now 4c/8t is enough for an office machine, it will not be long before 6c/12t is the standard.
marrakech - Tuesday, December 15, 2020 - link
15 cores are the futureeeeee
Hulk - Thursday, December 3, 2020 - link
Wouldn't high SMT performance be an indication of bad software design rather than bad core design?
While SMT performance is changing in these tests the core is not. Only the software is changing. It seems as though an Intel CPU in this comparison would have provided additional insights to these questions.
BillyONeal - Thursday, December 3, 2020 - link
The situations that create high SMT performance are generally outside the software in question's control. For example, a program might have 1 thread that's doing all divides and another that's doing all multiplies. The threads that only have multiplies or divides aren't poorly designed; they just aren't using units on the chip that don't help their respective workloads.
There are also cache effects. If you have 2 threads working on data bigger than the CPU's caches, while one is waiting for that data to come back from memory the other can make unrelated progress, and vice versa, but the data being big isn't necessarily an indicator of poor software design. Some problem domains just have big data sets there's no way around.
WaltC - Thursday, December 3, 2020 - link
Exactly. Some software is written to utilize a lot of threads simultaneously, some is not. Running software that does not make use of a lot of simultaneous threads tells us really nothing much about SMT CPU hardware, imo, other than "this software doesn't support it very well."
Elstar - Thursday, December 3, 2020 - link
SMT24? Ha. Try SMT128: https://en.wikipedia.org/wiki/Cray_XMT#Threadstorm...