Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
Posted in: CPUs, AMD, Zen 3, X570, Ryzen 5000, Ryzen 9 5950X, SMT, Multi-Threading
CPU Performance
For simplicity, we are listing the percentage performance differentials in all of our CPU testing: the number shown is the performance with SMT2 enabled as a percentage of the performance with the setting disabled, so 100% means no change. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
Here are the single threaded results.
Single Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
y-Cruncher | 100% | 99.5% |
Dwarf Fortress | 100% | 99.9% |
Dolphin 5.0 | 100% | 99.1% |
CineBench R20 | 100% | 99.7% |
Web Tests | 100% | 99.1% |
GeekBench (4+5) | 100% | 100.8% |
SPEC2006 | 100% | 101.2% |
SPEC2017 | 100% | 99.2% |
Interestingly enough, our single-threaded performance was within roughly one percentage point across the stack (SPEC2006 showing the largest difference at +1.2%). Given that running with SMT disabled should arguably give more resources to each thread, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all of the core's resources, even in SMT mode, is quite good.
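The percentages in the table above can be read as a simple ratio against the SMT Off baseline. The following is a minimal Python sketch of that reading, assuming a higher-is-better score (for a run-time measurement the ratio would be inverted). The exact normalization used by the benchmark suite is not spelled out here, and the names `relative_performance`, `ST_RESULTS`, and `max_deviation` are illustrative only.

```python
def relative_performance(smt_off_score, smt_on_score):
    """Return SMT On performance as a percentage of the SMT Off baseline.

    Assumption: higher scores are better. For time-based results,
    swap the arguments so that a shorter run-time gives a value above 100.
    """
    return 100.0 * smt_on_score / smt_off_score


# SMT On results (% of SMT Off baseline), copied from the single-threaded table.
ST_RESULTS = {
    "y-Cruncher": 99.5,
    "Dwarf Fortress": 99.9,
    "Dolphin 5.0": 99.1,
    "CineBench R20": 99.7,
    "Web Tests": 99.1,
    "GeekBench (4+5)": 100.8,
    "SPEC2006": 101.2,
    "SPEC2017": 99.2,
}


def max_deviation(results):
    """Largest absolute distance from the 100% baseline, in percentage points."""
    return max(abs(value - 100.0) for value in results.values())


if __name__ == "__main__":
    print(f"Largest single-threaded deviation: {max_deviation(ST_RESULTS):.1f} points")
```

Run against the table, the largest deviation belongs to SPEC2006, which matches the observation that every single-threaded result sits close to the baseline.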
The multithreaded tests are a bit more diverse:
Multi-Threaded Tests: AMD Ryzen 9 5950X
AnandTech | SMT Off Baseline | SMT On |
Agisoft Photoscan | 100% | 98.2% |
3D Particle Movement | 100% | 165.7% |
3DPM with AVX2 | 100% | 177.5% |
y-Cruncher | 100% | 94.5% |
NAMD AVX2 | 100% | 106.6% |
AIBench | 100% | 88.2% |
Blender | 100% | 125.1% |
Corona | 100% | 145.5% |
POV-Ray | 100% | 115.4% |
V-Ray | 100% | 126.0% |
CineBench R20 | 100% | 118.6% |
HandBrake 4K HEVC | 100% | 107.9% |
7-Zip Combined | 100% | 133.9% |
AES Crypto | 100% | 104.9% |
WinRAR | 100% | 111.9% |
GeekBench (4+5) | 100% | 109.3% |
Here we have a number of different factors affecting the results.
Starting with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, so the memory bandwidth available per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a math synthetic benchmark, and AIBench is still an early-beta set of AI workloads for Windows, so both are quite far away from real-world use cases.
Most of the rest of the benchmarks show a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that prevent both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, and so on; each benchmark is likely different.
The outliers are 3DPM, 3DPM with AVX2, and Corona. These three gain 45% or more, with both 3DPM variants gaining 66% or more. All of these tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to a greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
Overall
In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.
In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.
For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.
Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
Most real-world workloads see a modest uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
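The multi-threaded table can be summarized in a few lines of Python. This is a sketch rather than the method behind the figures quoted above: the choice of statistics (arithmetic mean, geometric mean, a list of regressions) and the names `MT_RESULTS` and `summarize` are assumptions made for illustration. A plain arithmetic mean over all 16 rows lands a little under the 22% quoted above, so that figure was presumably computed over a different subset or with a different weighting, which is not stated.

```python
from statistics import geometric_mean, mean

# SMT On results (% of SMT Off baseline), copied from the multi-threaded table.
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}


def summarize(results):
    """Summarize a {test name: % of baseline} mapping."""
    values = list(results.values())
    return {
        # Simple average of the percentages.
        "arithmetic_mean": mean(values),
        # Geometric mean, often preferred when averaging ratios.
        "geometric_mean": geometric_mean(values),
        # Tests that ran slower with SMT enabled (below the 100% baseline).
        "regressions": sorted(n for n, v in results.items() if v < 100.0),
        # Test with the largest gain from SMT.
        "best": max(results, key=results.get),
    }


if __name__ == "__main__":
    summary = summarize(MT_RESULTS)
    print(f"Arithmetic mean: {summary['arithmetic_mean']:.1f}%")
    print(f"Geometric mean:  {summary['geometric_mean']:.1f}%")
    print(f"Regressions:     {', '.join(summary['regressions'])}")
    print(f"Largest gain:    {summary['best']}")
```

The geometric mean is included because the inputs are ratios; it is less sensitive than the arithmetic mean to the two 3DPM outliers.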
126 Comments
Bomiman - Saturday, December 5, 2020
That common knowledge is a few years old now. It was once common knowledge that games only used one thread.
Consoles now have 3 times as many threads as before, and that's in a situation where 4T CPUs are barely usable and 4C/8T CPUs are obsolete.
MrPotatoeHead - Tuesday, December 15, 2020
Xbox360 came out in 2005. 3C/6T. Even the PS3 had a 1C/2T PowerPC PPE and 6 SPEs, so a total of 8T. PS4/XO is 8C/8T. Though I guess we could still blame the lack of CPU utilization on this last generation using pretty weak cores from the get-go. IIRC 8-core Jaguar would be on par with an Intel i3 at the time of these console releases.
Though, the only other option AMD had was Piledriver. Piledriver was still a poor performer and a power hog, and it would likely only have been worth it over 8 Jaguar cores if they went with a 3 or 4 module chip.
It is nice that this generation MS and Sony both went all out on the CPU. Just too bad they aren't Zen 3 based. :(
Dolda2000 - Friday, December 4, 2020
It should be kept in mind that, at the time when AMD criticized Intel for that, AMD had actual dual-cores (A64x2) and Intel still had single-cores with HT, which makes the criticism rather fair.
Xajel - Sunday, December 6, 2020
"Intel's HT approach proved superior".
Intel's approach wasn't that much superior. In fact, in the early days of Intel's HTT processors, many applications, even ones which were supposed to be optimised for the MC code path, were getting lower scores with HTT enabled than with HTT disabled.
The main culprit was that applications were designed to handle each thread on a real core, not two threads on a single core, so the threads were fighting for resources that weren't there.
Intel knew this and worked hard with developers to make them aware of the difference and apply this change to the code path. It actually took some time until multi-core applications were SMT aware and had a code path for this.
In AMD's case, AMD couldn't work as hard as Intel with developers to make them add a new code path just for AMD CPUs. Not to mention that Intel was playing dirty, starting with their famous compiler, which was (and still is) used by most developers to compile applications. The compiler will optimise the code for Intel's CPUs and have an optimised code path for every CPU and CPU feature Intel has, but when the application detects a non-Intel CPU, including AMD's, it will select the slowest code path and will not try to test for the feature and choose a code path.
This also applied to AMD's CPUs. Sure, the CPUs lacked FPU performance and were not competitive enough (even when the software was optimised), but the whole optimisation thing made AMD's CPUs inefficient. The idea should have worked better than Intel's, because there's actual real hardware there (at least for integer), but developers didn't work harder, and the Intel compiler also played a major role for smaller developers.
TL;DR: the main issue was the Intel compiler and the lack of developer interest, and the actual cores were also not that much stronger than Intel's (on the IPC side). AMD's idea should have worked, but things weren't on their side.
And by the time AMD came out with their design, they were already late; applications were already optimised for Intel HTT, which became very good as almost all applications became SMT aware. AMD acknowledged this and knew that they must take what developers already have and work with it. They also worked hard on their SMT implementation, to the point that it is now touted as better than Intel's own SMT implementation (HTT).
Keljian - Sunday, January 10, 2021
Urm no, Intel’s compiler isn’t used often these days unless you’re doing really heavy maths. Microsoft’s compiler is used much more often, though clang is taking off.
pogsnet - Tuesday, December 29, 2020
During the P4 era, HT made no difference in performance compared to AMD64, but on Core2Duo it showed better performance. Probably because we had only 2-4 cores, which was not enough for our multitasking needs. Now we have 4-32 cores, plus much more powerful and efficient cores, hence SMT may not be that significant anymore, which is why on most tests it shows no big performance lift.
willis936 - Thursday, December 3, 2020
5%? I think more than 5% is needed for a whole second set of registers plus the logic needed to properly handle context switching. Everything in between the cache and pipeline needs to be doubled.
tygrus - Thursday, December 3, 2020
Register rename means they already have more registers that don't need to be copied. The register renaming means they have more physical registers than logical registers exposed to the programmer. Say you have: 16 logical registers exposed to the coder per thread; 128 rename registers in HW; SMT 2 threads/core = same 16 logical, but each thread has 64 rename registers instead of 128.
Compare mixing the workloads, e.g. 8 int/branch heavy with 8 FP heavy on 8 cores; or OS background tasks like indexing/search/AntiVirus.
MrSpadge - Thursday, December 3, 2020
The 5% is from Intel for the original Pentium 4. At some point in the last 10 years I think I read a comparable number, probably here on AT, regarding a more modern chip.
Wilco1 - Friday, December 4, 2020
There is little accurate info about it, but the fact is that x86 cores are many times larger than Arm cores with similar performance, so it must be a lot more than 5%. Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area (and half the power).