Pascal’s Architecture: What Follows Maxwell

With the launch of a new generation of GPUs we’ll start things off where we always do: the architecture.

Discrete GPUs occupy an interesting space when it comes to the relationship between architecture and manufacturing processes. Whereas CPUs have architecture and manufacturing process decoupled – leading to Intel’s aptly named (former) tick-tock design methodology – GPUs have aligned architectures with manufacturing processes, with a new architecture premiering alongside a new process. Or rather, GPU traditionally did. Maxwell threw a necessary spanner into all of this, and in its own way Pascal follows this break from tradition.

As the follow-up to their Kepler architecture, with Maxwell NVIDIA introduced a significantly altered architecture, one that broke a lot of assumptions Kepler earlier made and in the process vaulted NVIDIA far forward on energy efficiency. What made Maxwell especially important from a development perspective is that it came not on a new manufacturing process, but rather on the same 28nm process used for Kepler two years earlier, and this is something NVIDIA had never done before. With the 20nm planar process proving unsuitable for GPUs and only barely suitable for SoCs – the leakage from planar transistors this small was just too high – NVIDIA had to go forward with 28nm for another two years. It would come down to their architecture team to make the best of the situation and come up with a way to bring a generational increase in performance without the traditional process node shrink.

Now in 2016 we finally have new manufacturing nodes with the 14nm/16nm FinFET processes, giving GPU manufacturers a long-awaited (and much needed) opportunity to bring down power consumption and reduce chip size through improved manufacturing technology. The fact that it has taken an extra two years to get here, and what NVIDIA did in the interim with Maxwell, has opened up a lot of questions about what would follow for NVIDIA. The GPU development process is not so binary or straightforward that NVIDIA designed Maxwell solely because they were going to be stuck on the 28nm process – NVIDIA would have done Maxwell either way – but it certainly was good timing to have such a major architectural update fall when it did.

So how does NVIDIA follow-up on Maxwell then? The answer comes in Pascal, NVIDIA’s first architecture for the FinFET generation. Designed to be built on TSMC’s 16nm process, Pascal is the latest and the greatest, and like every architecture before it is intended to further push the envelope on GPU performance, and ultimately push the envelope on the true bottleneck for GPU performance, energy efficiency.

HPC vs. Consumer: Divergence

Pascal is an architecture that I’m not sure has any real parallel on a historical basis. And a big part of that is because to different groups within NVIDIA, Pascal means different things and brings different things, despite the shared architecture. On the one side is the consumer market, which is looking for a faster still successor to what Maxwell delivered in 2014 and 2015. Meanwhile on the high performance compute side, Pascal is the long-awaited update to the Kepler architecture (Maxwell never had an HPC part), combining the lessons of Maxwell with the specific needs of the HPC market.

The result is that there’s an interesting divergence going on between the HPC side and its GP100 GPU, and the consumer side and the GP104 GPU underlying GTX 1080. Even as far back as Fermi there was a distinct line separating HPC-class GPUs (GF100) from consumer/general compute GPUs (GF104), but with Pascal this divergence is wider than ever before. Ultimately the HPC market and GP100 is beyond the scope of this article and I’ll pick it up in detail another time, but because NVIDIA announced GP100 before GP104, it does require a bit of addressing to help sort out what’s going on and what NVIDA’s design goals were with GP104.

Pascal as an overarching architecture contains a number of new features, however not all of those features are present in all SKUs. If you were to draw a Venn diagram of Pascal, what you would find is that the largest collection of features are found in GP100, whereas GP104, like the previous Maxwell architecture before it, is stripped down for speed and efficiency. As a result while GP100 has some notable feature/design elements for HPC – things such faster FP64 & FP16 performance, ECC, and significantly greater amounts of shared memory and register file capacity per CUDA core – these elements aren’t present in GP104 (and presumably, future Pascal consumer-focused GPUs).

Ultimately what we’re seeing in this divergence is a greater level of customization between NVIDIA’s HPC and consumer markets. The HPC side of NVIDIA is finally growing up, and it’s growing fast. The long term plan at NVIDIA has been to push GPU technology beyond consumer and professional graphics, and while it has taken years longer than NVIDIA originally wanted, thanks in big part to success in the deep learning market, NVIDIA is finally achieving their goals.

This means that although GP100 is a fully graphics capable GPU, it doesn’t necessarily have to be put into video cards to make sense for NVIDIA to manufacture, and as a result NVIDIA can make it even more compute focused than prior-generation parts like GK110 and GF110. And that in turn means that although this divergence is driven by the needs of the HPC market – what features need to be added to make a GPU more suitable for HPC use cases – from the perspective of the consumer market there is a tendency to perceive that consumer parts are falling behind. Especially with how GP100 and GP104’s SMs are differently partitioned.

This is a subject I’ll revisit in much greater detail in the future when we focus on GP100. But for now, especially for the dozen of you who’ve emailed over the past month asking about why the two are so different, the short answer is that the market needs for HPC are different from graphics, and the difference in how GP100 and GP104 are partitioned reflect this. GP100 and GP104 are both unequivocally Pascal, but GP100 gets smaller SM partitions in order to increase the number of registers and the amount of shared memory available per CUDA core. Shared memory and register contention on graphics workloads isn’t nearly as great as with HPC tasks – pixel shader threads are relatively short and independent from each other – which means that while the increased ratios benefit HPC workloads, for graphics the gains would be minimal. And the costs to power and die space would, in turn, far outweigh any benefits.

The GTX 1080 & GTX 1070 Review GP104: The Heart of GTX 1080


View All Comments

  • Ryan Smith - Friday, July 22, 2016 - link

    2) I suspect the v-sync comparison is a 3 deep buffer at a very high framerate. Reply
  • lagittaja - Sunday, July 24, 2016 - link

    1) It is a big part of it. Remember how bad 20nm was?
    The leakage was really high so Nvidia/AMD decided to skip it. FinFET's helped reduce the leakage for the "14/16"nm node.

    That's apples to oranges. CPU's are already 3-4Ghz out of the box.

    RX480 isn't showing it because the 14nm LPP node is a lemon for GPU's.
    You know what's the optimal frequency for Polaris 10? 1Ghz. After that the required voltage shoots up.
    You know, LPP where the LP stands for Low Power. Great for SoC's but GPU's? Not so much.
    "But the SoC's clock higher than 2Ghz blabla". Yeah, well a) that's the CPU and b) it's freaking tiny.

    How are we getting 2Ghz+ frequencies with Pascal which so closely resembles Maxwell?
    Because of the smaller manufacturing node. How's that possible? It's because of FinFET's which reduced the leakage of the 20nm node.
    Why couldn't we have higher clockspeeds without FinFET's at 28nm? Because power.
    28nm GPU's capped around the 1.2-1.4Ghz mark.
    20nm was no go, too high leakage current.
    16nm gives you FinFET's which reduced the leakage current dramatically.
    What does that enable you to do? Increase the clockspeed..
    Here's a good article
  • lagittaja - Sunday, July 24, 2016 - link

    As an addition to the RX 480 / Polaris 10 clockspeed
    GCN2-GCN4 VDD vs Fmax at avg ASIC
  • timchen - Thursday, July 21, 2016 - link

    Another question is about boost 3.0: given that we see 150-200 Mhz gpu offset very common across boards, wouldn't it be beneficial to undervolt (i.e. disallow the highest voltage bins corresponding to this extra 150-200 Mhz) and offset at the same time to maintain performance at lower power consumption? Why did Nvidia not do this in the first place? (This is coming from reading Tom's saying that 1060 can be a 60w card having 80% of its performance...) Reply
  • AnnonymousCoward - Thursday, July 21, 2016 - link

    NVIDIA, get with the program and support VESA Adaptive-Sync already!!! When your $700 card can't support the VESA standard that's in my monitor, and as a result I have to live with more lag and lower framerate, something is seriously wrong. And why wouldn't you want to make your product more flexible?? I'm looking squarely at you, Tom Petersen. Don't get hung up on your G-sync patent and support VESA! Reply
  • AnnonymousCoward - Thursday, July 21, 2016 - link

    If the stock cards reach the 83C throttle point, I don't see what benefit an OC gives (won't you just reach that sooner?). It seems like raising the TDP or under-voltaging would boost continuous performance. Your thoughts? Reply
  • modeless - Friday, July 22, 2016 - link

    Thanks for the in depth FP16 section! I've been looking forward to the full review. I have to say this is puzzling. Why put it on there at all? Emulation would be faster. But anyway, NVIDIA announced a new Titan X just now! Does this one have FP16 for $1200? Instant buy for me if so. Reply
  • Ryan Smith - Friday, July 22, 2016 - link

    Emulation would be faster, but it would not be the same as running it on a real FP16x2 unit. It's the same purpose as FP64 units: for binary compatibility so that developers can write and debug Tesla applications on their GeForce GPU. Reply
  • hoohoo - Friday, July 22, 2016 - link

    Excellent article, Ryan, thank you!

    Especially the info on preemption and async/scheduling.

    I expected the preemption mght be expensive in some circumstances, but I didn't quite expect it to push the L2 cache though! Still this is a marked improvement for nVidia.
  • hoohoo - Friday, July 22, 2016 - link

    It seems like the preemption is implemented in the driver though? Are there actual h/w instructions to as it were "swap stack pointer", "push LDT", "swap instruction pointer"? Reply

Log in

Don't have an account? Sign up now