Maxwell: Designed For Energy Efficiency

While Maxwell doesn’t come with a significant overhaul of its high level feature set, the same cannot be said for its low level design. In fact the consistency at a high level belies just how much work NVIDIA has done under the hood to improve efficiency. Maxwell isn’t a complete overhaul of NVIDIA’s designs, nor is it even as aggressive as Kepler was when it eliminated Fermi’s hot clocks in favor of a wider design, but it has a number of changes that are important to understanding the architecture and, more importantly, to understanding how NVIDIA is achieving their efficiency goals.

Broadly speaking, with Maxwell NVIDIA is almost solely focused on improving energy efficiency and performance per watt. This extends directly from NVIDIA’s mobile-first design strategy for Maxwell, where the company needs to maximize energy efficiency in order to compete and win within the mobile space. If NVIDIA can bring down their energy consumption, then due to the power limiting factor we mentioned earlier they can use that recovered power headroom to further improve their performance. This is again especially noticeable in SoC-class products and discrete mobile GPUs, due to the low power budgets these platforms provide.
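The power-limited argument can be put as simple arithmetic. The sketch below is only an illustration: the function name, the 60W budget, and the efficiency figures are made-up assumptions in arbitrary units, not NVIDIA numbers.

```python
def power_limited_performance(perf_per_watt: float, power_budget_w: float) -> float:
    """Sustained performance when the power budget is the limiting factor."""
    return perf_per_watt * power_budget_w


# Hypothetical figures for illustration only (arbitrary performance units).
budget_w = 60.0
baseline = power_limited_performance(perf_per_watt=1.0, power_budget_w=budget_w)
improved = power_limited_performance(perf_per_watt=1.5, power_budget_w=budget_w)

# With the budget held fixed, any efficiency gain shows up directly as performance.
speedup = improved / baseline
print(f"speedup at the same power budget: {speedup:.2f}x")
```

The tighter the budget, the more this relationship dominates, which is why the effect stands out most in SoC-class and mobile parts.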

To a lesser extent NVIDIA is also focused on space efficiency. GPU production costs and space efficiency go hand-in-hand, so there’s an interest in improving the density of their designs with Maxwell. This is especially the case when the earlier power savings allow for a wider GPU with a larger number of functional units within the same power envelope. Denser designs allow NVIDIA to offer performance similar to that of larger Kepler GPUs (e.g. GK106) with a smaller Maxwell GPU.

To achieve this NVIDIA has taken a number of steps, some of which they’ve shared with us at a high level and some of which they haven’t. NVIDIA is taking a bit of a “secret sauce” approach to Maxwell from a design level, so while we know a fair bit about its execution model we don’t know quite as much about the little changes that add up to Maxwell’s energy and space savings. However NVIDIA tells us that overall they’ve been able to outright double their performance-per-watt on Maxwell versus Kepler, which is nothing short of amazing given the fact that all of this is being done on the same 28nm process as Kepler.

We’ll go over execution flow and the other gritty details on the next page, but for now let’s start with a look at NVIDIA’s Streaming Multiprocessor designs for Kepler (SMX) and Maxwell (SMM).

Immediately we can see a significant difference in the layout between the SMX and the new SMM. Whereas the SMX was for all practical purposes a large, flat design with 4 warp schedulers and 15 different execution blocks, the SMM has been heavily partitioned. Physically each SMM is still one contiguous unit, not really all that different from an SMX. But logically the execution blocks which each warp scheduler can access have been greatly curtailed.

The end result is that in an SMX the 4 warp schedulers would share most of their execution resources and work out which warp was on which execution resource for any given cycle. But on an SMM, the warp schedulers are separated from one another and each is given complete dominion over a far smaller collection of execution resources. No longer do warp schedulers have to share FP32 CUDA cores, special function units, or load/store units, as each of those is replicated in every partition. Only texture units and FP64 CUDA cores are shared.
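As a rough way to picture the difference, the toy model below counts how many FP32 CUDA cores a single warp scheduler can dispatch to in each layout. It is a sketch under stated assumptions: the `SMLayout` class is hypothetical, the SMX is simplified to "every scheduler reaches every core", the four-way split of the SMM is assumed from the SMX's four schedulers, and the core counts are the 192 and 128 figures quoted for the two designs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLayout:
    """Hypothetical, simplified description of one streaming multiprocessor."""
    name: str
    fp32_cores: int
    warp_schedulers: int
    partitioned: bool  # True if each scheduler owns a private slice of the cores

    def cores_reachable_per_scheduler(self) -> int:
        # Shared design (simplified): any scheduler can dispatch to any core.
        # Partitioned design: a scheduler only sees its own slice.
        if self.partitioned:
            return self.fp32_cores // self.warp_schedulers
        return self.fp32_cores


smx = SMLayout("SMX (Kepler)", fp32_cores=192, warp_schedulers=4, partitioned=False)
# Assumption: the SMM keeps four schedulers, one per partition.
smm = SMLayout("SMM (Maxwell)", fp32_cores=128, warp_schedulers=4, partitioned=True)

for sm in (smx, smm):
    reach = sm.cores_reachable_per_scheduler()
    print(f"{sm.name}: each scheduler can reach {reach} FP32 cores")
```

A smaller reach per scheduler means less interconnect between schedulers and execution units and less coordination between schedulers, which is where the savings are expected to come from.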

Of the changes NVIDIA made to reduce power consumption, this is among the greatest. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks. They waste space and power if not fed, the crossbar to connect all of them is not particularly cheap on a power or area basis, and there is additional scheduling overhead from having to coordinate the actions of those warp schedulers. By forgoing the shared resources NVIDIA loses out on some of the performance benefits from the design, but what they gain in power and space efficiency more than makes up for it.

NVIDIA hasn’t given us hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size.
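Taken at face value, those two figures imply a per-core gain that is easy to work out. The snippet below is just that arithmetic; the variable names are illustrative, and the 90% figure is NVIDIA’s claim as quoted above rather than a measurement.

```python
smx_cores = 192           # CUDA cores in a Kepler SMX
smm_cores = 128           # CUDA cores in a Maxwell SMM
smm_relative_perf = 0.90  # SMM throughput as a fraction of SMX throughput (NVIDIA's claim)

# Performance delivered per core, SMM relative to SMX.
per_core_gain = smm_relative_perf / (smm_cores / smx_cores)
print(f"per-core throughput, SMM vs SMX: {per_core_gain:.2f}x")
```

In other words, 90% of the performance from two thirds of the cores works out to roughly 35% more work per CUDA core.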

Moving on, along with the SMM layout changes NVIDIA has also made a number of small tweaks to improve the IPC of the GPU. The scheduler has been rewritten to avoid stalls and otherwise behave more intelligently. Furthermore by achieving higher utilization of their existing hardware, NVIDIA doesn’t need as many functional units to hit their desired performance targets, which in turn saves on space and ultimately power consumption.

While on the subject of efficiency, NVIDIA has also been working on memory efficiency. From a performance perspective GDDR5 is very powerful, however it’s also very power hungry, especially in comparison to DDR3. With GM107 in particular being a 128-bit design that would need to compete with the likes of the 192-bit GK106, NVIDIA has massively increased the amount of L2 cache they use, from 256KB on GK107 to 2MB on GM107. This reduces the amount of traffic that needs to cross the memory bus, reducing both the power spent on the memory bus and the need for a larger memory bus altogether.
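The trade between bus width and cache can be sketched with two small formulas: peak bandwidth scales with bus width, while the traffic that actually reaches DRAM scales with the L2 miss rate. Everything numeric below other than the 128-bit and 192-bit bus widths is an assumption chosen for illustration, including the per-pin data rate and both hit rates; none of it is measured GM107 or GK106 data.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak DRAM bandwidth: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps


def dram_traffic_gb(requested_gb: float, l2_hit_rate: float) -> float:
    """Traffic that still has to cross the memory bus after the L2 filters requests."""
    return requested_gb * (1.0 - l2_hit_rate)


DATA_RATE_GBPS = 5.0  # assumed per-pin data rate, identical for both buses

narrow_bus = peak_bandwidth_gb_s(128, DATA_RATE_GBPS)
wide_bus = peak_bandwidth_gb_s(192, DATA_RATE_GBPS)

# Hypothetical hit rates for a small and a large L2 on the same 100 GB of requests.
small_cache_traffic = dram_traffic_gb(100.0, l2_hit_rate=0.25)
large_cache_traffic = dram_traffic_gb(100.0, l2_hit_rate=0.50)

print(f"128-bit bus: {narrow_bus:.0f} GB/s, 192-bit bus: {wide_bus:.0f} GB/s")
print(f"DRAM traffic: {small_cache_traffic:.0f} GB (small L2) vs {large_cache_traffic:.0f} GB (large L2)")
```

With these made-up hit rates the larger cache cuts DRAM traffic by a third, the same proportion by which the 128-bit bus is narrower than the 192-bit one. Real hit rates depend heavily on the workload, so this only shows the shape of the trade-off.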

Increasing the amount of cache always represents an interesting tradeoff since cache is something of a known quantity and is rather dense, but it’s only useful if there are memory stalls or other memory operations that it can cover. Consequently we often see cache implemented in relation to whether there are any other optimizations available. In some cases it makes more sense to use the transistors to build more functional units, and in other cases it makes sense to build the cache. After staying relatively stagnant on their cache sizes for so long, it looks like the balance has finally shifted and the cache increase makes the most sense for NVIDIA.

Of course even these changes are relatively high level from an ASIC perspective. There’s always the possibility for low-level changes and NVIDIA has followed through on these too. Case in point, both NVIDIA and AMD have been steadily improving their clock gating capabilities, and with Maxwell NVIDIA has taken another step in their designs. NVIDIA isn’t telling us just how fine grained their gating is now for Maxwell, but it’s a finer granularity than it was on Kepler. Given the new SM design, the most likely change is the ability to control the individual partitions and/or the functional units within those partitions, but this is just supposition on our part.

Finally there are the lowest of low level optimizations: transistor level optimizations. Again NVIDIA hasn’t provided a ton of details here, but they tell us they’ve gone through at the transistor level to squeeze out additional energy efficiency wherever they could find it. Given that TSMC 28nm is now a very mature process with well understood abilities and quirks, NVIDIA should be able to design and build their circuits to a tighter tolerance now than they would have been able to when working on GK107 over 2 years ago.


177 Comments


  • MrSpadge - Tuesday, February 18, 2014

    To be fair GTX650Ti Boost consumes ~100 W in the real world. Still a huge improvement!
  • NikosD - Tuesday, February 18, 2014

    Hello.

    I have a few questions regarding HTPC and video decoding.

    Can we say that we have a new video processor from Nvidia, with a new name like VP6, or is it more like a VP5.x?

    What is Nvidia calling the new video decoder?

    Why don't you add a 4K60 fps clip in order to test soon to be released HDMI 2.0 output ?

    If you run a benchmark using DXVA Checker between VP5 and VP6 (?) how much faster is VP6 in H.264 1080p, 4K clips ?

    Thanks!
  • Ryan Smith - Thursday, February 20, 2014

    NVIDIA doesn't have a name for it; at least not one they're sharing with us.
  • NikosD - Thursday, February 20, 2014

    Thanks.
    Is it possible to try a 4K60fps with Maxwell ?

    I wonder if it can decode it in realtime...
  • Flunk - Tuesday, February 18, 2014

    I think these will be a lot more exciting in laptops. Even if they're nowhere near Nvidia's claimed 2x Kepler efficiency per watt. On the desktop it's not really that big a deal. The top-end chip will probably be ~40% faster than the 780TI but that will be a while.
  • dylan522p - Tuesday, February 18, 2014

    the 880 will be much more powerful than the 780ti. More than 40% even. They could literally die shrink and throw a few more SMX's and the 40% would be achieved. I would imagine either they are gonna have a HUGE jump (80% +) or they are gonna do what they did with Kepler and release a 200W Sku that is about 50% faster and when 20nm yields are good enough have the 900 series come with 250W Skus.
  • Kevin G - Tuesday, February 18, 2014

    Very impressive performance for its power consumption. I can see an underclocked version of this card coming with a passive cooler for HTPC solutions. Perhaps that'd be a hypothetical GT740? I'm surprised that nVidia hasn't launched a mobile version of this chip. It seems like it'd be ideal for midrange laptops that still have discrete graphics.

    I suspect that the extra overclocking headroom is in reserve for a potential rebrand to a GTX 800 series product. (Though a straight die shrink of this design to 20 nm would provide even more headroom for a GTX 800/900 card.) nVidia could have held back to keep it below the more expensive GTX 660.

    Though ultimately I'm left wanting the bigger GM100 and GM104 chips. We're going to have to wait until 20 nm is ready but considering the jump Maxwell has provided in the low end of the market, I'm eager to see what it can do in the high end.
  • DanNeely - Tuesday, February 18, 2014

    ASUS has a 65W TDP GT 640 with a big 2 slot passive heat sink (GT640-DCSL-2GD3); with the 750 Ti only hitting 60W a passive version of it should be possible at near stock performance. I suspect the 740 will be a farther cut down 3 SMM model which might allow a single slot passive design.
  • PhoenixEnigma - Tuesday, February 18, 2014

    Passive cooling was my first thought as well - I've been looking for something to replace the 6770 in my HTPC with, and I wanted something both faster and passively cooled. There are already passive 7750s on the market, and the numbers in Bench put the 750Ti at about 9W more than the 7750 under real world load, so a vanilla 750 with a passive cooler should be entirely possible. Even a 750Ti might be doable, but that could be pushing things a little far.
  • evilspoons - Tuesday, February 18, 2014

    I need a new half-height HTPC card, my 2.5 year old Asus Radeon 6570 bit the dust last month (sparkly picture, one particular shade of grey turned random colours). If they can work out the kinks in this thing and underclock it a bit, it sounds like a good candidate.

    It feels like it's been a long time since anything new showed up in the half-height video card game.