For the last few years, when discussing high performance computing, the word 'exascale' has gone hand-in-hand with the next generation of supercomputers. Even last month, on October 18th (10/18), HPC Twitter was awash with mentions of 'Exascale Day', signifying 10^18 operations per second (that's a million million million, or a quintillion). The 'Drive to Exascale' is one of the key targets for the supercomputing market over the next decade, and Intel is going in with the Aurora Supercomputer ordered by the Argonne National Laboratory.

Intel has been working on the Aurora contract for a while now, and the project has changed scope over time due to market changes and hardware setbacks. Initially announced several years ago as a 'deliver by 2020' project between Argonne, Cray, and Intel, the hardware was set to be built around Intel's Xeon Phi platform, providing high-throughput acceleration through Intel's AVX-512 instructions and the 10nm Knights Hill accelerator. This announcement was made before the recent revolution in AI acceleration, and before Intel subsequently killed off the Xeon Phi platform after adding AVX-512 to its server processors (with the final iteration, Knights Mill, receiving a very brief lifespan). Intel had to go back to the drawing board, and ever since Xeon Phi was scrapped, Intel's audience has been questioning what the company will bring to Aurora; until today, it had only said that the system would be built from a combination of Xeon CPUs and Xe GPUs.

An Aurora Node

As part of today's announcement, Intel has put some information on the table for a typical 'Aurora' compute node. While not giving any specifics such as core counts or memory types, the company stated that a standard node will contain two next generation CPUs and six next generation GPUs, all connected via new connectivity standards.

Those CPUs will be Sapphire Rapids CPUs, Intel's second generation of 10nm server processors, coming after the Ice Lake Xeons. The announcement today reaffirmed that Sapphire Rapids is a 2021 processor, and likely a late-2021 processor, as the company also confirmed that Ice Lake will have its volume ramp through late 2020. Judging from Intel's images, Sapphire Rapids is set to have eight memory channels per processor, with enough I/O to connect to three GPUs. Within an Aurora node, two of these Sapphire Rapids CPUs will be paired together, and will support the next generation of Intel Optane DC Persistent Memory (2nd Gen Optane DCPMM). We already know from other sources that Sapphire Rapids is likely to use DDR5 as well, although I don't believe Intel has said that outright at this point.

On the GPU side, all six of the GPUs per node will be Intel's new 7nm Ponte Vecchio Xe GPU. As we covered in the Ponte Vecchio (PV) announcement, the basis of PV will be small 7nm chiplets built on a microarchitecture variant of the Xe architecture. PV will employ a number of Intel's key packaging technologies, such as Foveros (die stacking), Intel's Embedded Multi-Die Interconnect Bridge (EMIB), and high bandwidth cache and memory. With regard to capabilities, Intel has only stated that PV will have a vector matrix unit and high double precision performance, the latter likely being a requirement for the research that Argonne performs.

The other core technology in an Aurora node is the use of the new Compute Express Link (CXL) standard. CXL will allow the CPUs and GPUs to connect directly together and work within a unified memory space. Intel stated in our briefings that the GPUs will be connected in an all-to-all topology, even though the imagery provided doesn't show that: in the diagram, every PV GPU is connected to three others, with the top GPUs connected directly to the CPU. It is unclear at this point whether the CPUs talk to each other over CXL, or which parts of the CXL standard Aurora will be using; however, as CXL is built on the PCIe 5.0 physical layer, it will be interesting to see how (or if) Intel produces hardware for each interconnect standard.

Each Aurora node will have eight fabric endpoints, giving plenty of options for topology and connectivity. With the system being built in part by Cray, the nodes will be connected by a version of Cray's Slingshot networking architecture, which is also being used for the other early-2020s US supercomputers. Intel has stated that Slingshot will connect ~200 racks for Aurora, featuring a total of 10 petabytes of memory and 230 petabytes of storage.

Guessing Ponte Vecchio Performance

With this information, if we spitball some numbers on performance, here's what we end up with:

  • Going with Intel's number of 200 racks,
  • Assume each rack is a standard 42U,
  • Each Aurora node is a standard 2U,
  • Take out 6U per rack for networking and support,
  • Take out 1/3 of the racks for storage and other systems,
  • We get a rounded value of 2400 total Aurora nodes (2394 based on these assumptions).

This means we get just south of 5000 Sapphire Rapids CPUs and around 15000 Ponte Vecchio GPUs for the whole of Aurora. Now when calculating the total performance of a supercomputer, how much of the work falls on the CPUs versus the accelerators depends on the workload. If Ponte Vecchio is truly an exascale-class GPU, then let's assume that the GPUs are where the compute is. If we divide 1 ExaFLOP by 15000 units, we're looking at 66.6 TeraFLOPs per GPU. Current GPUs will do in the region of 14 TF on FP32, so we could assume that Intel is looking at a ~5x increase in per-GPU performance by 2021/2022 for HPC. Of course, this says nothing about power consumption, and if we were to do the same math at 4U per node, then it would be ~7500 GPUs, and we would need ~133 TF per GPU to reach exascale.
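
For those who want to follow the arithmetic, here is a minimal sketch of the estimate above. Everything in it is one of the article's assumptions rather than a confirmed Intel figure, and because it keeps the unrounded node count (2,394), it lands at roughly 14,400 GPUs and just under 70 TF per GPU rather than the rounded 15,000 GPUs and 66.6 TF used in the text.

```cpp
// Back-of-the-envelope sketch of the node count and per-GPU FLOPS estimate.
// All inputs are the assumptions listed above, not confirmed Intel figures.
#include <cstdio>

int main() {
    const int total_racks    = 200;                  // Intel's stated figure
    const int compute_racks  = total_racks * 2 / 3;  // ~1/3 assumed for storage/other systems
    const int usable_u       = 42 - 6;               // 42U rack minus 6U for networking/support
    const int nodes_per_rack = usable_u / 2;         // assuming 2U per Aurora node
    const int nodes          = compute_racks * nodes_per_rack;   // 2394

    const int cpus = nodes * 2;                      // two Sapphire Rapids per node
    const int gpus = nodes * 6;                      // six Ponte Vecchio per node

    // If the GPUs do all of the compute, each one needs this much to reach 1 ExaFLOP.
    const double tf_per_gpu = 1e18 / gpus / 1e12;

    std::printf("nodes: %d, CPUs: %d, GPUs: %d, ~%.1f TF per GPU\n",
                nodes, cpus, gpus, tf_per_gpu);
    return 0;
}
```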

Delivering The Hardware

Despite initially being a ‘deliver by 2020’ project, Intel is now saying that Aurora will be delivered in 2021. This means that Intel has to execute on the following:

  • Manufacturing at 10nm++ with sufficient yields at the required core counts and frequencies
  • Manufacturing at 7nm for chiplets, both in yield and frequencies
  • Transition through DDR4 to DDR5 in that time frame
  • Transition through PCIe 3.0 to PCIe 4.0 and PCIe 5.0 in that time frame
  • Release and detail its 2nd generation Optane DC Persistent Memory
  • Provide an SDK to deal with all of the above
  • Teach researchers to use it

On those last two points, Intel has stated that Ponte Vecchio and Aurora will be primary beneficiaries of the company's new OneAPI SDK. This industry initiative, spearheaded by Intel, is designed around a single cross-architecture language called 'Data Parallel C++', based on C++ and SYCL, which can pull in libraries designed to speak to the various elements of Intel's hardware chain. The idea is that software developers can write their code once, link the appropriate libraries, and then cross-compile for different Intel hardware targets. In Intel's parlance, this is the 'xPU' strategy, covering CPU, GPU, FPGA, and other accelerators. Aside from Argonne/Cray, Intel is citing Lenovo and Atos as key partners in the OneAPI strategy.
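
To give a flavour of the 'write once, cross-compile' idea, below is a minimal SYCL-style vector-add sketch in the vein of Data Parallel C++. This is purely illustrative and not Argonne or Aurora code; exact headers and selector names vary between SYCL and oneAPI revisions. The point is that the same kernel source can be compiled for whatever device the queue is bound to, whether that is a CPU, a GPU, or an FPGA.

```cpp
// Minimal SYCL 1.2.1-style vector add, illustrating the single-source xPU model.
// Hypothetical example code, not taken from the oneAPI SDK or Aurora.
#include <CL/sycl.hpp>
#include <vector>
#include <iostream>

namespace s = cl::sycl;

int main() {
    constexpr size_t N = 1 << 20;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

    // The queue binds to whatever device the runtime selects: CPU, GPU, FPGA, etc.
    s::queue q{s::default_selector{}};
    {
        s::buffer<float, 1> A{a.data(), s::range<1>{N}};
        s::buffer<float, 1> B{b.data(), s::range<1>{N}};
        s::buffer<float, 1> C{c.data(), s::range<1>{N}};

        // The same kernel source is compiled for each hardware target.
        q.submit([&](s::handler& h) {
            auto rA = A.get_access<s::access::mode::read>(h);
            auto rB = B.get_access<s::access::mode::read>(h);
            auto wC = C.get_access<s::access::mode::write>(h);
            h.parallel_for<class vec_add>(s::range<1>{N}, [=](s::id<1> i) {
                wC[i] = rA[i] + rB[i];
            });
        });
    }   // buffers go out of scope here, copying results back to the host vectors

    std::cout << "c[0] = " << c[0] << std::endl;   // expect 3
    return 0;
}
```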

In order to achieve a single exascale machine, you need several things. First is hardware: the higher the performance of a single unit of hardware, the fewer units you need and the less infrastructure you need. Second is that infrastructure, and being able to support such a large machine. Third is the right problem, one that doesn't fall afoul of Amdahl's Law: a workload so embarrassingly parallel that it can be spread across all the hardware in the system is an HPC scientist's dream. Finally, you need money. Buckets of it.
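
On the Amdahl's Law point, a quick hypothetical illustration (the fractions below are made up, not Aurora workload data): the speedup from N processing units is 1 / ((1 - p) + p/N), where p is the fraction of the work that can run in parallel, so even a tiny serial fraction caps what thousands of GPUs can deliver.

```cpp
// Amdahl's Law sketch with made-up parallel fractions, not Aurora data.
// speedup(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction.
#include <cstdio>
#include <initializer_list>

double amdahl_speedup(double p, double n_units) {
    return 1.0 / ((1.0 - p) + p / n_units);
}

int main() {
    const double gpus = 15000.0;   // roughly the GPU count estimated earlier
    for (double p : {0.99, 0.999, 0.9999}) {
        std::printf("parallel fraction %.4f -> speedup on %.0f GPUs: ~%.0fx\n",
                    p, gpus, amdahl_speedup(p, gpus));
    }
    return 0;
}
```

Even a workload that is 99.99% parallel tops out at roughly a 6,000x speedup across 15,000 GPUs, which is why embarrassingly parallel problems are the ones that justify machines of this scale.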

In the past few years, the top supercomputers in the world have scaled all of these requirements by orders of magnitude: hardware is faster, infrastructure is readily available, traditionally serial problems are being re-architected to be parallel (or lower precision), and governments are spending more than ever on single supercomputers. A $5 million supercomputer used to get a research group into the top echelons of the Top 500 list; now it's more like $150+ million. The current leader, Summit, had an estimated build cost of $200 million. And Aurora will handily top that, with the deal valued at more than $500 million.

When I asked before the announcement if Argonne would be the first customer of Ponte Vecchio, the Intel executive heading the briefing dodged the question by instead answering one of his own, saying 'we're not disclosing other customers'. In all my time as a journalist, I don't think I've ever had a question go unanswered and dodged in quite such a way. Unfortunately I had no opportunity in the open Q&A session to follow up.

Ultimately, based on what we know so far, Intel still has a lot of work to do to deliver this by 2021. It may be the case that Aurora is the culmination of all of its technologies, and that Intel is prepared to use the opportunity to help smooth out any issues that might arise. At any rate, some of these technologies might not get into the hands of the rest of us until 2022.

Hopefully, when Aurora is up and running, we’ll get a chance to take a look. Watch this space.

Comments

  • mdriftmeyer - Monday, November 18, 2019 - link

    Intel pushing the goal posts back is nothing new. The fact that Anandtech isn't critical of these 'claims' versus reality is nothing new.
  • Yojimbo - Monday, November 18, 2019 - link

    Regarding the Aurora Supercomputer being the first customer of Ponte Vecchio, my opinion is that the answer was dodged in such a way because the answer is a resounding yes. There is so much new technology going into this that it cannot possibly be profitable for commercial usage: a new 7 nm node, EMIB, chip stacking, CXL, and last but not least, Intel's first attempt at a compute-focused GPU built on a mostly untested GPU architecture. Then one must consider that the software ecosystem won't exist for commercial GPU compute, yet. Aurora will be programmed with compiler directives, but commercial users will want to be able to develop their own codes based on lower-level APIs. There will be a lot of work to do to enable that to be done smoothly and productively. I think that the first users, and perhaps only users with much volume, of these Ponte Vecchio GPUs will be HPC shops, and no HPC shop is going to get these GPUs before Argonne.
  • Kevin G - Monday, November 18, 2019 - link

    This makes me wonder how big the chiplet dies are. Intel was hit hard by 10 nm issues, with troubles making the first Cannon Lake chip, which was only ~71 mm^2. Eight of those is a 'mere' 568 mm^2, which is possible today on current nodes. EUV is supposed to reduce the maximum area they can manufacture, so that may not be possible anymore at 7 nm (and EUV is going to be another first here for Intel too).

    However, I would argue that Intel will have the packaging side of things squared up in time. This isn't their first EMIB product, nor will it be their first Foveros product. CXL will ship first in Sapphire Rapids, mainly because it has to, but the first Intel peripheral to use it will likely be a 400G Ethernet controller. Intel has already licensed out CXL, so even the expected Intel CXL enabled NIC may not be the first.

    The real gamble is going to be the oneAPI libraries. Intel has a tradition of being horrible when it comes to the software side of things. Larrabee died as a GPU because they simply couldn't get the software to work as desired. Intel never figured out the compiler magic to get Itanium to be competitive.

    Intel is doing a lot but they have done much of it before at a smaller scale.
  • JayNor - Tuesday, November 19, 2019 - link

    Intel plans to offer PCIe 5.0 and CXL in an Agilex I series, announced back in April. It would make sense for Intel to use these for emulation, and perhaps they will also appear ahead of the Aurora project. Intel has been sampling the F series since August. I've seen reports that the I series features were demoed in the lab.

    https://www.intel.com/content/www/us/en/products/p...
  • JayNor - Tuesday, November 19, 2019 - link

    "In addition to the DOE's Aurora supercomputer, Lenovo and Atos also plan to build HPC platforms using Intel's Xeon processors, Xe GPUs and oneAPI unified programming layer, according to Intel."

    https://www.crn.com/news/components-peripherals/in...
  • del42sa - Monday, November 18, 2019 - link

    10nm +++ :-D
  • yeeeeman - Monday, November 18, 2019 - link

    What is the problem with improving what you already have?
    If you compare first gen 14nm products (aka Broadwell) with the latest (Comet Lake), you will see a huge improvement in both frequency and power.
  • Spunjji - Monday, November 18, 2019 - link

    The joke is that they always did it, but only recently did they feel the need to start adding the +++ for marketing purposes.
  • name99 - Monday, November 18, 2019 - link

    It depends on your business model.
    For TSMC it makes sense because they support a wide range of customers operating on a range of lithographies.
    For Intel it's more of a problem, to the extent that optimizing 14nm does not get you much (or any) insight into how to improve 10nm or 7nm. Intel's current business does not really operate on providing a wide range of processes (e.g. still shipping stuff on 14nm+++ at the same time as 7nm). Sure they can try to wiggle around it with chiplets, but it's unclear that the demand numbers really balance.

    In other words it's a problem in that for TSMC each dollar invested in that way (optimizing non-leading edge process) results in a long stream of revenue. For Intel each such dollar only results in a short burst of revenue.
    This doesn't matter as long as Intel has a monopoly. But that monopoly appears to be crumbling... (No, no, our recent halving of Xeon prices had absolutely nothing to do with AMD, not at all. We just wanted to show our customers how much we appreciate them.)
  • JayNor - Tuesday, November 19, 2019 - link

    In the most recent quarterly cc, Intel stated that their 14nm profitability will remain high due to the manufacturing equipment being almost completely depreciated. Here's the quote:

    "The third thing is just George flagged this I just simplified. There's no transition and for us no transition next year is going to be 14-nanometer, we'll be a little better in terms of its profitability. Yields won't be dramatically different because we're extremely mature. But depreciation levels will be lower, because a lot of these tools have been fully depreciated, there because we've been on that node for so long. So, for the node transition 14 will be a little bit better."

    https://seekingalpha.com/article/4298931-intel-cor...
