The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads

Name: The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads
Item: The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads
Author: Johan De Gelas

by Johan De Gelas on March 31, 2016 12:30 PM EST

112 Comments | Add A Comment

112 Comments

Broadwell-EP: A 10,000 Foot View

What are the building blocks of a 22-core Xeon? The short answer: 24 cores, 2.5 MB L3-cache per core, 2 rings connected by 2 bridges (s-boxes) and several PCIe/QPI/home "agents".

The fact that only 22 of those 24 cores are activated in the top Xeon E5 SKU is purely a product differentiation decision. The 18 core Xeon E5 v3 used exactly the same die as the Xeon E7, and this has not changed in the new "Broadwell" generation.

The largest die (+/- 454 mm²), highest core (HCC) count SKUs still work with a two ring configuration connected by two bridges. The rings move data in opposite directions (clockwise/counter-clockwise) in order to reduce latency by allowing data to take the shortest path to the destination. The blue points indicate where data can jump onto the ring buses. Physical addresses are evenly distributed over the different cache slices (each 2.5 MB) to make sure that L3-cache accesses are also distributed, as a "hotspot" on one L3-cache slice would lower performance significantly. The L3-cache latency is rather variable: if the core is lucky enough to find the data in its own cache slice, only one extra cycle is needed (on top of the normal L1-L2-L3 latency). Getting a cacheline of another slice can cost up to 12 cycles, with an average cost of 6 cycles..

Meanwhile rings and other entities of the uncore work on a separate voltage plane and frequency. Power can be dynamically allocated to these entities, although the uncore parts are limited to 3 GHz.

Just like Haswell-EP, the Broadwell-EP Xeon E5 has three different die configurations. The second configuration supports 12 to 15 cores and is a smaller version (306mm²) of the third die configuration that we described above. These dies still have two memory controllers.

Otherwise the smallest 10 core die uses only one dual ring, two columns of cores, and only one memory controller. However, the memory controller drives 4 channels instead of 2, so there is a very small bandwidth penalty (5-10%) compared to the larger dies (HCC+MCC) with two memory controllers. The smaller die has a smaller L3-cache of course (25 MB max.). As the L3-cache gets smaller, latency is also a bit lower.

Cache Coherency

As the core count goes up, it gets increasingly complex to keep cache coherency. Intel uses the MESIF (Modified, Exclusive, shared, Invalid and Forward) protocol for cache coherency. The Home Agents inside the memory controller and the caching agents inside the L3-cache slice implement the cache coherency. To maintain consistency, a snoop mechanism is necessary. There are now no less than 4 different snoop methods.

The first, Early Snoop, was available starting with Sandy Bridge-EP models. With early snoop, caching agents broadcast snoop requests in the event of an L3-cache miss. Early snoop mode offers low latency, but it generates massive broadcasting traffic. As a result, it is not a good match for high core count dies running bandwidth intensive applications.

The second mode, Home Snoop, was introduced with Ivy Bridge. Cache line requests are no longer broadcasted but forwarded to the home agent in the home node. This adds a bit of latency, but significantly reduces the amount of cache coherency traffic.

Haswell-EP added a third mode, Cluster on Die (CoD). Each home agent has 14 KB directory cache. This directory cache keeps track of the contested cache lines to lower cache-to-cache transfer latencies. In the event of a request, it is checked first, and the directory cache returns a hit, snoops are only sent to indicated (by the directory cache) agents.

On Broadwell-EP, the dice are indeed split along the rings: all cores on one ring are one NUMA node, all other cores on the other ring make the second NUMA node. On Haswell-EP, the split was weirder, with one core of the second ring being a member of the first cluster. On top of that, CoD splits the processor in two NUMA nodes, more or less one node per ring.

The fourth mode, introduced with Broadwell EP, is the "home snoop" method, but improved with the use of the directory cache and yet another refinement called opportunistic snoop broadcast. This mode already starts snoops to the remote socket early and does the read of the memory directory in parallel instead of waiting for the latter to be done. This is the default snoop method on Broadwell EP.

This opportunistic snooping lowers the latency to remote memory.

These snoop modes can be set in the BIOS as you can see above.

Broadwell Reaches Xeon E5 Broadwell Architecture Improvements

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

112 Comments

View All Comments

Casper42 - Thursday, March 31, 2016 - link
HPE just dropped the 64GB LRDIMMs a week or two back.
They are now exactly 2x the 32GB LRDIMM as far as List Price goes.
LRDIMMs across the board are 31% more expensive than RDIMMs.
wishgranter - Tuesday, April 5, 2016 - link
http://www.techpowerup.com/221459/samsung-starts-m...
wishgranter - Tuesday, April 5, 2016 - link
While introducing a wide array of 10nm-class DDR4 modules with capacities ranging from 4GB for notebook PCs to 128GB for enterprise servers, Samsung will be extending its 20nm DRAM line-up with its new 10nm-class DRAM portfolio throughout the year.
nathanddrews - Thursday, March 31, 2016 - link
Perf/W is obviously a very exciting metric for server farmers and it generally exciting from a basic technology perspective, but it's absolute performance isn't amazing. Anyway, it's not like I'll be buying one anyway. LOL
asendra - Thursday, March 31, 2016 - link
This interest me in so far as this would be the updated processors in a supposedly-coming-this-year Mac Pro refresh. Not that I would personally fork that much cash, but I'm interested to see how much of a jump they will make.

But things seam rather bleak. No wonder they decided to wait 3 years for a refresh.
MrSpadge - Thursday, March 31, 2016 - link
Not sure which years you're counting in, but for the majority of us it takes 1.5 years from 09/2014 to today.
https://en.wikipedia.org/wiki/Haswell_%28microarch...
asendra - Thursday, March 31, 2016 - link
Apple didn't update the MacPros with Haswell-EP. They are still using Ivy Bridge
tipoo - Thursday, March 31, 2016 - link

Wonder what they'll do on the GPU side though. Too early for next generation 14nm FF GPUs from anyone, if Nvidia was even a choice due to OpenCL politics. Another GCN 1.0 part in 2016 would be...A bag of hurt.

Still waiting on the high end 15" rMBP to have something better than GCN 1.0...The performance, shockingly, hasn't come all that far from even my Iris Pro model. Maybe double, which is something, but I'd like larger than that to upgrade from integrated...
extide - Thursday, March 31, 2016 - link
Nah, if they refresh it late this year, like in august or something like that, then 14/16nm FF GPU's will be available.

At worst we would get GCN 1.2, but yeah it would suck to see 28nm GPU's put in there...
mdriftmeyer - Thursday, March 31, 2016 - link
On what planet do you not grasp FinFET 14nm end of June from AMD?

The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads

Broadwell-EP: A 10,000 Foot View

Cache Coherency

Post Your Comment

112 Comments

View All Comments

Casper42 - Thursday, March 31, 2016 - link

wishgranter - Tuesday, April 5, 2016 - link

wishgranter - Tuesday, April 5, 2016 - link

nathanddrews - Thursday, March 31, 2016 - link

asendra - Thursday, March 31, 2016 - link

MrSpadge - Thursday, March 31, 2016 - link

asendra - Thursday, March 31, 2016 - link

tipoo - Thursday, March 31, 2016 - link

extide - Thursday, March 31, 2016 - link

mdriftmeyer - Thursday, March 31, 2016 - link

Log in

Don't have an account? Sign up now