Next Stop: the Uncore

Continuing with our review of the Haswell architecture, let's again take a step back and use the Xeon 5500 as our reference point. The Xeon 5500 is based on the "Nehalem" architecture, and it helped Intel become dominant in the server market. Before the Xeon 5500, AMD's Opteron was still able to outperform the Xeons in quite a few applications (HPC and virtualization, for example), sometimes by significant margins. That changed with Nehalem, so the Xeon 5500 is a good reference point.

7-zip Benchmark – Single Threaded

The 27% cumulative IPC (integer only) improvement of Haswell mentioned earlier is more than just theory: Anand's review of the desktop Haswell CPUs confirmed this. At the same clock speed, the Haswell Core i7-4770K is about 21% faster than Nehalem. That is below the promised 27% performance increase, but 7-zip is among the applications known to have very low IPC.

Let's go back to the server world. Instead of increasing, clock speeds have declined from 2.93-3.2GHz (Xeon 5500) to 2.3-2.6GHz for the latest high-end parts. However, when Turbo Boost is enabled, 2.8-3.1GHz is possible with all cores active. So the clock speed of the high-end server CPUs is actually 5 to 20% lower, not 10% higher as in the desktop space. The gains Intel has made in IPC are thus partly negated by slightly lower clock speeds.
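
To make the clock speed comparison concrete, here is a minimal Python sketch that recomputes the relative clock changes from the figures quoted above. The function and constant names are made up for illustration, and the last helper rests on a deliberately simplistic assumption that is not from the article: that per-thread throughput scales as IPC gain times clock ratio.

```python
def pct_change(new_ghz: float, old_ghz: float) -> float:
    """Relative clock change from old to new, in percent (negative = slower)."""
    return (new_ghz / old_ghz - 1.0) * 100.0


def range_change(new_range, old_range):
    """Compare the low and high ends of two (low, high) clock ranges."""
    return tuple(round(pct_change(n, o), 1) for n, o in zip(new_range, old_range))


def per_thread_speedup(ipc_gain: float, new_ghz: float, old_ghz: float) -> float:
    """ASSUMPTION: per-thread throughput ~ IPC x clock. Real workloads deviate."""
    return (1.0 + ipc_gain) * new_ghz / old_ghz


# Clock ranges quoted in the text, in GHz (low end, high end).
XEON_5500 = (2.93, 3.2)
HASWELL_EP_BASE = (2.3, 2.6)
HASWELL_EP_TURBO_ALL_CORES = (2.8, 3.1)

if __name__ == "__main__":
    print("base clock change (%):", range_change(HASWELL_EP_BASE, XEON_5500))
    print("all-core Turbo change (%):",
          range_change(HASWELL_EP_TURBO_ALL_CORES, XEON_5500))
    # IPC gains from the text: 27% claimed, 21% measured in 7-zip.
    for ipc_gain in (0.27, 0.21):
        low = per_thread_speedup(ipc_gain, HASWELL_EP_BASE[0], XEON_5500[0])
        high = per_thread_speedup(ipc_gain, HASWELL_EP_TURBO_ALL_CORES[1], XEON_5500[1])
        print(f"IPC +{ipc_gain:.0%}: per-thread speedup between {low:.2f}x and {high:.2f}x")
```

With these endpoints, base clocks come out roughly 19 to 22% lower and all-core Turbo clocks roughly 3 to 4% lower, so the "5 to 20%" above is best read as a rounded range. Under the stated assumption, per-thread performance at base clocks lands close to parity with the Xeon 5500, which is one way to see why the extra cores, and the uncore feeding them, carry most of the gain.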

Clock speed has clearly been traded for more cores in most server SKUs. But the additional cores can prove extremely useful. The SAP S&D application – one of the best industry benchmarks – runs about three times faster (see further) on the latest Xeon E5-2699 v3 than on the Xeon 5500.

This clearly puts into perspective how important the uncore is for Xeons. The uncore makes the difference between a CPU that is only good at running a few handpicked benchmarks (like SPECint rate) but fails to achieve much in real applications, and an attractive product that can lower IT costs by running more virtual machines and offering services to more users.



View All Comments

  • shodanshok - Tuesday, September 16, 2014

    Please note that the RWT article you are endlessly posting is 10 (TEN!) years old.

    SGI says the exact opposite of what you report:

    Altix UV systems are shared memory systems connecting the various boards (4-8 sockets per board) via QPI and NUMAlink. They are basically a distributed version of your beloved scale-up server. After all, the maximum memory limit is 16 TB, which is the address space _a single Xeon_ can address.

    I am NOT saying that commodity x86 hardware can replace proprietary big boxes in every environment. What I am saying is that the market niche for big Unix boxes is rapidly shrinking.

    So, to recap:
    1) in an article about the Xeon E5 ($4000 max) you talk about the mighty M7 (which is NOT available), which will probably cost 10-20X as much (and even the T4/T5 are 3-5X);

    2) you speak about SPECint2006, conveniently skipping anything other than throughput and totally ignoring latency and per-thread performance (and even in pure throughput Xeons are very competitive at a fraction of the cost);

    3) you totally ignore the fact that QPI and NUMAlink enable multi-board systems to act as a single one, running a single kernel image within a shared memory environment.

    Don't get me wrong: I am not an Intel fan, but I must say I'm impressed with the Xeons it has been releasing for the past 4 years (since Nehalem-EX). Even their (small) Itanium niche is at risk, attacked by higher-end E7 systems.

    Maybe (and hopefully) Power8 and M7 will be earth-shattering, but they will surely cost much, much more...

  • Brutalizer - Friday, September 19, 2014

    This is folly. The link I posted, where SGI says their "Altix" server is only for HPC clustered workloads, also applies today to the "Altix" successor: the "Altix UV". The fact is that no large Unix or Mainframe vendor has successfully scaled beyond 32/64 sockets. And now SGI, a small cluster vendor with tiny resources compared to the large Unix companies, claims to have a 256-socket server?? Has SGI succeeded where no one else has, despite their pouring decades and billions into R&D?

    As a response you post a link where SGI talks about their "Altix UV", and you claim that link as evidence that the Altix UV server is not a cluster. Well, if you bothered to read your link, you would see that SGI has not changed their viewpoint: it is only for HPC clustered workloads. For instance, "Altix UV" talks about MPI. MPI is only used in clusters, mainly for number crunching. I have worked with MPI in scientific computations, so I know this. No one would use MPI in an SMP server, such as the Oracle M7. Anyone talking about MPI is also talking about clusters. For instance, enterprise software such as SAP does not use MPI.

    As a coup de grace, I quote text from your link about the latest "Altix UV" server:
    "...The key enabling feature of SGI Altix UV is the NUMAlink 5 interconnect, with additional performance characteristics contributed by the on-node hub and its MPI Offload Engine (MOE)...MOE is designed to take MPI process communications off the microprocessors, thereby reducing CPU overhead and lowering memory access latency, and thus improving MPI application performance and scalability. MOE allows MPI tasks to be handled in the hub, freeing the processor for computation. This highly desirable concept is being pursued by various switch providers in the HPC cluster arena;
    But fundamentally, HPC is about what the user can achieve, and it is this holy quest that SGI has always strived to enable with its architectures..."

    Maybe this is the reason you will not find SAP benchmarks on the largest "Altix UV" server? Because it is a cluster.

    But of course, you are free to disprove me by posting SAP benchmarks on a large Linux server with 10.000s of cores (i.e. clusters). I agree that if that SGI cluster runs SAP faster than an SMP 32-socket server, then it does not matter whether SGI is a cluster or not. The point is: clusters cannot run all workloads; they suck at Enterprise workloads. If they can run Enterprise workloads, then I will change my mind. Because, in the end, it does not matter how the hardware is constructed, as long as it can run SAP fast enough. But clusters cannot.

    Post SAP benchmarks on a large Linux server. Go ahead. Prove me wrong when you say they are not clusters - in that case they would be able to handle non clustered workloads such as SAP. :)
  • shodanshok - Friday, September 19, 2014

    Brutalizer, I am NOT (NOT!!!) saying that x86 is the best in the world in scale-up performance. After all, it remains commodity hardware, and some choices clearly reflect that. For example, while Intel puts single-image systems at as many as 256 sockets, the latency induced by the switches/interconnect surely puts the real number way lower.

    What I am saying is that the market that truly needs big Unix boxes is rapidly shrinking, so your comments about how "mediocre" this new 18-core monster is are totally out of place.

    Please note that:
    1) Altix UV are SHARED MEMORY systems built out of clusters, where the "secret sauce" is the added tech behind NUMAlink. Will SAP run well on these systems? I think not: the NUMAlinks add too much latency. However, this same tech can be used in a number of cases where big Unix boxes were the first choice (at least in SGI's words; unfortunately I don't have a similar system, so I can't tell more);

    2) HP has just released the SAP HANA benchmarks for 16 sockets Intel E7 in scale-up configuration (read: single system) and 12/16 TB of RAM
    LINK1 :

    3) Even at 8 sockets, the Intel systems are very competitive. Please read here for some benchmarks:
    Long story short: an 8S Intel E7-8890 (15 cores @ 2.8 GHz) beat an 8S Oracle T5-8 (16 cores @ 3.6 GHz) by a significant margin. Now think about 18 Haswell cores...

    4) On top of that, even high-end E7 Intel x86 systems are way cheaper than Oracle/IBM boxes, while providing similar performance. The real differentiation is the extreme RAS features integrated into proprietary Unix boxes (e.g. lockstep) that require custom, complex glue logic on x86. And yes, some Unix boxes have impressive amounts of memory ;)

    5) This article speaks about *Haswell-EP*. They are one (sometimes even two...) orders of magnitude cheaper than proprietary Unix boxes. So why on earth, in each Xeon article, do you complain about how mediocre that technology is?

  • Brutalizer - Monday, September 22, 2014

    I hear you when you say that x86 does not have the best scale-up performance. I am only saying that those 256-socket x86 servers you talk of are, in practice, nothing more than a cluster, because they are only used for clustered HPC workloads. They will never run Enterprise business software the way a large SMP server with 32/64 sockets does; that domain is exclusive to Unix/Mainframe servers.

    It seems that we disagree on the 256-socket x86 servers, but agree on everything else (x86 are cheaper than RISC, etc). I claim they can only be used as clusters (you will only find HPC cluster benchmarks). So, those large Linux servers with 10.000 cores such as SGI Altix UV, are actually only usable as clusters.

    Regarding the HP SAP HANA benchmarks with the 16-socket x86 server called ConvergedSystem 900: it is actually a Unix Superdome server (a RISC server) where HP swapped all the Itanium CPUs for x86 CPUs. Well, it is good that there will soon be 16-socket Linux servers available on the market. But HANA is a clustered database. I would like to see the HP ConvergedSystem server running non-clustered Enterprise workloads: how well would the first 16-socket Linux server perform? We have to see. And then we can compare the fresh Linux 16-socket server to the mature 32/64-socket Unix/Mainframe servers in benchmarks and see which is fastest. A clustered Linux 256-socket server sucks on SMP benchmarks; it would be useless.
  • Brutalizer - Monday, September 22, 2014
    "...The first of several systems that will bring technologies from Hewlett-Packard’s Superdome Itanium-based machines to big memory ProLiant servers based on Xeon processors is making its debut this week at SAP’s annual customer shindig.

    Code-named “Project Kraken,” the system is commercialized as the ConvergedSystem 900 for SAP HANA and as such has been tuned and certified to run the latest HANA in-memory database and runtime environment. The machine, part of a series of high-end shared memory systems collected known as “DragonHawk,” is part of a broader effort by HP to create Superdome-class machines out of Intel’s Xeon processors.

    The obvious question, with SAP allowing for HANA nodes to be clustered, is: Why bother with a big NUMA setup instead of a cluster? “If you look at HANA, it is really targeting three different workloads,” explains Miller. “You need low latency for transactions, and in fact, you can’t get that over a cluster...."
  • TiGr1982 - Tuesday, September 9, 2014

    Our RISC scale-up evangelist is back!

    That's OK and very nice, nobody argues, but I guess one has to win a serious jackpot to afford one of these 32 socket Oracle SPARC M7-based machines :)

    Jokes aside, technically, you are correct, but Xeon E5 is obviously not about the very best scale-up on the planet, because Intel is aiming more at a mainstream server market. So, Xeon E5 line resides in a totally different price range than your beasty 32 socket scale-up, so what's the point of writing about SPARC M7 here?
  • TiGr1982 - Tuesday, September 9, 2014

    Talking Intel, even Xeon E7 is a much lower class line in terms of total firepower (CPU and RAM capability) than your beloved 32-socket SPARC Mx, and even Xeon E7 is much cheaper than your Mx-32, so, again, what's the point of posting this in the article about E5?
  • Brutalizer - Wednesday, September 10, 2014

    The point is, people believe that building a huge SMP server with as many as 32 sockets is easy: just add a few Xeon E5s and you are good to go. That is wrong. It is exponentially more difficult to build an SMP server than a cluster. So, no one has ever sold such a huge Linux server with 32 sockets. (The IBM P795 is a Unix server that people tried to compile Linux for, but it is not a Linux server; it is a RISC AIX server.)
  • TiGr1982 - Wednesday, September 10, 2014

    Well, I comprehend and understand your message, and I agree with you. Huge SMP scale-up servers are really hard to build, mostly because of the dramatically increasing complexity of the problem to implement the REALLY fast (both in terms of bandwidth and latency) interconnect between sockets in case when socket count grows considerably (say, up to 32), which is really required in order to get the true SMP machine.

    I hope, other people get your message too.

    BTW, I remember you already posted this kind of statement in the Xeon E7 v2 article comments before :-)
  • Brutalizer - Monday, September 15, 2014

    "...I hope, other people get your message too...."

    Unfortunately, they don't. See shodanshok's reply above, claiming that the 256-socket Xeon servers are not clusters. And see my reply on why they are.
