How do you keep increasing performance in a power constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option available - put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

NVIDIA recently revealed it was doing something similar to this with its upcoming Tegra 3 (Kal-El) SoC. NVIDIA outfitted its next-generation SoC with five CPU cores, although only a maximum of four are visible to the OS. If you’re running light tasks (background checking for email, SMS/MMS, or Twitter updates while your phone is locked) then a single low power Cortex A9 core services those needs while the higher performance A9s remain power gated. Request more of the OS (e.g. unlock your phone and load a webpage) and the low power A9 goes to sleep while the four high performance cores wake up.
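To make the idea concrete, here is a minimal sketch of the kind of switching policy described above. It is not NVIDIA’s actual implementation - Kal-El’s core migration is handled transparently below the OS - and the load thresholds and names are invented for illustration:

```c
/* Hypothetical cluster-switch policy sketch. Thresholds and names are
 * invented for illustration; the real logic lives in hardware/firmware. */
#include <stdio.h>

#define LOW_LOAD_PCT   15  /* assumed: below this, the companion core suffices */
#define HIGH_LOAD_PCT  60  /* assumed: above this, wake the fast cluster       */

typedef enum { COMPANION_LP, MAIN_G } cluster_t;

static cluster_t pick_cluster(cluster_t current, int load_pct)
{
    /* Hysteresis: don't bounce between clusters on small load changes. */
    if (current == COMPANION_LP && load_pct > HIGH_LOAD_PCT)
        return MAIN_G;       /* power-gate the LP core, wake the fast A9s */
    if (current == MAIN_G && load_pct < LOW_LOAD_PCT)
        return COMPANION_LP; /* workload is light again: drop back down   */
    return current;
}

int main(void)
{
    cluster_t c = COMPANION_LP;
    int samples[] = { 5, 10, 70, 80, 40, 8 };  /* simulated CPU load, in % */
    for (unsigned i = 0; i < sizeof samples / sizeof *samples; i++) {
        c = pick_cluster(c, samples[i]);
        printf("load %3d%% -> %s\n", samples[i],
               c == COMPANION_LP ? "LP companion core" : "main G cores");
    }
    return 0;
}
```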

While NVIDIA’s solution uses identical cores simply built using different transistors (LP vs. G), the premise doesn’t change if you move to physically different cores. When NVIDIA was designing Kal-El, ARM didn’t have a suitable low power core, so NVIDIA settled on a lower power implementation of the Cortex A9. Today, ARM is expanding the Cortex family to include a low power core that can either be used by itself or as an ISA-compatible companion core in Cortex A15 based SoCs. It’s called the ARM Cortex A7.

Architecture

Starting with the Cortex A9, ARM moved to an out-of-order execution core (instructions can be reordered around dependencies for improved parallelism) - a transition that we saw in the x86 space back in the days of the Pentium Pro. The Cortex A15 continues the trend as an OoO core but increases the width of the machine. The Cortex A7, however, takes a step back: it is another simple in-order core capable of issuing up to two instructions in parallel. This should sound a lot like the Cortex A8, however the A7 is different in a number of areas.
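A toy example of the difference: on a strict in-order core a slow instruction stalls everything behind it, while an OoO core can execute independent work early. The function below (names invented for illustration) mixes one dependent chain with two independent adds:

```c
/* Toy illustration of why out-of-order execution helps. On an in-order
 * core, the multiply below stalls everything behind it until the memory
 * load completes; an OoO core can slide the independent adds past the
 * stall. Names and values are invented for illustration. */
#include <stdio.h>

int kernel(const int *p, int a, int b, int c)
{
    int x = p[0] * 3;   /* depends on a (possibly slow) memory load     */
    int y = a + b;      /* independent: an OoO core executes this early */
    int z = b + c;      /* independent: likewise                        */
    return x + y + z;   /* joins the dependent and independent chains   */
}

int main(void)
{
    int data[1] = { 7 };
    printf("%d\n", kernel(data, 1, 2, 3));  /* 21 + 3 + 5 = 29 */
    return 0;
}
```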

The A8 is a very old design with work originally beginning on the core in 2003. Although ARM offered easily synthesizable versions of the core, in order to hit higher clock speeds you needed to include a lot of custom logic. The custom design requirements on A8 not only lengthened time to market but also increased development costs, limiting the A8’s overall reach. The Cortex A7 on the other hand would have to be fully synthesizable while being able to deliver good performance. ARM could leverage process technology advancements over the past few years to deliver clock speed and competitive power consumption, but it needed a revised architecture to meet the cost and time to market requirements.

The Cortex A7 features an 8-stage integer pipeline and is capable of dual-issue. Unlike the Cortex A8 however, the A7 cannot dual-issue floating point or NEON instructions. There are other instructions that turn the A7 into a single-issue machine as well. The integer execution cluster is quite similar to the Cortex A8’s, although the FPU is fully pipelined and more compact than its older brother’s.

Limiting issue width for more complex instructions helps keep die size in check, which was a definite goal for the core. ARM claims a single Cortex A7 core will measure only 0.5mm^2 on a 28nm process. On an equivalent process node ARM expects customers will be able to implement an A7 in 1/3 - 1/2 the die area of a Cortex A8. As a reference, an A9 core uses about the same (if not a little less) die area as an A8, while an A15 is a bit bigger than both.

Architecture Comparison
                       | ARM11        | ARM Cortex A7      | ARM Cortex A8   | ARM Cortex A9              | Qualcomm Scorpion | Qualcomm Krait
Decode                 | single-issue | partial dual-issue | 2-wide          | 2-wide                     | 2-wide            | 3-wide
Pipeline Depth         | 8 stages     | 8 stages           | 13 stages       | 8 stages                   | 10 stages         | 11 stages
Out of Order Execution | N            | N                  | N               | Y                          | Partial           | Y
Pipelined FPU          | Y            | Y                  | N               | Y                          | Y                 | Y
NEON                   | N/A          | Y (64-bit wide)    | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide)  | Y (128-bit wide)
Process Technology     | 90nm         | 40nm/28nm          | 65nm/45nm       | 40nm                       | 40nm              | 28nm
Typical Clock Speeds   | 412MHz       | 1.5GHz (28nm)      | 600MHz/1GHz     | 1.2GHz                     | 1GHz              | 1.5GHz

Despite the limited dual issue capabilities, ARM is hoping for better performance per clock and better overall performance out of the Cortex A7 compared to the Cortex A8. Branch prediction performance is improved partly by using a more modern predictor, and partly because the shallower pipeline lessens the mispredict penalty. The Cortex A7 features better prefetching algorithms to help improve efficiency. ARM also includes a very low latency L2 cache (10 cycles) with its Cortex A7 design, although actual latency can be configured by the partner during implementation.
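A rough back-of-the-envelope shows why the shallower pipeline helps. Assuming (invented numbers, not ARM’s published figures) that one in five instructions is a branch, 5% of branches are mispredicted, and the flush penalty roughly tracks pipeline depth:

```c
/* Back-of-the-envelope on branch mispredict cost. All numbers are
 * invented for illustration; equating flush penalty with pipeline
 * depth is itself a simplification. */
#include <stdio.h>

int main(void)
{
    double branch_freq = 0.20;  /* assumed: 1 in 5 instructions is a branch */
    double mispredict  = 0.05;  /* assumed: 5% of branches mispredicted     */
    int    penalty_a8  = 13;    /* deeper pipeline: ~13-cycle flush         */
    int    penalty_a7  = 8;     /* shallower pipeline: ~8-cycle flush       */

    /* Extra cycles per instruction lost to mispredicted branches. */
    printf("A8-like: %.3f CPI penalty\n", branch_freq * mispredict * penalty_a8);
    printf("A7-like: %.3f CPI penalty\n", branch_freq * mispredict * penalty_a7);
    return 0;
}
```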

Note that in decode-bound scenarios, the Cortex A7 will offer the same, if not lower, performance than a Cortex A8 due to its limited dual-issue capabilities. The mildly useful DMIPS/MHz ratings of ARM’s various cores are below:

Estimated Core Performance
          | ARM11 | ARM Cortex A7 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait
DMIPS/MHz | 1.25  | 1.9           | 2.0           | 2.5           | 2.1               | 3.3
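Multiplying the ratings above by the typical clock speeds from the earlier table gives a crude per-core throughput comparison - crude because DMIPS itself is only mildly useful, as noted:

```c
/* Quick arithmetic on the tables above: per-core DMIPS at typical
 * clocks. Treat these as rough comparisons only. */
#include <stdio.h>

int main(void)
{
    struct { const char *core; double dmips_per_mhz; int mhz; } cores[] = {
        { "Cortex A7 (1.5GHz)", 1.9, 1500 },
        { "Cortex A8 (1.0GHz)", 2.0, 1000 },
        { "Cortex A9 (1.2GHz)", 2.5, 1200 },
    };
    for (unsigned i = 0; i < sizeof cores / sizeof *cores; i++)
        printf("%s: %.0f DMIPS\n", cores[i].core,
               cores[i].dmips_per_mhz * cores[i].mhz);
    return 0;
}
```

Note that an A7 at 1.5GHz comes out ahead of an A8 at 1GHz despite the lower per-clock rating.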

The big news is that the Cortex A7 is 100% ISA compatible with the Cortex A15; this includes the new virtualization instructions, integer divide support and 40-bit memory addressing. Any code running on an A15 can run on a Cortex A7, just slower. This is a very important feature as it enables SoC vendors to build chips with both Cortex A7 and Cortex A15 cores, switching between them depending on workload requirements. ARM calls this a big.LITTLE configuration.
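As an illustration of what that ISA compatibility buys you: the same binary runs unmodified on either core type, and the only thing software might want to know is which core it currently occupies. The sketch below is a hypothetical Linux userspace check (not part of ARM’s big.LITTLE machinery), relying on the MIDR part numbers that /proc/cpuinfo exposes: 0xc07 for the Cortex A7 and 0xc0f for the Cortex A15.

```c
/* Hypothetical sketch: identify which big.LITTLE core type this thread
 * is currently running on, via the "CPU part" field of /proc/cpuinfo
 * on ARM Linux. Purely illustrative - in a real system the OS may
 * migrate the thread between clusters at any time. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    if (!f) return 1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "CPU part", 8) == 0) {
            if (strstr(line, "0xc07"))      puts("LITTLE core (Cortex A7)");
            else if (strstr(line, "0xc0f")) puts("big core (Cortex A15)");
            else                            printf("other core: %s", line);
        }
    }
    fclose(f);
    return 0;
}
```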

Comments

  • Manabu - Saturday, October 22, 2011

    The decode portion of the chip also has to grow if one wants higher IPC/power efficiency. So it is probably more than 0.3% nowadays.
  • fic2 - Wednesday, October 19, 2011

    The problem with Atom power-wise was that Intel stupidly decided to saddle it with a chipset that was 3 (?) fab generations behind (Atom: 45nm, chipset: 130nm) and used more power than the actual CPU. I don't know if they have corrected this part of the problem, but it seems to be an Intel trait - get most of it right but screw up on the last mile. (Compare that to AMD, which either gets it right (Zacate, Llano) or gets it very wrong (Bulldozer), or ARM, which seems to get most everything right.)
  • fteoath64 - Thursday, October 20, 2011

    Good point. Chipset development has been a secondary priority for Intel; this way they ensure the combo solution is good enough for the market to make the volumes they intended. Looking into the past, other chipset makers like Nvidia, AMD and SiS, even VIA at some stage, did better chipset implementations than Intel.

    Atom was not aggressive enough in leveraging low power and integrated few of the other SoC components, i.e. most of the chipset features - at the very least the north/south bridges, leaving only external I/O interfaces. A lousy slow GPU is their burden, so plenty of legacy left unsolved.

    Intel had better be careful, because the ARM A15 has the capability to upset x86 in the future via software emulation on multicore heterogeneous chips.
  • fteoath64 - Thursday, October 20, 2011

    Well said. Atom has already sunk; Intel is just in denial mode. I was suggesting that Intel swallow their pride and GET an ARM license so they can design and manufacture these chips and compete with Qualcomm and Samsung in the ARM market. At least it will give them some volume game instead of having zero in mobile.
    If they continue with the Atom architecture, they will learn a costly lesson later on. This way they could evolve a single low power SB core plus maybe a bare-bones Atom core - a small.BIG evolution. It could just secure their Win8 tablets for x86 (if that market ever develops...).
  • B3an - Wednesday, October 19, 2011

    A GPU is not a CPU. And Intel have said more than once before that they are NOT capable of matching ARM. And I don't expect them to for a long time, if ever.
  • Wolfpup - Wednesday, October 19, 2011

    Yeah, I was thinking "hey, Intel could stick an Atom on a better chip" before he said that.

    This IS very interesting (as is what Nvidia is doing even before this), but it's interesting because it's an interesting idea... I don't see how it affects Intel one way or the other. Obviously if other companies can do it, Intel can do it too.
  • JPForums - Thursday, October 20, 2011

    "It's this kind of heterogeneous SoC structure with different CPU architectures on a single die that will nail Intel to the wall and prevent them from ever really penetrating the mobile market."

    The idea of heterogeneous architectures isn't new. ARM is simply applying them differently. IBM's Cell processor (used in the PS3) uses a combination of general purpose processing core(s) and specialized lighter weight cores. Quite a while back Intel's vision of the future involved processors with a combination of a few complex heavy weight cores and many lightweight cores (think Larrabee or similar). With power saving features largely complete and an upcoming GPU that is supposed to be competitive, I wouldn't be surprised if Intel started to get more serious about bringing this to market. They have already made great strides with their tera-scale research: 48-core single chip, 80-core research chip.

    What ARM did that was innovative was to use a heterogeneous architecture for the purposes of power savings, and to make it appear homogeneous. I would argue that with Intel's focus on power gating and other power saving features, the idea of using a heterogeneous architecture to save power hasn't escaped them. However, full instruction set compatibility between the two architectures makes things much simpler, as the different cores remain largely transparent to the OS and applications. While it isn't really that hard to develop separate code paths to use more efficient instructions when available, this does raise the complexity on the OS for thread scheduling. Hiding these cores is mostly a convenience, though. It puts the burden of moving to a lower power core largely on the chip and again reduces the complexity of the thread scheduler.

    A more effective use of heterogeneous architectures would be to reveal the presence of all cores to the OS and to individually power gate them. (Individual power states would be even better.) This would allow the use of lower power cores at any time for threads that don't require higher performance, i.e. two high performance apps and low-priority OS background tasks would run on two A15 cores and an A7 rather than three A15s. Further, once the OS starts intelligently assigning tasks to processors, it can become advantageous to have slight differences in the architectures of some cores to support specific tasks.

    I see this move as a necessary one to get OS makers and app developers thinking along the lines of heterogeneous processing while providing a progressive migration path. Intel's IA-64 architecture failed largely due to the fact that it forced a clean break from past applications. AMD's AMD64 architecture succeeded because people didn't have to leave behind old applications and code going forward (at least not until they were ready to). That said, I don't think ARM intends to stop here long term. While a Cell-like approach with significantly different cores would be less than optimal, smaller differences like the lack of full NEON or SSE4 support on lower power cores shouldn't be much of a burden once the OS/apps are smart enough to route threads to a core with the necessary units available.
  • MJEvans - Wednesday, October 19, 2011

    This is exactly the kind of competition that the market has needed. AMD used to be able to stay on Intel's heels with intelligent decisions and hand-tuning to make the most of being on a mature process node as opposed to a cutting/bleeding edge one. ARM's decisions here represent the logic of applying that at a macroscopic (architectural) level.
  • Arnulf - Wednesday, October 19, 2011

    I was contemplating this a few months ago, before Kal-El was described in the media and the existence of its extra core was revealed to the public. Something along the lines of 2x Atom + 2x Sandy Bridge cores, with all cores visible to the OS.

    The OS should be able to identify each core and allocate the workload accordingly - i.e. the OS would grab one of the weaker cores for itself, schedule CPU intensive processes on the more powerful cores and keep one low-power core in reserve just in case (for antivirus etc.).

    This would result in CPUs with maximum TDP only ~5-10W above existing models, yet it would allow for far less context switching. There is no point in hiding those weaker cores from the OS; instead the OS should be intelligent enough to utilize them to the fullest extent.
  • psychobriggsy - Thursday, October 20, 2011

    For seamless *running* application migration between the different core types, they should both support the exact same instruction set extensions, which currently Atom and SB don't. I don't think that AMD's Bobcat and Bulldozer do either.

    I wouldn't say no to a chip comprising a Bulldozer module or two (like Trinity) and a couple of Bobcat cores as well for lower-power modes. This would surely save a lot of power over even Bulldozer in its lowest operational clock/power state.

    However, neither AMD nor Intel can compete on power against this ARM technology - A15 for performance (around Bobcat performance per core) and A7 for power saving (around 1GHz Atom performance per core, I would imagine). As soon as Intel takes a step towards lower power with Atom, ARM moves the goalposts. Even an Atom core implemented at 22nm can't compete with a 28nm 0.5mm^2 core... which is practically free in terms of silicon (even with a small L2 cache added on top).
