How do you keep increasing performance in a power-constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you'll run out of steam there. You can transition to newer, more power-efficient process technologies, but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it's potentially not enough. Heterogeneous multiprocessing is another option: put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

NVIDIA recently revealed it was doing something similar to this with its upcoming Tegra 3 (Kal-El) SoC. NVIDIA outfitted its next-generation SoC with five CPU cores, although only a maximum of four are visible to the OS. If you're running light tasks (checking for email in the background, SMS/MMS, Twitter updates while your phone is locked) then a single low power Cortex A9 core services those needs while the higher performance A9s remain power gated. Request more of the OS (e.g. unlock your phone and load a webpage) and the low power A9 goes to sleep and the four high performance cores wake up.

While NVIDIA’s solution uses identical cores simply built using different transistors (LP vs. G), the premise doesn’t change if you move to physically different cores. For NVIDIA, ARM didn’t really have a suitable low power core thus it settled on a lower power Cortex A9. Today, ARM is expanding the Cortex family to include a low power core that can either be used by itself or as an ISA-compatible companion core in Cortex A15 based SoCs. It’s called the ARM Cortex A7.

Architecture

Starting with the Cortex A9, ARM moved to an out-of-order execution core (instructions can be reordered around dependencies for improved parallelism) - a transition that we saw in the x86 space back in the days of the Pentium Pro. The Cortex A15 continues the trend as an OoO core but increases the width of the machine. The Cortex A7, however, takes a step back: it is another simple in-order core capable of issuing up to two instructions in parallel. This should sound a lot like the Cortex A8; however, the A7 differs in a number of areas.
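
To make the difference concrete, here is a toy C fragment (our illustration, not from ARM; assume the compiler emits the statements in program order):

```c
typedef struct { long x; } item;

/* 'b' depends on a load that may miss in cache; 'c' does not. */
long demo(const item *p, long q) {
    long a = p->x;   /* load: may miss in cache and stall        */
    long b = a * 5;  /* depends on the load, must wait for it    */
    long c = q * 7;  /* independent: an OoO core (A9, A15) can
                        execute this while the load is pending;
                        an in-order core (A7, A8) cannot issue it
                        until b ahead of it has issued            */
    return b + c;
}
```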

The A8 is a very old design with work originally beginning on the core in 2003. Although ARM offered easily synthesizable versions of the core, in order to hit higher clock speeds you needed to include a lot of custom logic. The custom design requirements on A8 not only lengthened time to market but also increased development costs, limiting the A8’s overall reach. The Cortex A7 on the other hand would have to be fully synthesizable while being able to deliver good performance. ARM could leverage process technology advancements over the past few years to deliver clock speed and competitive power consumption, but it needed a revised architecture to meet the cost and time to market requirements.

The Cortex A7 features an 8-stage integer pipeline and is capable of dual-issue. Unlike the Cortex A8, however, the A7 cannot dual-issue floating point or NEON instructions. There are other instructions that turn the A7 into a single-issue machine as well. The integer execution cluster is quite similar to the Cortex A8's, although the FPU is fully pipelined and more compact than the A8's.
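
As a hypothetical example of the kind of code this affects (our sketch, not ARM's), a NEON inner loop like the one below can pair instructions on the A8 but issues them one at a time on the A7, so it is bound by issue width rather than by the FPU itself:

```c
#include <arm_neon.h>

/* y[i] += a * x[i] using NEON intrinsics, 4 floats per op. */
void saxpy_neon(float *y, const float *x, float a, int n) {
    float32x4_t va = vdupq_n_f32(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 floats  */
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);          /* vy += va * vx  */
        vst1q_f32(y + i, vy);                /* store 4 floats */
    }
    for (; i < n; i++)                       /* scalar tail    */
        y[i] += a * x[i];
}
```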

Limiting issue width for more complex instructions helps keep die size in check, which was a definite goal for the core. ARM claims a single Cortex A7 core will measure only 0.5mm² on a 28nm process. On an equivalent process node ARM expects customers will be able to implement an A7 in 1/3 - 1/2 the die area of a Cortex A8. As a reference, an A9 core uses about the same (if not a little less) die area as an A8, while an A15 is a bit bigger than both.

Architecture Comparison
| | ARM11 | ARM Cortex A7 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|---|
| Decode | single-issue | partial dual-issue | 2-wide | 2-wide | 2-wide | 3-wide |
| Pipeline Depth | 8 stages | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages |
| Out of Order Execution | N | N | N | Y | Partial | Y |
| Pipelined FPU | Y | Y | N | Y | Y | Y |
| NEON | N/A | Y (64-bit wide) | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide) |
| Process Technology | 90nm | 40nm/28nm | 65nm/45nm | 40nm | 40nm | 28nm |
| Typical Clock Speeds | 412MHz | 1.5GHz (28nm) | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz |

Despite the limited dual issue capabilities, ARM is hoping for better performance per clock and better overall performance out of the Cortex A7 compared to the Cortex A8. Branch prediction performance is improved partly by using a more modern predictor, and partly because the shallower pipeline lessens the mispredict penalty. The Cortex A7 features better prefetching algorithms to help improve efficiency. ARM also includes a very low latency L2 cache (10 cycles) with its Cortex A7 design, although actual latency can be configured by the partner during implementation.
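
That 10-cycle figure is easy to put in perspective: at the A7's 1.5GHz design point it works out to roughly 6.7ns per load. A classic way to sanity-check a number like this is a pointer-chasing microbenchmark - the sketch below is the generic technique, not an ARM tool, and the buffer size and iteration count are our assumptions. Every load's address depends on the previous load's result, and the chain is randomized so the prefetchers mentioned above can't hide the latency:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE  64               /* one cache line               */
#define BUFSIZE (256 * 1024)     /* assumption: > L1, fits in L2 */
#define ITERS   (64L * 1024 * 1024)

int main(void) {
    size_t slots = BUFSIZE / STRIDE;
    char *buf = malloc(BUFSIZE);

    /* Build a random permutation, then link each slot to the next
       so the chain visits every cache line in random order. */
    size_t *idx = malloc(slots * sizeof *idx);
    for (size_t i = 0; i < slots; i++) idx[i] = i;
    for (size_t i = slots - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < slots; i++)
        *(void **)(buf + idx[i] * STRIDE) =
            buf + idx[(i + 1) % slots] * STRIDE;

    void *p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        p = *(void **)p;         /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per load (p=%p)\n", ns / ITERS, p);
    free(idx);
    free(buf);
    return 0;
}
```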

Note that in decode-bound scenarios, the Cortex A7 will offer the same, if not lower, performance than a Cortex A8 due to its limited dual-issue capabilities. The mildly useful DMIPS/MHz ratings of ARM's various cores are below:

Estimated Core Performance
| | ARM11 | ARM Cortex A7 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|---|
| DMIPS/MHz | 1.25 | 1.9 | 2.0 | 2.5 | 2.1 | 3.3 |
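
Multiplying those ratings by the typical clock speeds above gives a rough sense of absolute throughput: a 28nm Cortex A7 at 1.5GHz works out to about 1.9 × 1500 ≈ 2850 DMIPS, comfortably ahead of a 1GHz Cortex A8's 2.0 × 1000 = 2000 DMIPS, while a 1.2GHz Cortex A9 lands near 3000 DMIPS and a 1.5GHz Krait near 4950 DMIPS.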

The big news is that the Cortex A7 is 100% ISA compatible with the Cortex A15; this includes the new virtualization instructions, integer divide support and 40-bit memory addressing. Any code running on an A15 can run on a Cortex A7, just slower. This is a very important feature as it enables SoC vendors to build chips with both Cortex A7 and Cortex A15 cores, switching between them depending on workload requirements. ARM calls this a big.LITTLE configuration.
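
ISA compatibility is what makes the migration transparent: the kernel can move a thread between clusters without any help from the application. As a toy user-space illustration only - real big.LITTLE switching happens in the kernel/firmware, and the CPU numbering below is a hypothetical layout where CPU 0 is an A7 and CPUs 1-4 are A15s - migration amounts to nothing more than changing which core a thread may run on:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

int main(void) {
    pin_to_cpu(0);   /* light work: run on the LITTLE core */
    /* ... poll for mail, background sync ... */

    pin_to_cpu(1);   /* heavy work: migrate to a big core */
    /* ... render a web page ... */
    return 0;
}
```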

big.LITTLE: Heterogeneous ARM MP
Comments

  • kgardas - Wednesday, October 19, 2011 - link

    While this is a nice article, I really hate that people usually forget Marvell's Armada 628 -- a heterogeneous triple-core -- and just talk about NVIDIA's Kal-El.
    Otherwise a combination of a single-core A7 + 4-8 cores of A15 looks like the killer SoC. :-)
  • fteoath64 - Thursday, October 20, 2011 - link

    @Kgardas: While the chip looks good on paper, it supports USB3, which is of no use in mobile, so it is likely to be used in NAS, media player and STB applications. The chip also looks big in size, so I guess its power consumption at full bore is significant in ARM terms.
  • introiboad - Thursday, October 20, 2011 - link

    The OMAP family also does this, with Cortex-A and Cortex-M cores put together on a single die. It's quite similar to what ARM describes here, except of course the instruction sets are not the same.
  • lancedal - Wednesday, October 19, 2011 - link

    But it is very hard to realize. Today's OSes are not application aware, meaning they would not know that a thread from application X needs 1GHz while another thread from application Y only requires 1MHz. As such, they would not be able to dynamically move threads from one core to another while guaranteeing that no deadline is missed.
    If the small core is dedicated to housekeeping threads only (i.e. sync, standby, etc.), that is all good, but there is no need to do that anyway because such tasks are so infrequent (every hundred ms or so). Therefore, you can just wake up the big core and shut it down again.
  • hechacker1 - Wednesday, October 19, 2011 - link

    I don't think it's so hard. The scheduler would start all processes on the slow core, and if the CPU utilization doesn't exceed its maximum over a very short period of time, keep them there, because they obviously don't need the extra processing power.

    The Android schedulers (CFS and BFS) are nanosecond-aware, so the latency penalty could be managed.

    Of course, it would be best if the programmer could explicitly pin their program to a core, but you can already do that with the SCHED_RR, SCHED_FIFO, and SCHED_BATCH policies. The question is really how far Android optimizes for this sort of thing. Right now I think they treat everything as realtime FIFO queues, instead of letting the built-in Linux schedulers do their thing.
  • lancedal - Thursday, October 20, 2011 - link

    Keep in mind that there are performance and latency penalties when powering the big core on and off. When you power up the big core, it takes time (to power up, reset and boot). Its L2$ is empty, so it needs to be warmed up. All of this adds a performance impact that is very hard for the OS to take into account.

    As you said, if programmers could specify performance/deadline/CPU requirements, then everything would be simple. However, we can't expect that from the millions of developers out there. It's just not practical.
  • Penti - Thursday, October 20, 2011 - link

    Boot? They would sit there as cores in a low power state, of which there would be several; it won't take tens of seconds to start up. I'm sure the OSes can handle the power states, clocks and scheduling pretty well. The OS would know when it's in a power saving mode or not, and when it needs the performance or not. It would very much depend on power profiles and so on. I would not expect them to be in a deep sleep mode every time resources are needed. But its point is of course battery life, not performance. A flag on programs that want to run on the big cores would probably be easy to implement on a system like Android, I would think, as it's not just pure Linux ELFs. But we will see what kind of schemes there will be soon, I guess.
  • lancedal - Tuesday, October 25, 2011 - link

    Leakage is the reason they have the "big-small" setup. If the big core is not used, it will be shut down entirely, so it will boot up from ROM when woken up. At 28nm and 20nm, leakage will be the dominant power factor.
  • metafor - Thursday, October 20, 2011 - link

    You'd be surprised. Most modern OSes (including Android) have not only profiling but also API support for applications to poll for resources such as CPU. Most of the time, you won't see something like Pandora take the CPU up to 100% even if it could, in theory, burst-process a lot of data and then go to sleep.

    The problem is that a lot of things can indeed be done faster -- web browser rendering is one primary example of something that will hog as much CPU as it can.

    And there's not really a way for a user to specify "hey, I don't mind if the page renders slower, stop using so much power".
  • lancedal - Thursday, October 20, 2011 - link

    True, the OSes do profiling, but it is hardly accurate enough to guarantee real-time performance. Yes, the API would help, but not many applications, if any, specify the resources they need, as that would depend on the system. It's just not practical to require software developers to specify the resources needed given how many developers and applications we have today. Big guys like Pandora, sure, but not the millions of little guys out there.
    Deciding when to switch back and forth between the LITTLE and big core is hard because it's not free. It costs power and performance (latency). If you switch too often, you end up costing more power. The problem is that there are no fixed criteria for switching.
    If you have the little core handle "system tasks" and the big core handle applications (like Tegra 3), then it may work. However, that only helps standby power and won't do much to extend web browsing time.
