The CPU: A Dual-Core ARM Cortex A9


NVIDIA is a traditional ARM core licensee, which means it implements ARM’s own design rather than taking the instruction set and designing its own core around it (ala Qualcomm). 


The Tegra 2 SoC has a pair of ARM Cortex A9s running at up to 1GHz. The cores are clock gated but not power gated. Clock speed is dynamic and can be adjusted at a very granular level depending on load. Both cores operate on the same power plane.

Architecturally, the Cortex A9 isn’t very different from the Cortex A8. The main change is a move from a dual-issue in order architecture with the A8 to a dual-issue out-of-order architecture with the A9. 


With its OoO architecture, the A9 adds a re-order buffer and 16 more general purpose registers over the A8 for register renaming. Cortex A9 can reorder around write after read and write after write hazards.


ARM’s Cortex A9 MPcore architecture supports up to four cores behind a single shared L2 cache (at up to 1MB in size). Tegra 2 implements a full 1MB shared L2 cache and two Cortex A9 cores. Each core has a 64KB L1 cache (32KB instruction + 32KB data cache).

Pipeline depth is another major change between A8 and A9. While the Cortex A8 had a 13-cycle branch mispredict penalty, A9 shortens the pipeline to 8 cycles. The shallower pipeline improves IPC and reduces power consumption. Through process technology hitting 1GHz isn’t a problem at TSMC 40nm. 


From what I can tell, branch prediction, TLBs and execution paths haven’t changed between A8 and A9 although I’m still awaiting further details from ARM on this. 


NVIDIA is claiming the end result is a 20% increase in IPC between A8 and A9. That’s actually a bit lower than I’d expect, but combined with the move to dual core you should see a significant increase in performance compared to current Snapdragon and A8 based devices.

If on single threaded workloads the best performance improvement we see is 20%, Qualcomm’s dual-core 1.2GHz Snapdragon due out later this year could still be performance competitive.


While all Cortex A8 designs incorporated ARM’s SIMD engine called NEON, A9 gives you the option of integrating either a SIMD engine (ARM’s Media Processing Engine, aka NEON) or a non-vector floating point unit (VFPv3-D16). NVIDIA chose not to include the A9’s MPE and instead opted for the FPU. Unlike the A8’s FPU, in the A9 the FPU is fully pipelined - so performance is much improved. The A9’s FPU however is still not as quick at math as the optional SIMD MPE. 


Minimum Instruction Latencies (Single Precision)
ARM Cortex A8 (FPU) 9 cycles 9 cycles 10 cycles 18 cycles 20 cycles 19 cycles
ARM Cortex A9 (FPU) 4 cycles 4 cycles 5 cycles 8 cycles 15 cycles 17 cycles
ARM Cortex A8 (NEON) 1 cycle 1 cycle 1 cycle 1 cycle N/A N/A
ARM Cortex A9 (MPE/NEON) 1 cycle 1 cycle 1 cycle 1 cycle 10 cycles 13 cycles


NVIDIA claims implementing MPE would incur a 30% die penalty for a performance improvement that impacts only a minimal amount of code. It admits that at some point integrating a SIMD engine makes sense, just not yet. The table above shows a comparison of instruction latency on various floating point and SIMD engines in A8 and A9.


TI’s OMAP 4 on the other hand will integrate ARM’s Cortex A9 MPE. Depending on the code being run, OMAP 4 could have a significant performance advantage in some cases.

Introduction The GeForce ULV GPU
Comments Locked


View All Comments

  • DigitalFreak - Wednesday, January 5, 2011 - link

    Anand or Brian - any idea what carriers the 2x will be available on in the US?
  • therealnickdanger - Wednesday, January 5, 2011 - link

    While I don't know anything officially (I don't think anyone does yet), LG brought its current Optimus line to all carriers. I would imagine that they will do the same with the 2X - assuming one carrier seeking exclusivity doesn't dump a pile of cash on LG's doorstep.

    If the LG Optimus (S, T, One) are any indication of the build quality of the 2X, then I will probably be in line to buy one as soon as it is made available on Sprint. I love my Optimus S.
  • Cali3350 - Thursday, January 6, 2011 - link

    Probably the iPhone 4. When it comes to being smooth iOS is untouched at this point in time (probably because everything is GPU accelerated).
  • Cali3350 - Thursday, January 6, 2011 - link

    Clearly posted in response to the wrong item, sorry.
  • metafor - Wednesday, January 5, 2011 - link

    I think those are "throughput" numbers, not latency numbers. The technical reference manual:

    states "cycles" definition is merely the minimum number of cycles it takes to issue, not actually execute the instruction.

    VADD, for instance, takes 4 cycles for VFP (scalar) and 3 cycles for NEON/ASE (6 to writeback).
  • Cali3350 - Wednesday, January 5, 2011 - link

    Have to say Im pretty wildly disappointed with how you guys seem to be mentioning it will be several months before this thing launches.

    I am desperately waiting for a new phone right now, but want the new tech. The HTC Mecha looks like it will be a killer phone but will still be running the same old Snapdragon HTC has used since the Incredible. I was really hoping either the Optimus or the Motorola Tegra 2 phone would be out by end of January/February, but that seems like its DEF not happening after reading this. That is all sorts of disappointment.
  • aebiv - Wednesday, January 5, 2011 - link

    You mean the HD2 and the Nexus One... Incredible and EVO were late to the Snapdragon game.
  • sirsoffrito - Thursday, January 6, 2011 - link

    I have a Droid Incredible. I beg to differ.
  • solipsism - Wednesday, January 5, 2011 - link

    Is the die size different from the Cortex-A8? I’m wondering how this could affect placement in other smartphones.
  • solipsism - Wednesday, January 5, 2011 - link

    Er, I mean in the dual-core variety.

Log in

Don't have an account? Sign up now