AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Computeby Ryan Smith on December 21, 2011 9:38 PM EST
AMD Graphics Core Next: Out With VLIW, In With SIMD
The fundamental issue moving forward is that VLIW designs are great for graphics; they are not so great for computing. However AMD has for all intents and purposes bet the company on GPU computing – their Fusion initiative isn’t just about putting a decent GPU right on die with a CPU, but then utilizing the radically different design attributes of a GPU to do the computational work that the CPU struggles at. So a GPU design that is great at graphics and poor at computing work simply isn’t sustainable for AMD’s future.
With AMD Graphics Core Next, VLIW is going away in favor of a non-VLIW SIMD design. In principal the two are similar – run lots of things in parallel – but there’s a world of difference in execution. Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).
Without getting unnecessarily deep into the differences between VLIW and non-VLIW (we’ll save that for another time), the difference in the architectures is about what VLIW does poorly for GPU computing purposes, and why a non-VLIW SIMD fixes it. The principal issue is that VLIW is hard to schedule ahead of time and there’s no dynamic scheduling during execution, and as a result the bulk of its weaknesses follow from that. As VLIW5 was a good fit for graphics, it was rather easy to efficiently compile and schedule shaders under those circumstances. With compute this isn’t always the case; there’s simply a wider range of things going on and it’s difficult to figure out what instructions will play nicely with each other. Only a handful of tasks such as brute force hashing thrive under this architecture.
Furthermore as VLIW lives and dies by the compiler, which means not only must the compiler be good, but that every compiler is good. This is an issue when it comes to expanding language support, as even with abstraction through intermediate languages you can still run into issues, including issues with a compiler producing intermediate code that the shader compiler can’t handle well.
Finally, the complexity of a VLIW instruction set also rears its head when it comes to optimizing and hand-tuning a program. Again this isn’t normally a problem for graphics, but it is for compute. The complex nature of VLIW makes it harder to disassemble and to debug, and in turn difficult to predict performance and to find and fix performance critical sections of the code. Ideally a coder should never have to work in assembly, but for HPC and other uses there is a good deal of performance to be gained by doing so and optimizing down to the single instruction.
AMD provided a short example of this in their presentation, showcasing the example output of their VLIW compiler and their new compiler for Graphics Core Next. Being a coder helps, but it’s not hard to see how contrived things are under VLIW.
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2
1 x: PREDGT ____, R0.x, R1.x
UPDATE_EXEC_MASK UPDATE PRED
01 JUMP ADDR(3)
2 x: SUB ____, R0.x, R1.x
3 x: MUL_e R2.x, PV2.x, R0.x
03 ELSE POP_CNT(1) ADDR(5)
4 x: SUB ____, R1.x, R0.x
5 x: MUL_e R2.x, PV4.x, R1.x
05 POP(1) ADDR(6)
// Registers r0 contains "a", r1 contains "b"
// Value is returned in r2
//a > b, establish VCC
s0,exec //Save current exec mask
exec,vcc,exec //Do "if"
//Branch if all lanes fail
//result = a - b
//result=result * a
exec,s0,exec //Do "else" (s0 & !exec)
s_cbranch_execz label1 //Branch if all lanes fail
//result = b - a
//result = result * b
exec,s0 //Restore exec mask
VLIW: it’s good for graphics, it’s often not as good for compute.
So what does AMD replace VLIW with? They replace it with a traditional SIMD vector processor. While elements of Cayman do not directly map to elements of Graphics Core Next (GCN), since we’ve already been talking about the SP we’ll talk about its closest replacement: the SIMD.
Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN.
As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not gone into fine detail yet of what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.
Post Your CommentPlease log in or sign up to comment.
View All Comments
StormyParis - Friday, June 17, 2011 - linkThank you for a very enlightening write up. Comments and questions:
1- please add a comma in there somewhere. I had to read the sentence 4 times to understand it (page 1=: "VLIW designs will never achieve perfect efficiency in this regard, but the farther off real world utilization is the weaker the benefits of VLIW."
2- When, if ever, will we vile users see any benefits ? I get the feeling that most apps are still not optimized well, if at all, for multicore/threading. Come to think of it, most don't even use most of the x86 extensions more recent than SSE2. Now we're talking of yet another x86 extension, that is not only AMD-specific, but very task-specific. Apart from a handful of labs doing GPU computing, and the usual Photoshop filters... i'm doubtful ?
MonkeyPaw - Friday, June 17, 2011 - linkI'm not an expert in this sort of design, but is AMD setting up this architecture to replace the x86 ALU? Bulldozer is already running 2 ALUs for every 1 FPU, which is promoting ALU-heavy software design. It may take a few revisions to meld them (or phase one out), but it certainly seems like that's a heterogeneous CPU in the end.
marc1000 - Friday, June 17, 2011 - linkthere is a slide (on Llano article, I believe) where AMD points this. yes, they want to completely merge them, and the ALU would be one of this mergind points.
A5 - Friday, June 17, 2011 - linkI think it'll be quite awhile before the monolithic cores dissolve into the heterogeneous architectures, mostly depending on how fine-grained the power gating can get. When it gets to the point where the CPU can selectively turn off components inside a given SIMD unit, I think we'll see someone go "Wait a minute, then why do we even have this big core anymore?" and it'll go away. 2018ish, maybe?
jamescox - Monday, June 20, 2011 - link
ALU is generally used to refer to a very simple unit that performs arithmetic, logic, and possibly bit shift operations on integers, not floating point values. The units labeled ALU in the GPU diagrams in the article may support some integer operations, but they mainly process 32-bit floating point values, and (IMO) should not be labeled as "ALUs". FPU would probably be more accurate, but I do not know what operations these units support and whether they include a native integer ALU or just convert to integers to FP.
I don't know what you would mean by ALU-heavy software design. Bulldozer has two integer execution cores per module. Each core is composed of 2 ALUs and 2 AGUs, not shared. It also has 2 128-bit floating point (FMA) units per module shared between the two threads. This isn't really much different than an intel hyper-threaded core. Intel has, I believe, 3 ALUs, 3 AGUs, and 2 FPUs per core which is shared between 2 threads. AMDs version of multi-threading just doesn't share as much hardware between threads, which may be better than Intel's HT (2-2-1 AMD vs 1.5-1.5-1 Intel ALU-AGU-FPU). Intel's version would allow a single thread to us all of the execution resources at once, if there is no competing thread. Sharing the FPU makes a lot of sense, since most code that runs on CPUs only uses the FPU intermittently. If the code uses FP more than intermittently, then it would be a candidate for vectorization, and execution on the GPU instead.
While AMDs next generation graphics hardware may be able to execute more general code compiled from a wider range of languages, it is not an x86 processor, and it can not replace the CPU. If you look at the diagram, it has a single scalar unit to handle non-vector code in each compute unit. It also has 64 units in the 4 vector arrays of each CU. If you actually tried to compile and run the kind of branch heavy, integer code that CPUs have to deal with on a CU, then it would probably run entirely and very, very slowly on that single scalar unit.
MrSpadge - Wednesday, June 22, 2011 - linkI think you've got the right idea with this being melted into a Bulldozer-like design. however, it wouldn't replace the x86 ALUs, which are highly-optimized for high clock speed and low latency execution, as well as excellent handling of branches etc.
No, it would rather replace or supplement a fat FPU shared between many "cores" (which, by then would basically mean ALUs + scheduling). Most tasks which requires massive fp number crunching can be executed well in parallel and therefore are suitable for execution on a GPU core. The question is just how to bond them together so that the software guys can actually use them..
Deleted - Thursday, December 22, 2011 - linkBasically, what we have here is a math coprocessor. Back in the day, Intel's x86 processors were very good (relatively speaking) at integer math, but choked on floating point math. So Intel created the 8087 to handle the floating point calculations while the CPU handled the integer calculations (obviously this wasn't exclusive to Intel, but I'm generalizing). Eventually, the floating point unit was merged onto the CPU, and programs began using them interchangeably.
What we have today is very similar. CPUs, even with their advanced FPUs, are nowhere near as powerful as the massively parallel monstrosities we use for graphics. Eventually, they will be merged onto the CPU, and used as readily for general floating point processing tasks as FPUs are currently.
And this is the point of Fusion: to fully replace the aging floating point unit with an IGP.
A5 - Friday, June 17, 2011 - linkThe benefits to home or enthusiast users of heterogeneous CPUs are still several years off. We need market penetration of hardware along with fundamental changes in software development models and smarter compilers.
nedwards - Tuesday, January 28, 2014 - linkSmarter programmers would help! Let me rephrase that. Programmers thinking in a parallel mindset would help!
Beenthere - Friday, June 17, 2011 - linkIf AMD delivers in a timely manner they will have a bright future. This looks like a huge technological transition and I understand the need to get developers onboard now but it also tips AMD's hand to Intel who will steal any ideas that they can.
Unfortunately we are still waiting for most applications to be written for 64-bit use so I'm not holding out much hope for an expeditious migration on a complex technological transition though it does appear that maybe AMD has been working on this for some time and may be able to do a better job of executing with Trinity and future products. Time will tell but I hope AMD delivers on time and they will definitely get my dime - all of them.