David Kanter
Updated: 09-08-2008
Introduction
Over the last 10 years, an interesting trend in computing has emerged. General purpose CPUs,
such as those provided by Intel, IBM, Sun, AMD and Fujitsu have increased performance
substantially, but nowhere near the increases seen in the late 1980s and early 1990s. To a large
extent, single threaded performance increases have tapered off due to the low IPC in general
purpose workloads and the ‘power wall’ – the physical limits of power dissipation for integrated
circuits (ignoring for the moment exotic techniques such as IBM’s Multi-Chip Modules). The
additional millions and billions of transistors afforded by Moore’s Law are simply not very
productive for single threaded code and a great many are used for caches, which at least keeps
power consumption at reasonable levels.
At the same time, the GPU – which was once a special purpose parallel processor – has been able
to use ever increasing transistor budgets effectively, geometrically increasing rendering
performance over time since rendering is an inherently parallel application. As the GPU has grown
more and more computationally capable, it has also matured from an assortment of fixed function
units to a much more powerful and expressive collection of general purpose computational
resources, with some fixed function units on the side. Some of the first signs were when DirectX 9
(DX9) GPUs such as ATI’s R300 and the NVIDIA NV30 added support for limited floating point
arithmetic, or programmable pixel and vertex shaders in the DX8 generation. The obvious
watershed moment was the first generation of DirectX 10 GPUs, which required a unified
computational architecture instead of special purpose shader processors that operated on different
data types (pixels and vertices primarily). A more subtle turning point (or perhaps a moment of
foreshadowing) was when AMD acquired ATI – many people did not quite realize the motivation
was more complicated than simply competing with Intel on a platform level, but in any case,
DX10 made everything quite clear.
The first generation of high performance DX10 GPUs – the R600 from ATI and the G80 from
NVIDIA – offered the superior power of GPUs, with hundreds of functional units for a specific set
of data parallel problems that previously had only been run on CPUs. The emphasis here is on a
specific set of problems, as these initial GPUs were only appropriate for extremely data parallel
problems that used array-like data structures, with limited double precision needs. While these
GPUs were mostly IEEE 754 compliant for 32-bit floating point math, they lacked denormal
handling and omitted several rounding modes.
The result is that the computational world is suddenly more complex. Not only are there CPUs of
every type and variety, there are also now GPUs for data parallel workloads. Just as the
computational power of these products varies, so does the programmability and the range of
workloads for which they are suitable. Parallel computing devices such as GPUs, Cell and Niagara
tend to be hit or miss – they are all hopeless for any single threaded application and frequently are
poor performers for extremely branch-intensive, unpredictable and messy integer code, but for
sufficiently parallel problems they outperform the competition by factors of ten or a hundred.
Niagara and general purpose CPUs are more flexible, while GPUs are difficult to use with more
sophisticated data structures and the Cell processor is downright hostile to programmers.
Ironically, of the two GPU vendors, NVIDIA turned out to have the most comprehensive and
consistent approach to general purpose computation – despite the fact that (or perhaps because) ATI
was purchased by a CPU company. This article focuses exclusively on the computational aspects of
NVIDIA’s GPUs, specifically CUDA and the recently released GT200 GPU which is used across
the GeForce, Tesla and Quadro product lines. We will not delve into the intricacies of the modern
3D pipeline as represented by DX10 and OpenGL 2.1, except to note that these are alternative
programming models that can be mapped to CUDA.