David Kanter
Updated: 09-08-2008
Introduction
Over the last 10 years, an interesting trend in computing has emerged. General purpose CPUs,
such as those provided by Intel, IBM, Sun, AMD and Fujitsu have increased performance
substantially, but nowhere near the increases seen in the late 1980s and early 1990s. To a large
extent, single threaded performance increases have tapered off due to the low IPC in general
purpose workloads and the ‘power wall’ – the physical limits of power dissipation for integrated
circuits (ignoring for the moment exotic techniques such as IBM’s Multi-Chip Modules). The
additional millions and billions of transistors afforded by Moore’s Law are simply not very
productive for single threaded code and a great many are used for caches, which at least keeps
power consumption at reasonable levels.
At the same time, the GPU – which was once a special purpose parallel processor – has been able
to use ever increasing transistor budgets effectively, geometrically increasing rendering
performance over time since rendering is an inherently parallel application. As the GPU has grown
more and more computationally capable, it has also matured from an assortment of fixed function
units to a much more powerful and expressive collection of general purpose computational
resources, with some fixed function units on the side. Some of the first signs were when DirectX 9
(DX9) GPUs such as ATI’s R300 and the NVIDIA NV30 added support for limited floating point
arithmetic, or programmable pixel and vertex shaders in the DX8 generation. The obvious
watershed moment was the first generation of DirectX 10 GPUs, which required a unified
computational architecture instead of special purpose shader processors that operated on different
data types (pixels and vertices primarily). A more subtle turning point (or perhaps a moment of
foreshadowing) was when AMD acquired ATI – many people did not quite realize the motivation
was more complicated than simply competing with Intel on a platform level, but in any case,
DX10 made everything quite clear.
The first generation of high performance DX10 GPUs – the R600 from ATI and the G80 from
NVIDIA – offered the superior power of GPUs, with hundreds of functional units for a specific set
of data parallel problems that previously had only been run on CPUs. The emphasis here is on a
specific set of problems, as these initial GPUs were only appropriate for extremely data parallel
problems that used array-like data structures, with limited double precision needs. While these
GPUs were mostly IEEE 754 compliant for 32-bit floating point math, they lacked denormal
handling and omitted several rounding modes.
The result is that the computational world is suddenly more complex. Not only are there CPUs of
every type and variety, there are also now GPUs for data parallel workloads. Just as the
computational power of these products varies, so does the programmability and the range of
workloads for which they are suitable. Parallel computing devices such as GPUs, Cell and Niagara
tend to be hit or miss – they are all hopeless for any single threaded application and frequently are
poor performers for extremely branch-intensive, unpredictable and messy integer code, but for
sufficiently parallel problems they outperform the competition by factors of ten or a hundred.
Niagara and general purpose CPUs are more flexible, while GPUs are difficult to use with more
sophisticated data structures and the Cell processor is downright hostile to programmers.
Ironically, of the two GPU vendors, NVIDIA turned out to have the most comprehensive and
consistent approach to general purpose computation – despite the fact that (or perhaps because) ATI
was purchased by a CPU company. This article focuses exclusively on the computational aspects of
NVIDIA’s GPUs, specifically CUDA and the recently released GT200 GPU which is used across
the GeForce, Tesla and Quadro product lines. We will not delve into the intricacies of the modern
3D pipeline as represented by DX10 and OpenGL 2.1, except to note that these are alternative
programming models that can be mapped to CUDA.