POPPUR爱换

Title: NVIDIA G80 CUDA documentation and development tools released; performance comparison with other GPGPU languages added [Print this page]

Author: Edison    Time: 2007-2-16 16:22
Title: NVIDIA G80 CUDA documentation and development tools released; performance comparison with other GPGPU languages added
Documentation

[Download] CUDA Programming Guide Version 0.8 (.pdf)

[Download] CUDA Toolkit Version 0.8 Release Notes (.txt)

[Download] CUDA BLAS Library Version 0.8 Reference Documentation (.pdf)

[Download] CUDA FFT Library Version 0.8 Reference Documentation (.pdf)
Complete Install Packages Including Documentation

[Download] Installer for CUDA Toolkit Version 0.8 and CUDA SDK Version 0.8 for Linux X86 32-bit [Red Hat Enterprise Linux 4 (Nahant Update 3)]

[Download] NVIDIA Linux Display Driver Version 97.51 for CUDA Toolkit Version 0.8 (http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA-Linux-x86-1[1].0-9751-pkg1.run)

[Download] Installer for CUDA Toolkit Version 0.8 and CUDA SDK Version 0.8 for Windows XP (32-bit)

[Download] NVIDIA Windows Display Driver version 97.73 for CUDA Toolkit Version 0.8
NVIDIA CUDA
Revolutionary GPU Computing

NVIDIA® CUDA™ technology is a fundamentally new computing architecture that enables the GPU to solve complex computational problems in consumer, business, and technical applications. CUDA (Compute Unified Device Architecture) technology gives computationally intensive applications access to the tremendous processing power of NVIDIA graphics processing units (GPUs) through a revolutionary new programming interface. Providing orders of magnitude more performance and simplifying software development by using the standard C language, CUDA technology enables developers to create innovative solutions for data-intensive problems. For advanced research and language development, CUDA includes a low level assembly language layer and driver interface.
Developing with CUDA

The CUDA Toolkit is a complete software development solution for programming CUDA-enabled GPUs. The Toolkit includes standard FFT and BLAS libraries, a C-compiler for the NVIDIA GPU and a runtime driver. The CUDA runtime driver is a separate standalone driver that interoperates with OpenGL and Microsoft® DirectX® drivers from NVIDIA. CUDA technology is currently supported on the Linux and Microsoft® Windows® XP operating systems.
The CUDA Developer SDK provides examples with source code to help you get started with CUDA.
Technology Features

CUDA technology
GPU computing with CUDA technology is an innovative combination of computing features in next generation NVIDIA GPUs that are accessed through the standard ‘C’ language. Where previous generation GPUs were based on “streaming shader programs”, CUDA programmers use ‘C’ to create programs called kernels that use many threads to operate on large quantities of data in parallel. In contrast to multi-core CPUs, where only a few threads execute at the same time, NVIDIA GPUs featuring CUDA technology process thousands of threads simultaneously, enabling high computational throughput across large amounts of data.

GPGPU, or "General-Purpose Computation on GPUs", has traditionally required the use of a graphics API such as OpenGL, which presents the wrong abstraction for general-purpose parallel computation. Therefore, traditional GPGPU applications are difficult to write, debug, and optimize. NVIDIA GPU Computing with CUDA enables direct implementation of parallel computations in the C language using an API designed for general-purpose computation.

One of the most important innovations offered by CUDA technology is the ability for threads on NVIDIA GPUs to cooperate when solving a problem. By enabling threads to communicate, CUDA technology allows applications to operate more efficiently. NVIDIA GPUs featuring CUDA technology have an on-chip Parallel Data Cache that developers can use to store frequently used information directly on the GPU. Storing information on the GPU allows computing threads to instantly share information rather than wait for data from much slower, off-chip DRAMs. This advance in technology enables users to find the answers to complex computational problems much more quickly than using traditional architectures or GPGPU that is limited to graphics API-based GPU programming.
Why Use CUDA Technology?

Performance. NVIDIA GPUs offer incredible performance for data-intensive applications. CUDA technology provides a standard, widely available solution for delivering new applications with unprecedented capability.

Compatibility. Applications developed with the CUDA C-compiler are compatible with future generation GPUs from NVIDIA. Developers investing in GPU computing will immediately benefit from the performance of current GPUs and be confident in NVIDIA’s future investment in high performance technology for GPU computing.

Productivity. Developers wanting to tap into NVIDIA GPU computing power can now use the industry standard “C” language for software development. CUDA provides a complete development solution that integrates CPU and GPU software to enable developers to quickly provide new features and greater value for their customers.

Scalability. Applications developed with CUDA technology scale in performance and features across the full line of NVIDIA G8X and future GPUs from embedded form factors to high-performance professional graphics solutions using multiple GPUs. The power of CUDA performance is now available in virtually any system class, from cluster computing installations to consumer products.
Author: graphiccard    Time: 2007-2-16 16:23
Good stuff.
Gotta leave my name on an epic thread :lol:
Author: Edison    Time: 2007-2-16 16:23
The GeForce 8800 Series has the following characteristics:
# The maximum number of threads per block is 512;
# The maximum size of each dimension of a grid of thread blocks is 65535;
# The number of multiprocessors is:
    # 16 at 675 MHz for the GeForce 8800 GTX,
    # 12 at 600 MHz for the GeForce 8800 GTS;
# The amount of device memory is:
    # 768 MB for the GeForce 8800 GTX,
    # 640 MB for the GeForce 8800 GTS;
# The amount of shared memory available per multiprocessor is 16 KB divided into 16 banks (see Section 6.1.2.4);
# The amount of constant memory available is 64 KB with a cache working set of 8 KB per multiprocessor;
# The cache working set for 1D textures is 8 KB per multiprocessor;
# The warp size is 32 threads;
# Texture filtering weights are stored in 9-bit fixed point format with 8 bits of fractional value.
# In Beta version 0.8, the maximum observed bandwidth between system memory and device memory is 2 GB/s.

Each multiprocessor is composed of eight processors running at twice the clock frequencies mentioned above, so that a multiprocessor is able to process the 32 threads of a warp in two clock cycles.
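The same limits can be confirmed programmatically. Below is a minimal host-side sketch of my own (not part of the quoted documentation) using the CUDA runtime call cudaGetDeviceProperties to print the figures listed above for whichever device is installed; the expected G80 values are noted in the comments.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 (e.g. a GeForce 8800 GTX or GTS).
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    printf("Device:                  %s\n",  prop.name);
    printf("Multiprocessors:         %d\n",  prop.multiProcessorCount);   // 16 (8800 GTX) / 12 (8800 GTS)
    printf("Clock rate:              %d kHz\n", prop.clockRate);
    printf("Device memory:           %zu bytes\n", prop.totalGlobalMem);  // 768 MB / 640 MB
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock); // 16 KB (= per multiprocessor on G80)
    printf("Constant memory:         %zu bytes\n", prop.totalConstMem);   // 64 KB
    printf("Warp size:               %d threads\n", prop.warpSize);       // 32
    printf("Max threads per block:   %d\n",  prop.maxThreadsPerBlock);    // 512 on G80
    printf("Max grid dimension x:    %d\n",  prop.maxGridSize[0]);        // 65535 on G80
    return 0;
}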
Author: 贵族蓝翼    Time: 2007-2-16 16:25
Notice: this author has been banned or deleted; the content has been automatically hidden
Author: Edison    Time: 2007-2-16 16:26
Floating-Point Standard
The GeForce 8800 Series follows the IEEE-754 standard for single-precision binary floating-point arithmetic with the following deviations:

# Addition and multiplication are often combined into a single multiply-add instruction (FMAD);
# Division is implemented via the reciprocal in a non-standard-compliant way;
# Square root is implemented via the reciprocal square root in a non-standard-compliant way;
# For addition and multiplication, only round-to-nearest-even and round-towards-zero are supported via static rounding modes; directed rounding towards +/- infinity is not supported;
# There is no dynamically configurable rounding mode;
# Denormalized source operands are treated as zero;
# Underflowed results are flushed to zero;
# There is no mechanism for detecting that a floating-point exception has occurred and floating-point exceptions are always masked, but when an exception occurs the masked response is standard compliant;
# Signaling NaNs are not supported.
# The result of an operation involving one or more input NaNs is not one of the input NaNs, but a canonical NaN of bit pattern 0x7fffffff. Note that in accordance with the IEEE-754R standard, if one of the input parameters to min() or max() is NaN, but not the other, the result is the non-NaN parameter.

The conversion of a floating-point value to an integer value, in the case where the floating-point value falls outside the range of the integer format, is left undefined by IEEE-754. For the GeForce 8800 Series, the behavior is to clamp to the end of the supported range. This is unlike how the x86 architecture behaves.
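Two of these behaviors are easy to observe. The probe kernel below is a sketch of my own (names hypothetical, not from the documentation): it checks that fminf with one NaN operand returns the non-NaN value, and that an out-of-range float-to-int conversion clamps to INT_MAX on the device rather than producing 0x80000000 as x86 does.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp_probe(float *min_result, int *convert_result) {
    // Build the canonical NaN bit pattern mentioned above.
    float not_a_number = __int_as_float(0x7fffffff);
    // Per IEEE-754R, fminf(NaN, x) returns the non-NaN operand.
    *min_result = fminf(not_a_number, 3.0f);
    // Out-of-range float->int conversion clamps to the end of the int range on the device.
    *convert_result = (int)1.0e20f;
}

int main() {
    float *d_min;
    int   *d_cvt;
    cudaMalloc((void**)&d_min, sizeof(float));
    cudaMalloc((void**)&d_cvt, sizeof(int));
    fp_probe<<<1, 1>>>(d_min, d_cvt);

    float h_min;
    int   h_cvt;
    cudaMemcpy(&h_min, d_min, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_cvt, d_cvt, sizeof(int),   cudaMemcpyDeviceToHost);
    printf("fminf(NaN, 3.0f) = %f\n", h_min);   // expected: 3.000000
    printf("(int)1e20f       = %d\n", h_cvt);   // expected: 2147483647 (INT_MAX)
    cudaFree(d_min);
    cudaFree(d_cvt);
    return 0;
}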
Author: Edison    Time: 2007-2-16 16:28
Arithmetic Instruction Throughput

To issue one instruction for a warp, a multiprocessor takes:

# 2 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, type conversion instruction;

# 8 clock cycles for reciprocal, reciprocal square root, __log(x) (see Table A-2).

32-bit integer multiplication takes 8 clock cycles, but __mul24 and __umul24 (see Appendix A) provide signed and unsigned 24-bit integer multiplication in 2 clock cycles. Integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is a literal.
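As an illustration of that guideline, here is a small kernel sketch of my own (the kernel name and arithmetic are invented for the example): the divisor is a compile-time power of two, so the division and modulo become a shift and a mask, and the multiply uses __mul24.

__global__ void index_math(const int *in, int *out, int count) {
    const int n = 16;                           // power of two, log2(n) == 4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        int quotient  = i >> 4;                 // equivalent to i / 16
        int remainder = i & (n - 1);            // equivalent to i % 16
        // __mul24 multiplies the low 24 bits of its operands, which is safe
        // here as long as quotient stays below 2^24.
        out[i] = __mul24(quotient, n) + remainder + in[i];
    }
}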

Other functions take more clock cycles as they are implemented as combinations of several instructions.
Floating-point square root is implemented as a reciprocal square root followed by a reciprocal, so it takes 16 clock cycles for a warp.

Floating-point division takes 18 clock cycles, but __fdividef(x, y) provides a faster version at 10 clock cycles (see Appendix A).
__sin(x), __cos(x), __exp(x) take 16 clock cycles.


Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:
# Functions operating on char or short whose operands generally need to be converted to int,
# Double-precision floating-point constants (defined without any type suffix) used as input to single-precision floating-point computations,
# Single-precision floating-point variables used as input parameters to the double-precision version of the mathematical functions defined in Table A-1.
The last two cases can be avoided by using:
# Single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f,
# The single-precision version of the mathematical functions, defined with an f suffix as well, such as sinf(), logf(), expf().
For single precision code, we highly recommend use of the single precision math functions. When compiling for devices without native double precision support, the double precision math functions are by default mapped to their single precision equivalents. However, on those future devices that will support double precision, these functions will map to double precision implementations.
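To make the recommendation concrete, here is a short kernel sketch of my own (names hypothetical; in current toolkits the intrinsics are spelled __sinf, __expf and __logf): it uses f-suffixed constants, the single-precision math functions, and, on the second output, the faster but less accurate device intrinsics.

__global__ void transcendentals(const float *x, float *accurate, float *fast, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        float v = x[i] * 0.5f;                                  // 0.5f, not 0.5: avoids a double constant
        accurate[i] = sinf(v) + expf(v) + logf(v + 1.0f);       // single-precision math functions
        fast[i]     = __sinf(v) + __expf(v) + __logf(v + 1.0f); // intrinsics: fewer cycles, lower accuracy
    }
}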
Author: Edison    Time: 2007-2-16 16:30
Control Flow Instructions

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp.

To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in Section 3.2. A trivial example is when the controlling condition only depends on (threadIdx / WSIZE) where WSIZE is the warp size. In this case, no warp diverges since the controlling condition is perfectly aligned with the warps.
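For example, the following kernel sketch (my own illustration, not from the guide) branches on the warp index threadIdx.x / warpSize, so all 32 threads of a warp take the same path and no warp diverges; branching on threadIdx.x % 2 instead would make every warp diverge.

__global__ void warp_aligned_branch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int w = threadIdx.x / warpSize;        // identical for all threads of a warp
    if (w % 2 == 0) {
        data[i] *= 2.0f;                   // even-numbered warps take this path
    } else {
        data[i] += 1.0f;                   // odd-numbered warps take this path
    }
}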

Sometimes, the compiler may unroll loops or it may optimize out if or switch statements by using branch predication instead, as detailed below. In these cases, no warp can ever diverge.

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.

The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less than or equal to a certain threshold: if the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7, otherwise it is 4.


Memory Instructions

Memory instructions include any instruction that reads from or writes to shared or global memory. A multiprocessor takes 2 clock cycles to issue one memory instruction for a warp. When accessing global memory, there are, in addition, 200 to 300 clock cycles of memory latency.

As an example, the assignment operator in the following sample code:

__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];

takes 2 clock cycles to issue a read from global memory, 2 clock cycles to issue a write to shared memory, but above all 200 to 300 clock cycles to read a float from global memory.

Much of this global memory latency can be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete.
Author: lacri    Time: 2007-2-16 16:32
Not a programmer, just passing through~~
Author: Edison    Time: 2007-2-16 16:33
Synchronization Instruction

__syncthreads takes 2 clock cycles to issue for a warp if no thread has to wait for any other threads.

Memory Bandwidth

The effective bandwidth of each memory space depends significantly on the memory access pattern as detailed in the following sub-sections.

Since device memory is of much higher latency and lower bandwidth than on-chip memory, device memory accesses should be minimized. A typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:

# Load data from device memory to shared memory,
# Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were written by different threads,
# Process the data in shared memory,
# Synchronize again if necessary to make sure that shared memory has been updated with the results,
# Write the results back to device memory.
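A minimal kernel sketch of this staging pattern follows (the kernel name and the trivial neighbour-sum computation are my own illustration, assuming a launch with 256 threads per block):

__global__ void staged_sum(const float *in, float *out, int count) {
    __shared__ float tile[256];                 // one element per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load data from device memory to shared memory.
    tile[threadIdx.x] = (i < count) ? in[i] : 0.0f;

    // 2. Synchronize so each thread can safely read locations written by other threads.
    __syncthreads();

    // 3. Process the data in shared memory (here: add the next thread's element).
    float result = tile[threadIdx.x];
    if (threadIdx.x + 1 < blockDim.x)
        result += tile[threadIdx.x + 1];

    // 4./5. No further shared-memory update is needed, so write the result back to device memory.
    if (i < count)
        out[i] = result;
}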


Number of Threads per Block

For the GeForce 8800 Series, 64 threads per block is minimal and makes sense only if there are multiple concurrent blocks. 192 or 256 threads per block is better and usually allows for enough registers to compile.

The number of blocks per grid should be at least 100 if one wants it to scale to future devices; 1000 blocks will scale across several generations.
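Put together, a launch configuration following this advice might look like the host-side sketch below (my own illustration, reusing the staged_sum kernel sketched earlier): 256 threads per block and enough blocks to cover the data, which for any sizeable input easily exceeds the recommended 100 blocks.

void launch_staged_sum(const float *d_in, float *d_out, int count) {
    int threads_per_block = 256;
    int blocks_per_grid   = (count + threads_per_block - 1) / threads_per_block;
    staged_sum<<<blocks_per_grid, threads_per_block>>>(d_in, d_out, count);
}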
Author: iiiiuuuu    Time: 2007-2-17 10:51
Wasn't it supposed to support double-precision floating point?
Author: Edison    Time: 2007-5-8 10:06
32-bit integer multiplication takes 16 clock cycles, but __mul24 and __umul24 (see Appendix A) provide signed and unsigned 24-bit integer multiplication in 4 clock cycles.

This presumably means that the 32-bit integer multiply instructions execute on the SFUs, while the 24-bit integer multiply instructions execute on the SPs.
Author: Edison    Time: 2007-8-27 16:52
Platform           | 1k SGEMM | 1k 1D complex FFT
Peakstream (r520)  | 80.1     | 38.7
CUDA (g80)         | 95       | 43.4 (?? checking this)
RapidMind (g80)    | 24       | 7.5
RapidMind (r520)   | 26       | 4.9


http://www.gpgpu.org/forums/viewtopic.php?t=4207
Author: jocover    Time: 2007-8-27 17:19
Can't it be used under Vista?
Author: akcadia    Time: 2007-8-27 17:24
My English is so poor ....
Author: Prescott    Time: 2007-8-28 00:00
Without saying too much: how about first porting MM5, GRAPES or Fluent onto it and giving it a run?

Same for Cell. Just take something relatively simple like MM5 and run it; getting it to run at all is enough, never mind for now whether the performance reaches even a fraction of a general-purpose processor. :lol:
Author: Ricepig    Time: 2007-8-28 00:09
No double support, and no support for cards before the G80~~~

How sad
Author: Edison    Time: 2007-8-28 00:53
Originally posted by Ricepig at 2007-8-28 00:09
No double support, and no support for cards before the G80~~~

How sad


CUDA supports DP, but the current g8x parts lack that hardware feature, and I don't see why older products that don't conform to CUDA 1.0 at all should be supported.

Of course, as far as double precision goes, people have also tried computing to a certain precision on the GPU or Cell SPU first and then refining it with the CPU; this kind of pre-computation approach can achieve quite good double-precision performance on IEEE-754 GPUs (e.g. G80, SPU).
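The GPU-then-CPU refinement idea can be sketched with a toy example (entirely my own, not from the thread): the GPU computes a reciprocal in single precision, and the CPU then applies Newton iterations in double precision, each of which roughly doubles the number of correct bits.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void approx_reciprocal(const float *a, float *r) {
    // Single-precision approximation computed on the GPU (about 24 correct bits).
    *r = 1.0f / *a;
}

int main() {
    double a = 3.0;
    float h_a = (float)a, h_r;
    float *d_a, *d_r;
    cudaMalloc((void**)&d_a, sizeof(float));
    cudaMalloc((void**)&d_r, sizeof(float));
    cudaMemcpy(d_a, &h_a, sizeof(float), cudaMemcpyHostToDevice);
    approx_reciprocal<<<1, 1>>>(d_a, d_r);
    cudaMemcpy(&h_r, d_r, sizeof(float), cudaMemcpyDeviceToHost);

    // CPU refinement in double precision: x <- x * (2 - a*x).
    // Two steps lift the 24-bit seed to (nearly) full double accuracy.
    double x = (double)h_r;
    for (int step = 0; step < 2; ++step)
        x = x * (2.0 - a * x);

    printf("GPU single-precision 1/3: %.17f\n", (double)h_r);
    printf("CPU-refined 1/3:          %.17f\n", x);
    cudaFree(d_a);
    cudaFree(d_r);
    return 0;
}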
Author: Ricepig    Time: 2007-8-28 04:19
Originally posted by Edison at 2007-8-28 00:53


CUDA supports DP, but the current g8x parts lack that hardware feature, and I don't see why older products that don't conform to CUDA 1.0 at all should be supported.

Of course, as far as double precision goes, people have also tried computing to a certain precision on the GPU or Cell SPU first and then refining it with the CPU; this kind of pre-computation approach ...


CUDA 1.0 explicitly does not support double; the documentation says to wait for the next version, and there is no 64-bit floating-point type among CUDA's data types.

GPU first and then CPU has indeed been done, but it is rather limited: you cannot guarantee that for every kind of computation there is an iterative method that both converges quickly and is cheap for the CPU.
Author: Edison    Time: 2007-8-28 10:31
Originally posted by Ricepig at 2007-8-28 04:19
CUDA 1.0 explicitly does not support double; the documentation says to wait for the next version, and there is no 64-bit floating-point type among CUDA's data types.


From the CUDA 1.0 documentation:
For single precision code, we highly recommend use of the float type and the single precision math functions. When compiling for devices without native double precision support, such as devices of compute capability 1.x, the double type gets demoted to float by default and the double precision math functions are mapped to their single precision equivalents. However, on those future devices that will support double precision, these functions will map to double precision implementations

From the CUDA 0.8 documentation:

Q: Does CUDA support Double Precision Floating Point arithmetic?

A: CUDA supports the C "double" data type. However on G80
(e.g. GeForce 8800) GPUs, these types will get demoted to 32-bit
floats. NVIDIA GPUs supporting double precision in hardware will
become available in late 2007.

Judging from the statements above, I think the floating-point support CUDA currently provides actually does include fp64; it's just that when the compilation target is g80, these fp64 values get mapped to fp32.
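The quoted behaviour is easy to see in source form. The sketch below is my own illustration (not from the documentation): the kernel is written with the C double type; compiled for a device without native double-precision hardware such as G80, the arithmetic is demoted to float as described above, while on later double-precision devices the same code runs in fp64.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void harmonic_sum(double *sum) {
    double s = 0.0;
    for (int i = 1; i <= 10; ++i)
        s += 1.0 / i;           // double in the source; fp32 on G80, fp64 on DP-capable devices
    *sum = s;
}

int main() {
    double h_sum;
    double *d_sum;
    cudaMalloc((void**)&d_sum, sizeof(double));
    harmonic_sum<<<1, 1>>>(d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    // Comparing against the exact host-computed value reveals whether the
    // device really ran in double precision or fell back to single.
    printf("device harmonic sum = %.17f\n", h_sum);
    cudaFree(d_sum);
    return 0;
}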
Author: Ricepig    Time: 2007-8-28 14:44
Originally posted by Edison at 2007-8-28 10:31


在CUDA 1.0的文档中:
For single precision code, we highly recommend use of the float type and the single precision math functions. When compiling for devices without native double precision  ...


So it's a fake double; doesn't that still amount to no support?
Besides, what is the CUDA type that corresponds directly to double? I only seem to see float32.
Author: Edison    Time: 2007-8-28 14:55
It's stated clearly above: according to NVIDIA, CUDA at the software level supports C FP64. On the current G80 this FP64 gets mapped to FP32 and is processed entirely as FP32, but the future G9X HPC parts will run it entirely as FP64.
Author: Ricepig    Time: 2007-8-28 23:12
Originally posted by Edison at 2007-8-28 14:55
It's stated clearly above: according to NVIDIA, CUDA at the software level supports C FP64. On the current G80 this FP64 gets mapped to FP32 and is processed entirely as FP32, but the future G9X HPC parts will run it entirely as FP64.

Show me an FP64 type in CUDA then? I don't seem to be able to find one.
Author: 大姨妈    Time: 2007-8-28 23:58
My English isn't good enough; I can't follow this........................
Author: Edison    Time: 2007-8-29 01:12
Originally posted by Ricepig at 2007-8-28 23:12

Show me an FP64 type in CUDA then? I don't seem to be able to find one.

Install CUDA and just run double on a G80 directly; it will simply get mapped (by the driver) to fp32.

SC07 in November will have a session that Dave Luebke is helping with; there may be more news about G9x HPC then.




Welcome to POPPUR爱换 (https://we.poppur.com/) Powered by Discuz! X3.4