Math Libraries Performance Improvements, including:
* Significant improvements in double-precision FFT performance on Fermi-architecture GPUs for 2^n transform sizes
* Streaming API now supported in CUBLAS for overlapping copy and compute operations (a sketch of the general overlap pattern follows this list)
* Real-to-complex (R2C) and complex-to-real (C2R) optimizations for 2^n data sizes
* Improved performance for GEMV and SYMV subroutines in CUBLAS
* Optimized double-precision implementations of divide and reciprocal routines
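The CUBLAS stream support builds on the same CUDA streams mechanism exposed by the runtime API. A minimal sketch of the copy/compute overlap pattern it enables, with a hypothetical process() kernel standing in for a CUBLAS routine and the host buffer assumed to be page-locked so the copies can be asynchronous:

    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for a CUBLAS routine launched in a stream.
    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Split the work in two halves: the host-to-device copy of one half
    // can overlap with the kernel working on the other half.
    void overlapCopyAndCompute(const float *h_in, float *d_buf, int n)
    {
        cudaStream_t s[2];
        int half = n / 2;
        for (int i = 0; i < 2; ++i)
            cudaStreamCreate(&s[i]);
        for (int i = 0; i < 2; ++i) {
            // h_in must come from cudaMallocHost() for the copy to be truly asynchronous.
            cudaMemcpyAsync(d_buf + i * half, h_in + i * half,
                            half * sizeof(float), cudaMemcpyHostToDevice, s[i]);
            process<<<(half + 255) / 256, 256, 0, s[i]>>>(d_buf + i * half, half);
        }
        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(s[i]);
            cudaStreamDestroy(s[i]);
        }
    }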
Unified Visual Profiler now supports both CUDA C/C++ and OpenCL, with:
* Support for start/stop profiling at runtime so you can focus on critical areas of long-running applications
* Support for CUDA Driver API tracing
Additional support for Fermi-architecture GPUs
* Significant performance improvement in the erfinvf() function
* 16-way kernel concurrency (see the streams sketch after this list)
* Support for printf() in device code
* cuda-memcheck updated for Fermi-architecture GPUs
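As a sketch of what 16-way concurrency looks like from the host side: kernels launched into different non-default streams are eligible to run concurrently on Fermi, provided each one leaves enough resources free for the others. The smallKernel() below is a hypothetical light-weight kernel used only to illustrate the launch pattern.

    #include <cuda_runtime.h>

    __global__ void smallKernel(float *out, int n)   // hypothetical light-weight kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = sqrtf((float)i);
    }

    int main()
    {
        const int numStreams = 16, n = 1024;
        cudaStream_t streams[numStreams];
        float *d_out;
        cudaMalloc((void**)&d_out, numStreams * n * sizeof(float));

        // One launch per stream; on Fermi up to 16 of these can execute concurrently.
        for (int i = 0; i < numStreams; ++i) {
            cudaStreamCreate(&streams[i]);
            smallKernel<<<4, 256, 0, streams[i]>>>(d_out + i * n, n);
        }
        cudaThreadSynchronize();   // wait for all streams to drain
        for (int i = 0; i < numStreams; ++i)
            cudaStreamDestroy(streams[i]);
        cudaFree(d_out);
        return 0;
    }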
Driver/Runtime interoperability allows mixing of CUDA C Runtime (and math libraries) with CUDA Driver API
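A minimal sketch of what this interoperability allows, assuming that runtime API calls now attach to whatever driver API context is current on the calling host thread:

    #include <cuda.h>          // driver API
    #include <cuda_runtime.h>  // runtime API

    int main()
    {
        // Create the context explicitly with the driver API...
        CUdevice dev;
        CUcontext ctx;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        // ...then use runtime API calls (and the libraries built on it) in that same context.
        float *d_data;
        cudaMalloc((void**)&d_data, 1024 * sizeof(float));
        cudaMemset(d_data, 0, 1024 * sizeof(float));
        cudaFree(d_data);

        cuCtxDestroy(ctx);
        return 0;
    }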
New and updated SDK code samples demonstrating how to use:
* Function pointers in CUDA C/C++ kernels
* OpenCL / Direct3D buffer sharing
* Hidden Markov Model in OpenCL
* Microsoft Excel GPGPU example showing how to run an Excel function on the GPU
Note that this limited Beta release includes driver packages for Linux, MacOS, and Windows TCC (Tesla Compute Cluster) only. Standard Windows driver packages with graphics drivers and support for all NVIDIA GPUs will be available next month with the CUDA Toolkit 3.1 production release. In addition, Linux developers should note that the cuda-gdb hardware debugger was not ready for this beta release, but will be included in the production release. Windows developers should be sure to check out the new debugging features in Parallel Nsight for Visual Studio at www.nvidia.com/nsight.
NVIDIA has released a beta version of the CUDA 3.1 toolkit to registered developers.
New features from the Programming Guide:
16-bit float textures are now supported by the runtime API, and the __float2half_rn() and __half2float() intrinsics have been added (Table C-3).
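The conversions are device-side intrinsics operating on the 16-bit storage format held in an unsigned short; a minimal sketch of a round trip:

    // Convert float data to 16-bit half storage and back inside a kernel (sketch).
    __global__ void floatHalfRoundTrip(const float *in, unsigned short *half_out,
                                       float *roundtrip, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned short h = __float2half_rn(in[i]);  // round-to-nearest-even conversion
            half_out[i]  = h;                           // 16-bit storage format
            roundtrip[i] = __half2float(h);             // back to float for arithmetic
        }
    }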
The surface memory interface is now exposed in the runtime API (Sections 3.2.5, B.9), providing read/write access to textures (CUDA arrays), though it is still limited to 1D and 2D arrays for now.
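A sketch of a surface write to a 2D CUDA array, assuming the runtime surface-reference API (note that the x coordinate passed to surf2Dwrite() is a byte offset) and an array allocated with the cudaArraySurfaceLoadStore flag:

    surface<void, 2> outSurf;   // bound on the host with cudaBindSurfaceToArray()

    __global__ void fillSurface(int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            float value = (float)(x + y);
            surf2Dwrite(value, outSurf, x * sizeof(float), y);   // x is in bytes
        }
    }

    // Host side (sketch):
    //   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    //   cudaArray *arr;
    //   cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);
    //   cudaBindSurfaceToArray(outSurf, arr);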
Up to 16 kernels can now run concurrently on Fermi (it was only 4 in CUDA 3.0). It is not clear how this is actually implemented (one per SM? multiple per SM?).
Recursive calls are now supported in device functions on Fermi (B.1.4). Stack-size query and set functions have been added (cudaThreadGetLimit(), cudaThreadSetLimit()).
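A minimal sketch of device-side recursion (compiled with -arch=sm_20), with the per-thread stack raised and queried through the new limit functions:

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ int factorial(int n)              // recursion requires sm_20
    {
        return (n <= 1) ? 1 : n * factorial(n - 1);
    }

    __global__ void factorialKernel(int *out, int n)
    {
        *out = factorial(n);
    }

    int main()
    {
        size_t stack;
        cudaThreadSetLimit(cudaLimitStackSize, 4096);    // raise the per-thread stack for deep recursion
        cudaThreadGetLimit(&stack, cudaLimitStackSize);  // query it back
        printf("per-thread stack: %lu bytes\n", (unsigned long)stack);

        int *d_out, h_out;
        cudaMalloc((void**)&d_out, sizeof(int));
        factorialKernel<<<1, 1>>>(d_out, 10);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("10! = %d\n", h_out);
        cudaFree(d_out);
        return 0;
    }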
Function pointers to device functions are supported on Fermi (B.1.4); function pointers to global functions are supported on all GPUs.
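A sketch of selecting a __device__ function through a pointer at run time inside a kernel (the op names are made up; taking the address of a __device__ function from device code needs sm_20):

    typedef float (*unary_op_t)(float);

    __device__ float op_square(float x) { return x * x; }
    __device__ float op_negate(float x) { return -x; }

    __global__ void applyOp(float *data, int n, int which)
    {
        // Pick a device function at run time and call it indirectly (sm_20 only).
        unary_op_t op = (which == 0) ? op_square : op_negate;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = op(data[i]);
    }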
Just noticed that a __CUDA_ARCH__ macro, which allows writing different code paths depending on the target architecture (or for code executed on the host), has been available since CUDA 3.0 (B.1.4).
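For example, __CUDA_ARCH__ is defined only while device code is being compiled, and its value encodes the target (200 for sm_20), so a single __host__ __device__ function can carry several paths:

    __host__ __device__ float fastDivide(float a, float b)
    {
    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
        return a / b;              // device path for sm_2x
    #elif defined(__CUDA_ARCH__)
        return __fdividef(a, b);   // device path for sm_1x: faster, less accurate intrinsic
    #else
        return a / b;              // host path: __CUDA_ARCH__ is undefined on the host
    #endif
    }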
printf() support in kernels is now integrated into the API for sm_20 (B.14). Note that a cuPrintf supporting all architectures was provided to registered developers a few months ago.
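A trivial sketch of device-side printf() (sm_20 only); the output is flushed to the host at synchronization points such as cudaThreadSynchronize():

    #include <cstdio>

    __global__ void debugKernel(const float *data)
    {
        // One message per block to keep the output readable.
        if (threadIdx.x == 0)
            printf("block %d sees data[%d] = %f\n",
                   blockIdx.x, blockIdx.x * blockDim.x, data[blockIdx.x * blockDim.x]);
    }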
New __byte_perm(x,y,s) intrinsic (C.2.3).
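__byte_perm(x, y, s) assembles a 32-bit result from the eight bytes of x and y, each nibble of the selector s picking one source byte. With y unused, selector 0x0123 reverses the byte order of x, i.e. an endianness swap:

    __device__ unsigned int byteSwap32(unsigned int x)
    {
        return __byte_perm(x, 0, 0x0123);   // result bytes = source bytes 3,2,1,0
    }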
New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ qualifier was already available to force an actual function call on sm_1x.
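A small sketch of both qualifiers (the helper names are made up):

    __device__ __forceinline__ float lerp(float a, float b, float t)
    {
        return a + t * (b - a);     // always expanded inline, even where sm_2x
                                    // would otherwise emit a real call
    }

    __device__ __noinline__ float bigHelper(float x)
    {
        return x * x + 1.0f;        // kept as an out-of-line call where the target supports calls
    }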
New -dlcm compilation flag to specify the global memory caching strategy on Fermi (G.4.2).
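If I am reading this right, the flag is a ptxas option, so from nvcc it is passed through with -Xptxas (the file name below is just a placeholder):

    nvcc -arch=sm_20 -Xptxas -dlcm=cg mykernel.cu    # cache global loads in L2 only
    nvcc -arch=sm_20 -Xptxas -dlcm=ca mykernel.cu    # default: cache in both L1 and L2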
Interesting new stuff in the Fermi Compatibility Guide:
Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
Details on using the volatile keyword for intra-warp communication (Section 1.2.2).
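The classic use is the warp-synchronous tail of a shared-memory reduction: within a warp the threads execute in lockstep, so no __syncthreads() is needed, but the shared operand must be declared volatile so the compiler does not keep partial sums in registers. A sketch, assuming at least 64 active elements in sdata:

    __device__ void warpReduce(volatile float *sdata, int tid)   // tid < 32
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }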
Interesting new stuff in the Best Practices Guide:
Use signed integers instead of unsigned integers as loop counters. This allows the compiler to perform strength reduction and can give better performance (Section 6.3).
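A made-up kernel just to show the shape of the advice: because signed overflow is undefined, the compiler may assume a signed counter never wraps and can strength-reduce the index arithmetic, whereas an unsigned counter forces it to preserve wrap-around semantics.

    __global__ void scaleRows(float *data, int rows, int cols, float s)
    {
        // Signed loop counter preferred; assumes blockDim.x == cols.
        for (int i = 0; i < rows; ++i)
            data[i * cols + threadIdx.x] *= s;
    }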