POPPUR爱换
Thread starter: Edison

NVIDIA Fermi GF100 and GF1XX Architecture Discussion

RacingPHT (account deleted)
61#
Posted 2009-11-9 13:09
Notice: the author has been banned or deleted; content automatically hidden.

62#
Posted 2009-11-9 14:56
Mid-range DX11.

It should be around 256 SP / 256-bit in scale.

At that scale, keeping RV870's performance in check shouldn't be a problem.

63#
Posted 2009-11-9 16:32
Quoting knightmaster (2009-11-9 14:56): Mid-range DX11. It should be around 256 SP / 256-bit in scale. At that scale, keeping RV870's performance in check shouldn't be a problem.

Hmm... I'd reserve judgment on that, unless per-unit efficiency improves over GT200. The GTX 285 can't even beat the HD 5850.

64#
Posted 2009-11-9 18:35
GT200's tragedy was its clock speed.

In terms of scale, it was no smaller than its competitor.

65#
Thread starter | Posted 2009-11-13 12:39
The CUDA 3.0 SDK beta is now available for public download:

Downloads
Getting Started - Linux
Getting Started - OS X
Getting Started - Windows

XP32 195.39
XP64 195.39
Vista/Win7 32 195.39
Vista/Win7 64 195.39

Notebook XP32 195.39
Notebook XP64 195.39
Notebook Vista/Win7 32 195.39
Notebook Vista/Win7 64 195.39

Linux 32 195.17
Linux 64 195.17

3.0.0 for Non-GT200 Leopard
3.0.1 for GT200 Leopard and Snow Leopard

CUDA Toolkit for Fedora 10 32-bit
CUDA Toolkit for RHEL 4.8 32-bit
CUDA Toolkit for RHEL 5.3 32-bit
CUDA Toolkit for SLED 11.0 32-bit
CUDA Toolkit for SuSE 11.1 32-bit
CUDA Toolkit for Ubuntu 9.04 32-bit

CUDA Toolkit for Fedora 10 64-bit
CUDA Toolkit for RHEL 4.8 64-bit
CUDA Toolkit for RHEL 5.3 64-bit
CUDA Toolkit for SLED 11.0 64-bit
CUDA Toolkit for SuSE 11.1 64-bit
CUDA Toolkit for Ubuntu 9.04 64-bit

CUDA Toolkit for OS X

CUDA Toolkit for Windows 32-bit
CUDA Toolkit for Windows 64-bit

CUDA Profiler 3.0 Beta Readme
CUDA Profiler 3.0 Beta Release Notes for Linux
CUDA Profiler 3.0 Beta Release Notes for OS X
CUDA Profiler 3.0 Beta Release Notes for Windows
CUDA Toolkit EULA
CUDA-GDB Readme
CUDA-GDB User Manual
CUDA Reference Manual
CUDA Toolkit Release Notes for Linux
CUDA Toolkit Release Notes for OS X
CUDA Toolkit Release Notes for Windows
CUDA Programming Guide
CUDA Best Practices Guide
Online Documentation

GPU Computing SDK for Linux
GPU Computing SDK for OS X
GPU Computing SDK for Win32
GPU Computing SDK for Win64

CUDA SDK Release Notes
DirectCompute Release Notes
OpenCL Release Notes
GPU Computing EULA

66#
Thread starter | Posted 2009-11-13 12:45
The documentation all seems to be for CUDA 2.3, though.

67#
Posted 2009-11-14 21:29
Quoting knightmaster (2009-11-9 18:35): GT200's tragedy was its clock speed. In terms of scale, it was no smaller than its competitor.

Was it precisely because the die was so big that the clocks ended up a tragedy?

68#
Thread starter | Posted 2009-11-17 00:12
The family of Tesla 20-series GPUs includes:

Tesla C2050 & C2070 GPU Computing Processors
- Single-GPU PCI-Express Gen-2 cards for workstation configurations
- Up to 3 GB and 6 GB (respectively) of on-board GDDR5 memory
- Double-precision performance in the range of 520-630 GFLOPS

Tesla S2050 & S2070 GPU Computing Systems
- Four Tesla GPUs in a 1U system for cluster and datacenter deployments
- Up to 12 GB and 24 GB (respectively) of total on-board GDDR5 memory
- Double-precision performance in the range of 2.1-2.5 TFLOPS

The Tesla C2050 and C2070 will retail for $2,499 and $3,999, and the Tesla S2050 and S2070 for $12,995 and $18,995. Products will be available in Q2 2010. For more information about the new Tesla 20-series products, visit the Tesla product pages.

Editors’ note: As previously announced, the first Fermi-based consumer (GeForce®) products are expected to be available first quarter 2010.

http://www.nvidia.com/object/io_1258360868914.html

Also: typical power draw is 190 W, with a maximum of 225 W.
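As a sanity check on those figures: the S-series range is just four of the C-series cards, and the per-card DP range is consistent with Fermi's announced shader count at plausible clocks. A minimal sketch of the arithmetic (the 512-core count, 2-flop FMA, and half-rate FP64 are assumptions from the Fermi announcement era, not from this post; the clocks are hypothetical back-calculations):

```python
def dp_gflops(cores, clock_ghz):
    # Peak FP64 GFLOPS under the assumed model: each FMA counts as
    # 2 flops, and FP64 runs at half the FP32 unit rate.
    return cores * clock_ghz * 2 * 0.5

# Four C-series GPUs per 1U S-series box, so the system range should be
# roughly 4x the quoted per-card range of 520-630 GFLOPS:
system_low = 4 * 520    # 2080 GFLOPS, ~2.1 TFLOPS
system_high = 4 * 630   # 2520 GFLOPS, = 2.5 TFLOPS

# With an assumed 512 cores, the quoted card range implies shader
# clocks of roughly 1.0-1.23 GHz:
print(dp_gflops(512, 1.23))  # ~630
```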

69#
Posted 2009-11-23 17:40
C++ is just a language for humans to communicate with computers; once compiled, it has no direct connection to how the processor executes things.

70#
Posted 2009-11-27 19:14
Replying to the post above: I think what he actually means is providing address space for C++ pointers, improving their efficiency.

71#
Posted 2009-11-28 09:24
Let me verify something: the Tesla 20 architecture described on NVIDIA's site really uses 7+1 ECC?

What's going on? Isn't it normally 8+1?

Can the data really be split up like that?
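One way to square "7+1" with the familiar "8+1": a standard 72-bit ECC DIMM word is 64 data bits plus 8 check bits, and those 8 decompose into 7 Hamming check bits plus 1 overall parity bit (SEC-DED). A hedged sketch of the check-bit count below; this is generic Hamming SEC-DED math, not NVIDIA's documented scheme:

```python
def secded_check_bits(data_bits):
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error
    # correction (Hamming); one extra overall-parity bit adds
    # double-error detection.
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r, r + 1  # (Hamming bits, total including the parity bit)

hamming, total = secded_check_bits(64)
print(hamming, total)  # 7 Hamming bits + 1 parity = 8, i.e. a 64+8 = 72-bit word
```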

72#
Thread starter | Posted 2009-11-30 18:10
An article benchmarking atomic-operation performance on GT200:

http://strobe.cc/articles/cuda_atomics/

Three memory access patterns will be tested. The first goes straight for the jugular: all writes across an SM go to the same address, ensuring that all atomic operations cause a conflict. Each SM gets its own address, though, because having all processors write to the same location caused several system crashes during testing. This is expected to be nearly the worst case for atomic operations, and the results do not disappoint:

Ick. Let’s not do that again.
The next access pattern is less pessimal; each memory location is separated by 128 bytes, and each thread gets its own memory location, ensuring that no conflicts occur but also preventing the chip from coalescing any memory operations.

Well, that’s… tolerable. It remains to be seen whether atomics can be used for scatters in computation threads, but this looks like it wouldn’t cause too much damage. One last access pattern: this time, all threads are neatly coalesced, each accessing a 4-byte memory location in order, such that a warp hits a single 256-byte-wide, 256-byte-aligned region of memory.

Crap. That’s quite a bit worse. Sure, the total latency for an atomic operation is better, but the ratio between an uncoalesced atomic and read-modify-write latency is much smaller than that for the coalesced pattern, so the relative cost of atomic operations in this context is much worse.
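The three access patterns can be sketched as address generators. This is purely an illustration of which byte addresses one warp's threads touch in each case (the warp size of 32 and 4-byte words are assumptions, not taken from the article's code):

```python
def warp_addresses(pattern, base=0, warp_size=32):
    # Byte addresses one warp's threads write atomically, per pattern.
    if pattern == "same":
        # Worst case: every thread hits one address, so every atomic conflicts.
        return [base] * warp_size
    if pattern == "strided":
        # One private address per thread, 128 bytes apart: no conflicts,
        # but nothing can be coalesced either.
        return [base + 128 * t for t in range(warp_size)]
    if pattern == "coalesced":
        # Consecutive 4-byte words: the whole warp lands in one small
        # aligned segment, the layout normal loads/stores coalesce best.
        return [base + 4 * t for t in range(warp_size)]
    raise ValueError(pattern)
```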

zxjike (account deleted)
73#
Posted 2009-12-1 16:34
Notice: the author has been banned or deleted; content automatically hidden.

74#
Posted 2009-12-11 16:35
So now even C++ has become managed code here?

75#
Posted 2009-12-16 09:12
The technical documentation seems like quite a leap; I hope the theory actually materializes in practice.

76#
Posted 2009-12-16 16:02
Quoting RacingPHT (2009-10-5 09:42): I don't know NVIDIA's exact implementation, or whether there are other optimizations; for instance, an atomic operation might trigger a thread switch, so its latency can be hidden. But under heavy access, yes, that's the idea.

Thanks for clarifying. From the published documents we can see that threads are dynamically loaded onto the SPs. So yes, if tons of threads are active, the stall of a few threads due to atomic-operation conflicts can easily be hidden by the others. I think the importance of atomics is that they theoretically allow a big task to be broken down into small pieces. It may sound odd for multiple pieces to work on the same address, but it can't be ruled out. Today's operating systems tend to break tasks into thousands of small pieces, which GPUs could speed up. In any case, your observation is very interesting.
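The latency-hiding argument in this exchange can be put into a toy model: while one warp stalls on a conflicting atomic, the scheduler runs the other resident warps, and the stall is fully hidden once their combined work covers it. A rough sketch (the cycle counts are made-up parameters, not measurements):

```python
def hidden_fraction(resident_warps, stall_cycles, work_cycles_per_warp):
    # Fraction of one warp's atomic stall that the other resident
    # warps' ready work can cover.
    other_work = (resident_warps - 1) * work_cycles_per_warp
    return min(1.0, other_work / stall_cycles)

# Few resident warps: most of the stall shows up as lost throughput.
print(hidden_fraction(4, 300, 10))   # 0.1
# Heavy oversubscription: the stall is fully hidden.
print(hidden_fraction(32, 300, 10))  # 1.0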

77#
Thread starter | Posted 2009-12-20 22:24
multi-node setups should see some benefit.

78#
Thread starter | Posted 2009-12-20 22:25

79#
Posted 2009-12-21 11:50
Reply to #77 Edison

Any leaks dropped yet?

80#
Posted 2009-12-30 09:48
Learned something, thanks~
