NVIDIA Fermi GF100 and GF1XX Architecture Discussion

RacingPHT (user deleted) · Posted 2009-11-9 13:09

Notice: the author has been banned or deleted; content automatically hidden.

knightmaster · Posted 2009-11-9 14:56

Mid-range DX11

Probably on the order of 256 SP / 256-bit.

At that size, outperforming RV870 should be no problem.

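melissa · Posted 2009-11-9 16:32

Mid-range DX11

Probably on the order of 256 SP / 256-bit.

At that size, outperforming RV870 should be no problem.
knightmaster posted on 2009-11-9 14:56

= =... I have my reservations about that, unless efficiency improves over GT200. The GTX 285 can't even beat the 5850 = =~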

knightmaster · Posted 2009-11-9 18:35

GT200's tragedy lies in its clock speed.

In terms of scale, it is no smaller than its competitor.

Edison · Posted 2009-11-13 12:39

Edison · Posted 2009-11-13 12:45

The documents all seem to be for CUDA 2.3.

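denev2004 · Posted 2009-11-14 21:29

GT200's tragedy lies in its clock speed.

In terms of scale, it is no smaller than its competitor.
knightmaster posted on 2009-11-9 18:35

Isn't the clock speed a tragedy precisely because the chip is too large?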

Edison · Posted 2009-11-17 00:12

The family of Tesla 20-series GPUs includes:

Tesla C2050 & C2070 GPU Computing Processors
Single GPU PCI-Express Gen-2 cards for workstation configurations
Up to 3GB and 6GB (respectively) on-board GDDR5 memory
Double precision performance in the range of 520GFlops - 630 GFlops
Tesla S2050 & S2070 GPU Computing Systems
Four Tesla GPUs in a 1U system product for cluster and datacenter deployments
Up to 12 GB and 24 GB (respectively) total system memory on board GDDR5 memory
Double precision performance in the range of 2.1 TFlops - 2.5 TFlops
The Tesla C2050 and C2070 products will retail for $2,499 and $3,999 and the Tesla S2050 and S2070 will retail for $12,995 and $18,995. Products will be available in Q2 2010. For more information about the new Tesla 20-series products, visit the Tesla product pages.
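
As a quick sanity check on the figures quoted above, the S-series numbers can be compared with four times the C-series numbers. This is only a minimal sketch: it assumes each GPU in an S2050/S2070 matches the corresponding C2050/C2070 board, which the announcement does not state explicitly, and the variable names are made up for illustration.

```python
# Figures quoted in the announcement above.
C_SERIES_DP_GFLOPS = (520, 630)   # per-GPU double-precision range
C_SERIES_MEMORY_GB = (3, 6)       # C2050, C2070
GPUS_PER_S_SYSTEM = 4             # "Four Tesla GPUs in a 1U system"


def scale_to_system(per_gpu_values, gpu_count=GPUS_PER_S_SYSTEM):
    """Multiply each per-GPU figure by the number of GPUs in the system.

    Assumption: every GPU in the system matches the single-card figure.
    """
    return tuple(value * gpu_count for value in per_gpu_values)


system_dp_gflops = scale_to_system(C_SERIES_DP_GFLOPS)
system_memory_gb = scale_to_system(C_SERIES_MEMORY_GB)

print(system_dp_gflops)  # (2080, 2520), roughly the quoted 2.1 to 2.5 TFlops
print(system_memory_gb)  # (12, 24), matching the quoted 12 GB and 24 GB
```

Under that assumption the system-level numbers are simply the single-card numbers multiplied by four.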

Editors’ note: As previously announced, the first Fermi-based consumer (GeForce®) products are expected to be available first quarter 2010.

http://www.nvidia.com/object/io_1258360868914.html

In addition, typical power draw is 190W, with a maximum of 225W.

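zjcr · Posted 2009-11-23 17:40

C++ is just a language for communication between people and computers; once the code is compiled, it has no direct connection to how the processor handles things.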

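denev2004 · Posted 2009-11-27 19:14

Replying to the post above: I think what he actually means is that it provides address space for C++ pointers and makes them more efficient.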

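Asuka · Posted 2009-11-28 09:24

Let me check something: for the Tesla 20 architecture mentioned on NVIDIA's official site, ECC is actually 7+1?

What's going on? Isn't it normally 8+1?

Can the data really be split up like that?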
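
One possible reading of the question above, offered as an assumption rather than a confirmed answer: "7+1" could mean that the ECC check bits are carved out of the same GDDR5 that holds the data, so one eighth of the board's memory is reserved, whereas the conventional "8+1" adds a ninth memory chip and leaves data capacity unchanged. The sketch below only works out the arithmetic under that reading; the function names are hypothetical.

```python
from fractions import Fraction


def usable_fraction(data_parts, ecc_parts):
    """Fraction of total storage left for data in a data+ECC split."""
    return Fraction(data_parts, data_parts + ecc_parts)


def usable_capacity_gb(total_gb, data_parts, ecc_parts):
    """Data capacity when ECC is carved out of a fixed total capacity."""
    return float(total_gb * usable_fraction(data_parts, ecc_parts))


# Assumed reading of "7+1": ECC is carved out of the existing memory.
inband = usable_fraction(7, 1)      # 7/8 of the memory holds data
# Conventional "8+1": a ninth chip is added, so 8/9 of a larger total is data.
sideband = usable_fraction(8, 1)

print(inband, sideband)             # 7/8 8/9
print(usable_capacity_gb(3, 7, 1))  # 2.625
print(usable_capacity_gb(6, 7, 1))  # 5.25
```

If this reading is right, a 3GB board would expose 2.625GB to the user with ECC enabled and a 6GB board 5.25GB; the data is not split differently, only less of the memory is available for it.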

Edison · Posted 2009-11-30 18:10

A report on atomic operation performance testing on GT200:

http://strobe.cc/articles/cuda_atomics/

Three memory access patterns will be tested. The first goes straight for the jugular: all writes across an SM go to the same address, ensuring that all atomic operations cause a conflict. Each SM gets its own address, though, because having all processors write to the same location caused several system crashes during testing. This is expected to be nearly the worst case for atomic operations, and the results do not disappoint:

Ick. Let’s not do that again.
The next access pattern is less pessimal; each memory location is separated by 128 bytes, and each thread gets its own memory location, ensuring that no conflicts occur but also preventing the chip from coalescing any memory operations.

Well, that’s… tolerable. It remains to be seen whether atomics can be used for scatters in computation threads, but this looks like it wouldn’t cause too much damage. One last access pattern: this time, all threads are neatly coalesced, each accessing a 4-byte memory location in order, such that a warp hits a single 256-byte-wide, 256-byte-aligned region of memory.

Crap. That’s quite a bit worse. Sure, the total latency for an atomic operation is better, but the ratio between an uncoalesced atomic and read-modify-write latency is much smaller than that for the coalesced pattern, so the relative cost of atomic operations in this context is much worse.
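
The three access patterns described in the excerpt can be sketched as plain address arithmetic. This is not the article's benchmark code: the warp size of 32, the 128-byte segment granularity, and all function names are assumptions made for illustration, and the sketch only counts conflicts and segments rather than measuring latency.

```python
from collections import Counter

WARP_SIZE = 32        # threads per warp (assumed)
WORD_BYTES = 4        # each atomic targets a 4-byte word
STRIDE_BYTES = 128    # spacing used by the second pattern
SEGMENT_BYTES = 128   # hypothetical memory-transaction granularity


def same_address(thread_ids, sm_id=0):
    """Pattern 1: every thread on an SM hits that SM's single address."""
    return [sm_id * STRIDE_BYTES for _ in thread_ids]


def strided(thread_ids):
    """Pattern 2: one private address per thread, 128 bytes apart."""
    return [t * STRIDE_BYTES for t in thread_ids]


def coalesced(thread_ids):
    """Pattern 3: consecutive threads hit consecutive 4-byte words."""
    return [t * WORD_BYTES for t in thread_ids]


def summarize(addresses):
    """Return (worst-case threads per address, memory segments touched)."""
    max_conflict = max(Counter(addresses).values())
    segments = len({a // SEGMENT_BYTES for a in addresses})
    return max_conflict, segments


warp = range(WARP_SIZE)
for name, pattern in [("same", same_address),
                      ("strided", strided),
                      ("coalesced", coalesced)]:
    print(name, summarize(pattern(warp)))
```

With these assumptions, the first pattern puts all 32 threads of a warp on one address, the second gives every thread its own address in its own segment, and the third gives every thread its own address inside a single segment. Changing `SEGMENT_BYTES` changes the segment counts but not the conflict counts.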

zxjike (user deleted) · Posted 2009-12-1 16:34

Notice: the author has been banned or deleted; content automatically hidden.

frankexem · Posted 2009-12-11 16:35

Here, C++ has practically become managed code.

nonolaw · Posted 2009-12-16 09:12

The technical documents seem to describe a big leap; I hope the theory translates into practice.

胡小华 · Posted 2009-12-16 16:02

hd4770:
I don't know the details of NV's implementation, or whether there are other optimizations, for example whether an atomic operation causes a thread switch so that this time can be hidden.
But under heavy access, that is what it means.
RacingPHT posted on 2009-10-5 09:42

Thanks for clarifying it. From the published document, we can see that threads are dynamically loaded onto SPs. So yes, if tons of threads are active, the stall of a few threads due to atomic operation conflicts can easily be hidden by other threads. I think the importance of atomics is that they theoretically allow a big task to be broken down into small pieces. Although it sounds odd for multiple pieces to work on the same address, it cannot be ruled out. Today's operating systems tend to break tasks into thousands of small pieces, which could be sped up by GPUs. Anyhow, your observation is very interesting.

Edison · Posted 2009-12-20 22:24

Multi-node setups should see some benefit.

Edison · Posted 2009-12-20 22:25

lemonninja · Posted 2009-12-21 11:50

Reply to #77 Edison

Any leaks yet?

zhj02002 · Posted 2009-12-30 09:48

Learned something~~~~~~
