NVIDIA Fermi GF100 and GF1XX Architecture Discussion

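Edison · posted 2009-10-4 00:53

The cost of Fermi's double-precision implementation is still not very clear. Looking at Cell, though: comparing Cell eDP with the original Cell, the cost of eDP is mainly a 10% increase in the SPEs, and Cell's DP implementation is similar to GT200's, i.e. it also uses dedicated DP units. Fermi's eDP is implemented by running the single-precision and double-precision units together, so I would guess the cost of eDP on Fermi should not be much higher (+10% per 8 SP?).

Of course, if there are n of these 10% increases, the added cost becomes quite considerable.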

hd4770 · posted 2009-10-4 05:49

7# voodoo12345
Natively supporting C++ means the CUDA compiler can compile C++ programs; on GT200 it can only compile C programs. That is the only difference. Optimization is not there initially. But with an open-source community, anyone can contribute to an optimized extension, so one day there might be a huge open library that provides optimized APIs. Video transcoding, Adobe PS4, Flash, MATLAB, etc. are current examples. The selling point of the GPU in the future would be like this: if you bought an SoC (junior CPU + senior GPU), you could speed up one of your existing or in-development C++ programs by calling a function in a C++ CUDA extension. If that gives you even a 2x speedup, that sounds valuable. Let alone future operating systems running faster on an SoC than on Intel only.

RacingPHT (user deleted) · posted 2009-10-4 21:51

Notice: the author has been banned or deleted; the content is automatically hidden.

hd4770 · posted 2009-10-5 07:19

The performance improvement to the atomic operators can reduce the cost of serialized execution when multiple SPs contend for a single address.
RacingPHT posted 2009-10-4 21:51

Can you elaborate on it? Here is my understanding of NV atomics: say threads A, B, and C are accessing address A0, and the order of execution of the 3 threads is random. If B accesses it first, then A and C are halted until it is done. So the best total execution time for the 3 threads is A's exec time + B's exec time + C's exec time. Is this close to what you meant here?
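
To make the question concrete, here is a minimal CPU-side sketch in C++. It is an analogy only: it is not CUDA code and says nothing about how NVIDIA's hardware implements atomics. Several threads perform an atomic add on one shared address; the helper name `contended_increment` and its parameters are hypothetical, chosen for illustration. Atomic read-modify-write operations on the same address take effect one at a time, which is the serialization being asked about, so no increment is lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `num_threads` threads each add 1 to the same
// address `iters_per_thread` times using an atomic read-modify-write.
// Returns the final value of the shared counter.
long contended_increment(int num_threads, int iters_per_thread) {
    assert(num_threads > 0 && iters_per_thread >= 0);
    std::atomic<long> counter{0};  // plays the role of address A0
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iters_per_thread] {
            for (int i = 0; i < iters_per_thread; ++i) {
                // Updates to one address are applied one after another,
                // so every increment is counted exactly once.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Calling `contended_increment(3, 1000)` returns 3000 regardless of which thread reaches the address first; what varies from run to run is only the order in which the updates land.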

RacingPHT (user deleted) · posted 2009-10-5 09:42

Notice: the author has been banned or deleted; the content is automatically hidden.

hd4770 · posted 2009-10-5 12:07

hd4770:
I don't know NV's specific implementation, or whether there are other optimizations, for example whether an atomic operation triggers a thread switch so that this time can be hidden.
But under heavy access, that is what I meant.
RacingPHT posted 2009-10-5 09:42

Thanks for clarifying it. From the published documents, we can see that threads are dynamically loaded onto SPs. So yes, if tons of threads are active there, the stall of a few threads due to atomic operation conflicts can easily be hidden by other threads. I think the importance of atomics is that they theoretically allow a big task to be broken down into small pieces. It sounds odd for multiple pieces to work on the same address, but it cannot be ruled out. Today's operating systems tend to break tasks into thousands of small pieces, which could be sped up by GPUs. Anyhow, your observation is very interesting.

RacingPHT (user deleted) · posted 2009-10-5 18:30

Notice: the author has been banned or deleted; the content is automatically hidden.

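ic.expert · posted 2009-10-5 23:17

Last edited by ic.expert on 2009-10-5 23:26

I very much agree with brother RacingPHT's view :>

Also, I suggest that in Chief Helmsman Chen's article, Append Buffer be translated as 附加缓冲区 ("append buffer"), because this thing is used for stream processing, buffering data between two kernels. Strictly speaking, Producer Buffer would be a more accurate name.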
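
As a rough illustration of the idea, and not taken from the article under discussion, an append buffer can be modelled on the CPU as fixed storage plus an atomic counter: each producer reserves a slot by atomically incrementing the counter, then writes into that slot. The sketch below is plain C++ rather than the Direct3D 11 or CUDA API, and `AppendBuffer`, `append`, `size`, and `produce` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical model of an append buffer: fixed storage plus an atomic
// counter that hands out the next free slot.
class AppendBuffer {
public:
    explicit AppendBuffer(std::size_t capacity) : data_(capacity), count_(0) {}

    // Reserve one slot and write `value` into it.
    // Returns false (and stores nothing) if the buffer is already full.
    bool append(int value) {
        std::size_t slot = count_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= data_.size()) return false;
        data_[slot] = value;
        return true;
    }

    // Number of items stored; call only after all producers have finished.
    std::size_t size() const {
        std::size_t n = count_.load();
        return n < data_.size() ? n : data_.size();
    }

    const std::vector<int>& data() const { return data_; }

private:
    std::vector<int> data_;
    std::atomic<std::size_t> count_;
};

// Hypothetical "producer kernel": each thread appends its own ids.
void produce(AppendBuffer& buf, int num_threads, int items_per_thread) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&buf, t, items_per_thread] {
            for (int i = 0; i < items_per_thread; ++i)
                buf.append(t * items_per_thread + i);
        });
    }
    for (auto& w : workers) w.join();
}
```

A second stage would then read `buf.data()` up to `buf.size()` once the producers are done. The order of items in the buffer is not deterministic, only the set of items is.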

Edison · posted 2009-10-5 23:57

This thing was previously used by the geometry shader, to generate new triangles.

RacingPHT (user deleted) · posted 2009-10-6 10:48

Notice: the author has been banned or deleted; the content is automatically hidden.

hd4770 · posted 2009-10-6 11:24

HD4770:
Since NV has emphasized the improved performance of atomic ops on the same address, there is reason to believe this operation is needed.

For example, in a producer-consumer pattern, some CUDA threads produce task packets while other CUDA threads consume them, and this situation may arise. Perhaps it needs ...
RacingPHT posted 2009-10-5 18:30

Agree. One obvious use of atomics is global sync, or what some would call syncBlock. Given blocks 0, 1, ..., N, thread 0 of each block atomically increments a counter, all threads in the block sync, and then every thread of the block polls that counter until all blocks have checked in.
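
The scheme described above can be sketched on the CPU as follows. This is a C++ analogy using `std::thread` and `std::atomic`, not CUDA code: `run_global_sync` is a hypothetical name, each thread stands in for one block (its thread 0), and the intra-block sync step is left out. Note that on a GPU a spin barrier like this can only complete if all participating blocks are resident at the same time; otherwise the waiting blocks would poll forever.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical one-shot global barrier built from a single atomic counter.
// Returns true if, after the barrier, every "block" saw the pre-barrier
// writes of all other "blocks".
bool run_global_sync(int num_blocks) {
    assert(num_blocks > 0);
    std::atomic<int> arrived{0};
    std::vector<int> before(num_blocks, 0);  // written before the barrier
    std::vector<int> seen(num_blocks, 0);    // filled in after the barrier

    std::vector<std::thread> blocks;
    for (int b = 0; b < num_blocks; ++b) {
        blocks.emplace_back([&, b] {
            before[b] = 1;                                    // phase 1 work
            arrived.fetch_add(1, std::memory_order_acq_rel);  // check in
            // Poll the counter until every block has checked in.
            while (arrived.load(std::memory_order_acquire) < num_blocks)
                std::this_thread::yield();
            // Phase 2: count how many phase 1 writes are visible.
            int sum = 0;
            for (int v : before) sum += v;
            seen[b] = sum;
        });
    }
    for (auto& t : blocks) t.join();

    for (int s : seen)
        if (s != num_blocks) return false;
    return true;
}
```

The counter here is used once; reusing the barrier would need a reset or a generation count, which this sketch does not attempt.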

Edison · posted 2009-10-7 00:33

http://www.realworldtech.com/for ... 103203&roomid=2

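Today in an RWT discussion thread I saw DK say that, according to AMD's CTO, RV770 can do CKE. In the discussion that followed, however, some argued that this only means CKE in the sense that the thread pool holds VS/PS plus compute kernels while execution is serial, not CKE in the sense of multiple compute kernels executing in parallel at the same moment.

At the low level, RV770 may well have some hidden hooks for this, but so far no substantive support has appeared in software. For now, a hardware feature without software support is effectively the same as not having it. I hope AMD can provide the necessary technical support for this soon, if it really can do CKE ^^.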

Eji · posted 2009-10-8 13:42

http://www.realworldtech.com/forums/index.cfm?action=detail&id=103306&threadid=103203&roomid=2

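Today in an RWT discussion thread I saw DK say that, according to AMD's CTO, RV770 can do CKE. In the discussion that followed, however, some argued that this ...
Edison posted 2009-10-7 00:33

Indeed, ATI's concurrent kernel execution capability is the biggest doubt at the moment....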

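Edison · posted 2009-10-9 02:25

It turns out that back in September, slides on the OLCF-3 supercomputer leaked out of OKA; some of the information in them looks rather interesting in hindsight: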

http://www.cisl.ucar.edu/dir/CAS2K9/Presentations/bland.pdf

sharko · posted 2009-10-9 16:40

Is this geared toward scientific computing? There seems to be nothing special on the graphics side.

Edison · posted 2009-10-9 16:59

No, it is just that the details of the graphics part have not been disclosed; following DX11 is not in question.

me210 · posted 2009-10-17 19:49

Learning~!

Edison · posted 2009-10-19 22:46

AMD's CKE may be implemented like this:

柏诚 · posted 2009-10-20 22:23

So many experts here.

SupperSix · posted 2009-10-21 13:45

I just want to know how much real-world gaming performance improvement the current architecture can bring.
