POPPUR爱换

Title: NVIDIA Fermi GF100 and GF1XX Architecture Discussion

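Author: Edison Time: 2009-10-2 16:24
Title: NVIDIA Fermi GF100 and GF1XX Architecture Discussion
A first overview of the Fermi architecture has been published, and related topics can be discussed in this thread. (The earlier thread was speculative and is now closed; you can review it here: http://we.pcinlife.com/thread-969042-1-1.html)

The architectural highlights of Fermi are:

Third-generation Streaming Multiprocessor (SM)
Each SM has 32 cores, 4 times as many as a GT200 SM.
Double-precision floating-point performance is 8 times that of GT200.
Dual Warp Scheduler, which schedules and dispatches two 32-thread warps every clock cycle.
Each SM has 64 kB of RAM configurable as shared memory and cache
Second-generation Parallel Thread Execution (PTX) ISA
Unified address space with full C++ support
Optimized for OpenCL and DirectCompute
Full IEEE 754-2008 32-bit/64-bit precision support
Improved performance through predication
Improved memory subsystem
NVIDIA Parallel DataCache (PDC) with configurable L1 cache and unified L2 cache
The first GPU with ECC memory support
Greatly improved atomic memory operation performance
NVIDIA GigaThread 3.0 engine
Application context switching 10 times faster than the previous generation
Concurrent execution of multiple kernels (a kernel is a function of the program that runs on the GPU)
Out-of-order thread block (CTA) execution
Dual overlapped memory transfer engines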

NVIDIA official whitepaper:

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiArchitectureWhitepaper.pdf

Article on this site:
http://www.pcinlife.com/article/graphics/2009-09-29/1254197427d834.html

Recommended reading, Realworldtech's report:
Inside Fermi: Nvidia's HPC Push

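Author: voodoo12345 Time: 2009-10-2 16:30
What is predication? Is it branch prediction?

Author: Edison Time: 2009-10-2 16:32
http://www.nvidia.com/content/PD ... ingArchitecture.pdf

According to this report, Fermi's DRAM ECC is implemented differently from the traditional CPU approach of adding one extra bit for every 8 bits; it uses a proprietary scheme: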

Fermi is the first GPU to provide ECC (error correcting code) protection for DRAM; the chip's register files, shared memories, L1 and L2 caches are also ECC protected. The level of protection is known as SECDED: single (bit) error correction, double error detection. SECDED is the usual level of protection in most ECC-equipped systems.

Fermi's ECC protection for DRAM is unique among GPUs; so is its implementation. Instead of each 64-bit memory channel carrying eight extra bits for ECC information, NVIDIA has a proprietary (and undisclosed) solution for packing the ECC bits into reserved lines of memory.

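Author: Edison Time: 2009-10-2 16:41

What is predication? Is it branch prediction?
Posted by voodoo12345 on 2009-10-2 16:30

Prediction and predication are two different things, although they look similar.

http://en.wikipedia.org/wiki/Branch_predication

http://en.wikipedia.org/wiki/Branch_prediction

With branch predication, both sides of the branch are executed. Once the direction is known (the predicate acts as a "sieve"), only the results from that direction are kept and the results from the other direction are discarded.

With branch prediction, the hardware guesses which direction will be taken and runs at full speed down that path. If the guess is right it wins; if it is wrong, the work has to be redone along the correct path.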
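
A minimal Python sketch of the predication idea described above, offered as an illustration rather than as how any GPU actually encodes it. The function names `branchy` and `predicated` are made up for this example. The predicated version computes both sides unconditionally and lets the predicate select one result.

```python
def branchy(x):
    # Ordinary control flow: only the chosen side is executed.
    if x >= 0:
        return x * 2
    return -x


def predicated(x):
    # Both sides are computed unconditionally.
    then_value = x * 2
    else_value = -x
    # The predicate is the "sieve": it keeps one result and discards the other.
    p = x >= 0
    return then_value if p else else_value


values = [-3, 0, 4]
results = [predicated(v) for v in values]
```

Both functions return the same values; the difference is only in how much work is done and whether a jump is needed.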

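Author: iphone Time: 2009-10-2 17:00
The launch of this Fermi100 shows the direction NV is heading, and this step is quite aggressive. Had they been conservative, they could have focused on consumer graphics and eaten into the market gradually. Given the current situation, though, this aggressive route looks like a very rocky road.

Author: Edison Time: 2009-10-2 17:26
This is the architecture discussion board, which focuses on architecture and related technical discussion. If you want to share views on market strategy, please post in the graphics card board below.

Author: voodoo12345 Time: 2009-10-2 19:23
A question about C++: does this GPU's full C++ support mean that developers can write C++ code for the GPU to improve development efficiency, or that the GPU speeds up C++ programs when users run them?

Author: westlee Time: 2009-10-2 20:45
Notice: the author has been banned or deleted; content automatically hidden

Author: Edison Time: 2009-10-2 21:01

A question about C++: does this GPU's full C++ support mean that developers can write C++ code for the GPU to improve development efficiency, or that the GPU speeds up C++ programs when users run them?
Posted by voodoo12345 on 2009-10-2 19:23

The code should have to be compiled by the CUDA compiler to PTX 2.0, which the driver then compiles to native code for execution.

For example, earlier GPUs had no real functions; every function call was inlined rather than being an actual call.

Fermi implements support for these features.

Author: 太虚公 Time: 2009-10-2 21:02
Is there any material in Chinese?

Author: Edison Time: 2009-10-2 21:05

Can the gaming performance gain match the transistor count? In other words, relative to the growth in transistor count, how much is the gaming performance gain discounted?
Posted by westlee on 2009-10-2 20:45

I have not seen more detailed gaming material yet, but I think it depends on the specific game. Many current games cannot benefit from the cache, but that does not mean games that use GPU computing cannot benefit from it.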

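Author: Prescott Time: 2009-10-2 22:13
This post was last edited by Prescott on 2009-10-3 00:20

The real highlights are
1. Greatly increased DP units
2. Unified address space
3. ECC
These can be said to be aimed entirely at GPGPU.

The improvement for 3D rendering performance is doubling the SM's SIMD width to 16, and the resulting rendering gain remains to be tested. It is certainly less than 2x; my optimistic estimate is between 60% and 80%.

Author: RacingPHT Time: 2009-10-2 22:57
Notice: the author has been banned or deleted; content automatically hidden

Author: RacingPHT Time: 2009-10-2 23:00
Notice: the author has been banned or deleted; content automatically hidden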

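Author: Prescott Time: 2009-10-3 00:10
RacingPHT, I agree with you about performance. These features will not improve performance, but they are crucial for general-purpose programmability.
On unified addressing: before unified addressing, a lot of C/C++ syntax simply could not be compiled to PTX. Take dereferencing a float*: at compile time the compiler can hardly know what the pointer refers to. It may point to shared memory this time and to global memory the next, so instruction generation becomes a big problem, because these are entirely different kinds of pointers. The alternative is to qualify pointers, for example __shared__ float*, and forbid assignment and conversion between differently qualified pointers. That is obviously even more restrictive: func(__shared__ float*) and func(__global__ float*) become different functions.

So unified addressing is crucial for programmability. As for memory that could not be written before, this is consistent with C/C++ semantics: a const float * is not writable in the first place, and if you insist on casting it to float* and writing through it, the consequences are on you; Fermi is fully entitled to crash on you. CUDA can take a big step forward from here. No new CUDA version has been released yet, but it is something to look forward to.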
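
To make the pointer argument above concrete, here is a rough Python model of a single generic address space in which the memory space is decided at run time from the address itself. Everything in it is hypothetical: the window base and size, and the names `resolve` and `load`, are invented for illustration and do not describe Fermi's actual address map.

```python
# Hypothetical layout: the window bounds below are made up for illustration.
SHARED_BASE = 0x0100_0000
SHARED_SIZE = 48 * 1024
SHARED_WINDOW = range(SHARED_BASE, SHARED_BASE + SHARED_SIZE)


def resolve(generic_address):
    """Map a generic address to (memory space, offset) at run time."""
    if generic_address in SHARED_WINDOW:
        return ("shared", generic_address - SHARED_BASE)
    return ("global", generic_address)


def load(generic_address, shared_mem, global_mem):
    # One load routine serves every pointer; no per-space function variants.
    space, offset = resolve(generic_address)
    return shared_mem[offset] if space == "shared" else global_mem[offset]


shared_mem = {16: 1.5}        # offset -> value
global_mem = {0x2000: 2.5}    # address -> value
```

With this arrangement a function taking a plain pointer can be handed either kind of address, for example `load(SHARED_BASE + 16, shared_mem, global_mem)` and `load(0x2000, shared_mem, global_mem)`, which is the property the post argues C/C++ compilation needs.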

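On SM width: to me, the number of SPs in an SM (Streaming Multiprocessor) is the SIMD width. So my understanding is that Fermi now has 32 SMs, each 16 SPs wide; I seem to have gotten this wrong earlier. I have not yet figured out what the Dual Warp Scheduler is about; it looks as if the 16 SPs are split into two groups. A warp is merely a CUDA concept and should have nothing to do with the hardware implementation.

I have no sense of how important atomic operations are in graphics, but for the high-performance computing market that Fermi targets, they should not be particularly important.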

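Author: Prescott Time: 2009-10-3 00:13

Also, a unified address space easily tempts programmers into writing extremely slow code. So does C++.

This reminds me of the PS3's unified address space, where programmers accidentally read or wrote video memory from the CPU, saw performance collapse, and cursed without understanding what was going on.
Posted by RacingPHT on 2009-10-2 23:00

Well, all hardware has this problem to a greater or lesser extent. A CPU core is so much bigger than a GPU core precisely to avoid these situations.
In this respect, Nehalem >> LRB > Fermi

Author: RacingPHT Time: 2009-10-3 00:41
Notice: the author has been banned or deleted; content automatically hidden

Author: lik Time: 2009-10-3 12:43
GPUs are about parallel computing, and the key is to parallelize the problem. Once that is done well, the program will naturally be fast. For a single thread, the IPC of an SM is probably worse than a Pentium Pro's. But who cares? If you want maximum efficiency, just don't use C++ features. The same holds on CPUs, which is why most people doing embedded real-time systems still use C rather than C++. If you want a balance between development speed and code efficiency, C++ is a good choice. Of course many people write terrible code, but that is a general phenomenon and not specific to GPUs.

In short, what you said applies, IMHO, to CPUs as well.

I agree with P; pointer semantics are indeed unified. But even if C++ syntax is eventually implemented, the performance will probably still be at 486 level. In the end, programmers still need to know exactly what is going on inside. So something like C++ may lower the initial barrier, but in the end to get ...
Posted by RacingPHT on 2009-10-3 00:41

Author: RacingPHT Time: 2009-10-3 15:33
Notice: the author has been banned or deleted; content automatically hidden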

Author: Edison Time: 2009-10-4 00:53
The cost of Fermi's double-precision implementation is not yet clear. Looking at Cell, though: compared with Cell, the cost of Cell eDP is mainly a 10% increase in the SPE, and Cell's DP implementation resembles GT200's in that it uses dedicated DP units. Fermi's enhanced DP is implemented by running the single-precision and double-precision units together, so I suspect Fermi's cost for it will not be much higher (+10% per 8 SPs?).

Of course, if there are n such 10% increases, the total cost increase becomes considerable.

Author: hd4770 Time: 2009-10-4 05:49
Reply to #7 voodoo12345
Native C++ support means the CUDA compiler can compile C++ programs; on GT200 it can only compile C programs. That is the only difference. Optimization will not be there initially, but as in an open source community, anyone can contribute an optimized extension, so one day there might be a huge open library that provides optimized APIs. Video transcoding, Adobe PS4, Flash, MATLAB and so on are current examples. The future selling point of the GPU would be something like this: if you bought an SoC (junior CPU + senior GPU), you could speed up one of your existing or in-development C++ programs by calling a function in a C++ CUDA extension. If that gives even a 2x speedup, it sounds valuable, not to mention future operating systems running faster on an SoC than on Intel alone.

Author: RacingPHT Time: 2009-10-4 21:51
Notice: the author has been banned or deleted; content automatically hidden

Author: hd4770 Time: 2009-10-5 07:19

Improved atomic operator performance reduces the cost of serialized execution when multiple SPs contend for the same address.
Posted by RacingPHT on 2009-10-4 21:51

Can you elaborate? Here is my understanding of NV atomics: say threads A, B and C access address A0, and the execution order of the three threads is random. If B accesses it first, then A and C are halted until B is done. So the best total execution time for the three threads is A's exec time + B's exec time + C's exec time. Is this close to what you meant?
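
The timing argument in this question can be written down as a tiny idealized model. The per-thread times below are hypothetical units chosen only for illustration, and the model ignores any latency hiding by other threads.

```python
def contended_time(op_times):
    # Atomics on the same address execute one after another,
    # so the total is the sum of the individual operation times.
    return sum(op_times)


def uncontended_time(op_times):
    # Atomics on different addresses can, ideally, overlap completely,
    # so the total is set by the slowest one.
    return max(op_times)


# Hypothetical execution times for threads A, B and C.
times = {"A": 3, "B": 2, "C": 4}
same_address_total = contended_time(times.values())
distinct_address_total = uncontended_time(times.values())
```

The gap between the two totals is the serialization cost that faster atomics would shrink.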

Author: RacingPHT Time: 2009-10-5 09:42
Notice: the author has been banned or deleted; content automatically hidden

Author: hd4770 Time: 2009-10-5 12:07

hd4770:
I don't know NV's actual implementation, or whether there are other optimizations, for example whether an atomic operation triggers a thread switch so that this time can be hidden.
But under heavy access, that is what I meant.
Posted by RacingPHT on 2009-10-5 09:42

Thanks for clarifying. From the published document, we can see that threads are dynamically loaded onto SPs. So yes, if tons of threads are active, the stalling of a few threads due to atomic operation conflicts can easily be hidden by other threads. I think the importance of atomics is that they theoretically allow a big task to be broken down into small pieces. It may sound odd for multiple pieces to work on the same address, but it cannot be ruled out. Today's operating systems tend to break tasks into thousands of small pieces, which could be sped up by GPUs. Anyhow, your observation is very interesting.

Author: RacingPHT Time: 2009-10-5 18:30
Notice: the author has been banned or deleted; content automatically hidden

Author: ic.expert Time: 2009-10-5 23:17
This post was last edited by ic.expert on 2009-10-5 23:26

I strongly agree with RacingPHT's view :>

Also, I suggest that Append Buffer in Chief Chen's article be translated as 附加缓冲区 ("append buffer"), because it is used for stream processing to buffer data between two kernels; strictly speaking, Producer Buffer would be a more accurate name.

Author: Edison Time: 2009-10-5 23:57
Previously this was used by the geometry shader to generate new triangles.

Author: RacingPHT Time: 2009-10-6 10:48
Notice: the author has been banned or deleted; content automatically hidden

Author: hd4770 Time: 2009-10-6 11:24

HD4770:
Since NV emphasizes improved atomic op performance on the same address, there is reason to believe this operation is needed.

For example, in a producer-consumer pattern, some CUDA threads produce task packets while other CUDA threads consume them, and this situation may arise. It may be necessary ...
Posted by RacingPHT on 2009-10-5 18:30

Agreed. One obvious use of atomics is global synchronization, which some would call syncBlock. Given blocks 0, 1, ..., N, thread 0 of each block atomically increments a counter, all threads in the block synchronize, and then every thread of the block polls that counter.
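
The counter-based global barrier described above can be sketched on the host with Python threads standing in for thread blocks. This is only an analogy: `run_blocks` is an invented name, a `threading.Lock` stands in for the hardware atomic, and each Python thread plays the role of one block's thread 0.

```python
import threading


def run_blocks(num_blocks):
    counter = 0
    lock = threading.Lock()            # stands in for the hardware atomic
    seen_after_barrier = [None] * num_blocks

    def block(block_id):
        nonlocal counter
        with lock:                     # atomic increment on arrival
            counter += 1
        while True:                    # poll until every block has arrived
            with lock:
                if counter == num_blocks:
                    break
        # Past the barrier: every block must observe the full count.
        seen_after_barrier[block_id] = counter

    threads = [threading.Thread(target=block, args=(i,))
               for i in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after_barrier


observed = run_blocks(4)
```

One caveat worth keeping in mind: a polling barrier like this only terminates if all participants can run at the same time, since a block that has not started can never increment the counter.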

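Author: Edison Time: 2009-10-7 00:33
http://www.realworldtech.com/for ... 103203&roomid=2

Today in an RWT discussion thread I saw DK say that, according to AMD's CTO, RV770 can do CKE (concurrent kernel execution). In the follow-up discussion, however, some argued that this only means the thread pool can hold VS/PS plus compute kernels while execution is serial, not CKE in the sense of multiple compute kernels executing in parallel at the same instant.

RV770 may have some hidden hooks for this at the low level, but so far no real software support has appeared. For now, hardware functionality without software support is as good as absent. I hope AMD provides the necessary technical support soon, if it really can do CKE ^^.

Author: Eji Time: 2009-10-8 13:42

http://www.realworldtech.com/forums/index.cfm?action=detail&id=103306&threadid=103203&roomid=2

Today in an RWT discussion thread I saw DK say that, according to AMD's CTO, RV770 can do CKE. In the follow-up discussion, however, some argued that this ...
Posted by Edison on 2009-10-7 00:33

ATI's concurrent kernel execution capability is indeed the biggest open question right now....

Author: Edison Time: 2009-10-9 02:25
It turns out that back in September OKA leaked slides about the OLCF-3 supercomputer, and some of the information in them looks rather interesting now:

http://www.cisl.ucar.edu/dir/CAS2K9/Presentations/bland.pdf

Author: sharko Time: 2009-10-9 16:40
Is this geared toward scientific computing? There seems to be nothing special on the graphics side.

Author: Edison Time: 2009-10-9 16:59
No, the details of the graphics part just have not been disclosed, but DX11 compliance is not in question.

Author: me210 Time: 2009-10-17 19:49
Learning~!

Author: Edison Time: 2009-10-19 22:46
AMD's CKE may be implemented like this:

Author: 柏诚 Time: 2009-10-20 22:23
So many experts here

Author: SupperSix Time: 2009-10-21 13:45
I just want to know how much real gaming performance gain the current architecture can bring

Author: Edison Time: 2009-10-21 14:02
The change in gaming performance depends heavily on the specific game and settings and cannot be judged from a single game. Moreover, no further details of the graphics architecture have been released, so there is no credible answer to this question for now.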

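Author: lik Time: 2009-10-22 13:30
By x86 you mean high-performance CPUs in general, right? Mainly those with speculation and OOO capability; processors that support the x86 ISA are only a subset of them. My point is that a GPU's single-thread performance is mediocre to begin with. A GPU delivers throughput through parallel computing, so bad code does not really matter. From your angle of tolerating bad code, if you measure throughput, GPU multi-threading is no less effective than CPU speculation. A simple example: program A ignores memory locality and data reuse. A CPU can prefetch, issue speculative loads, perform OOO loads, and add large caches to deal with long memory latency. A GPU instead relies on multi-threading: as soon as one thread reads memory, it switches to another thread, and when the data comes back from memory, it switches back to the original thread. As long as there are enough threads to hide latency, memory latency is simply not a problem, and even without those CPU tricks GPU throughput remains high. Another example: a bad program has many unnecessary branches. A CPU keeps the branch miss penalty low with a BTB and a branch predictor. A GPU has none of these, but it has many threads; when a thread hits a branch instruction, that thread's instruction fetch is stalled. Doing that in a CPU would put bubbles in the pipeline and hurt performance, but a GPU has many threads and can always keep the pipeline full, so this is not a problem at all.

In short, single-thread performance on a GPU is indeed dismal, but if your CUDA program is well written and solves the problem with a large number of parallel threads, single-thread performance does not matter at all.

"IMHO it applies to CPUs as well."

If the CPU you mean is x86, I don't think it applies. Everything x86 does is there to cope with bad code. The same bad code on architectures like Power or CELL usually becomes very bad, and on CUDA it can only be called dismal ...
Posted by RacingPHT on 2009-10-3 15:33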
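
The latency-hiding argument in this post can be put into a small idealized model. It assumes each warp alternates between a fixed amount of compute and a fixed memory wait, with free switching between warps; the cycle counts used are hypothetical and not measurements of any GPU.

```python
def pipeline_utilization(num_warps, compute_cycles, memory_latency):
    """Fraction of cycles the execution pipeline is busy (idealized model)."""
    # Each warp computes for compute_cycles, then waits memory_latency cycles.
    # While it waits, the other warps can issue their own compute work.
    busy = num_warps * compute_cycles
    period = compute_cycles + memory_latency
    return min(1.0, busy / period)


def warps_needed(compute_cycles, memory_latency):
    # Smallest warp count for which the pipeline never idles in this model
    # (integer ceiling division).
    return -(-(compute_cycles + memory_latency) // compute_cycles)


# Hypothetical numbers: 10 cycles of work per 390-cycle memory access.
single = pipeline_utilization(1, 10, 390)
half = pipeline_utilization(20, 10, 390)
full = pipeline_utilization(40, 10, 390)
needed = warps_needed(10, 390)
```

In this model one warp alone keeps the pipeline busy only a few percent of the time, while enough resident warps drive utilization to 100% without any caches or prefetching, which is the point being made about throughput versus single-thread speed.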

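Author: lik Time: 2009-10-22 13:36
This post was last edited by lik on 2009-10-22 14:41

Edit: I did not read your point about the serial part carefully at first. Sorry. You meant the parts that cannot be parallelized. Those should run on the host (that is, the CPU). CUDA is a co-processor architecture, not just a GPU. The GPU only executes CUDA kernels, the part that can be parallelized; the host executes the part that cannot. Jensen Huang gave this example in his GTC keynote: a system with only 5 CPU cores or only 500 GPU cores is not ideal, whereas a co-processor architecture such as 1 CPU core + 450 GPU cores is comparatively better. That one CPU core is there to execute the non-parallelizable parts you mentioned.

Of course, communication with the CPU then becomes a rather time-consuming bottleneck.

What I originally meant is that the single-thread performance of CUDA kernels is not very important. Sorry for the confusion.

the IPC is probably worse than a Pentium Pro's. But who cares?

That statement is quite problematic.
Even with infinitely many execution units reducing the parallel time to zero, a program always has some parts that cannot be parallelized. That portion determines the program's maximum speed.
In CUDA ...
Posted by RacingPHT on 2009-10-4 21:51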
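
The quoted point about the non-parallelizable portion is Amdahl's law. A short sketch follows; the 0.8 parallel fraction is an arbitrary example value, and 450 units simply echoes the GPU core count mentioned in the keynote example.

```python
def amdahl_speedup(parallel_fraction, num_units):
    # The serial part runs at the original speed; the parallel part is
    # divided evenly across num_units execution units.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_units)


def speedup_limit(parallel_fraction):
    # Limit as num_units grows without bound: only the serial part remains.
    return 1.0 / (1.0 - parallel_fraction)


with_450_units = amdahl_speedup(0.8, 450)
ceiling = speedup_limit(0.8)
```

With 80% of the work parallelizable, even 450 units give a speedup just under 5x, and no number of units can exceed 5x, which is why the speed of the serial part still matters.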

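Author: RacingPHT Time: 2009-10-22 14:05
Notice: the author has been banned or deleted; content automatically hidden

Author: RacingPHT Time: 2009-10-22 14:09
Notice: the author has been banned or deleted; content automatically hidden

Author: lik Time: 2009-10-22 14:33
Yes, I meant warps; "thread" is not the right word here. I probably got used to saying "thread" from working on CPU SMT. The case I described is one where all threads in a warp have the same branch outcome, all taken or all not taken, so SIMD behavior is preserved. Before the branch resolves, the GPU can run other warps, so the pipeline has no bubbles. What you describe is threads within one SIMD warp having different branch outcomes, which is indeed the worst case for a GPU. SIMD then turns into MIMD and the GPU degenerates into a simple CPU. But this case is uncommon (AFAIK).

lik: I think what you said about threads, branches and pipeline bubbles on the GPU is wrong.
A bubble when a GPU hits a branch hurts far more than on a CPU, because of the inherent execution mechanism of SIMD.

Anyway, I am not a CUDA expert myself, and I have no way to put my ...
Posted by RacingPHT on 2009-10-22 14:09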
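
The two cases being argued over, a warp whose threads all branch the same way versus a warp whose threads diverge, can be compared with a simplified cost model. The cycle counts are hypothetical, and the model assumes a diverged warp simply runs both paths back to back with inactive threads masked off.

```python
def warp_cycles(taken_flags, then_cycles, else_cycles):
    """Cycles a SIMD warp spends on an if/else, in a simplified model."""
    cycles = 0
    if any(taken_flags):        # at least one thread needs the "then" path
        cycles += then_cycles
    if not all(taken_flags):    # at least one thread needs the "else" path
        cycles += else_cycles
    return cycles


WARP_SIZE = 32
uniform_taken = warp_cycles([True] * WARP_SIZE, 10, 20)
uniform_not_taken = warp_cycles([False] * WARP_SIZE, 10, 20)
diverged = warp_cycles([i % 2 == 0 for i in range(WARP_SIZE)], 10, 20)
```

A uniform warp pays for one path only, while a diverged warp pays for both, regardless of how many threads actually needed each path.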

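Author: RacingPHT Time: 2009-10-22 14:39
Notice: the author has been banned or deleted; content automatically hidden

Author: lik Time: 2009-10-22 14:45
This post was last edited by lik on 2009-10-22 14:47

You are right, CUDA is a host + co-processor architecture. General-purpose computing does not mean the GPU handles everything; it means the GPU can do computation other than graphics rendering. 80% of the arithmetic is on the GPU and 20% on the host CPU. BTW, I just edited one of my earlier posts to clarify what I got wrong.

If it is all matrix multiplication or motion estimation for video transcoding, of course warps rarely get broken up. That is what GPUs are best at.

But as "general-purpose computing", how could that be the case? Most typically, 80% arithmetic and 20% flow control.
...
Posted by RacingPHT on 2009-10-22 14:39

Author: Edison Time: 2009-10-22 15:16

If so, wouldn't the register file copy traffic be very large?
Posted by RacingPHT on 2009-10-22 14:05

The same number of registers is still accessed at any one time. The register file here can certainly be shared by n kernels (at most 8 IPs pointing to different kernel locations?), so it looks just like different threads of the same kernel.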

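Author: tilong-lee Time: 2009-10-26 04:27
Came in to learn,

Author: Edison Time: 2009-10-28 01:04
http://www.hpctech.com/2009/1028/332.html

In 2017 what they will be playing with is....

[Attachment 1149313: image not preserved]

Don't forget to look at the "Moore's Law" text at the very bottom.

Author: RacingPHT Time: 2009-10-28 09:18
Notice: the author has been banned or deleted; content automatically hidden

Author: Edison Time: 2009-10-28 11:09
http://www.csm.ornl.gov/workshop ... a_workshop_0708.pdf

Author: tilong-lee Time: 2009-10-30 00:45
What about increasing the number of stream processors??

Author: yamhill Time: 2009-10-30 14:13
Increasing the number of stream processors is probably not easy

Each NV stream processor comes with a whole pile of supporting logic

Author: astrofy Time: 2009-11-1 13:35
I just hope mid-range and low-end derivatives come out quickly this time

Author: dycgz Time: 2009-11-8 21:49
I wonder when it will come out?

Author: Eji Time: 2009-11-9 03:44
Judging from precedent, the timing of Fermi's mid-range and low-end derivatives may drag on somewhat, as with the transition from GT200 to GT21x.
Frankly, what a mid-range performance competitor needs most may be just GT200 class + GDDR5;
but the high end needs DX11 for marketing....
And once multi-threaded applications take off, GT200 and earlier products, which lack concurrent kernel execution, will be at a relative disadvantage.

In conclusion, it depends on how fast DX11 games are adopted. As for manufacturability, I am not too pessimistic about the difficulty of launching Fermi mid-range products. Fermi "itself" is troublesome to mass-produce, true, but its die size is smaller than GT200's to begin with.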

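Author: 66666 Time: 2009-11-9 08:59
Smaller than GT200?

Hard to believe

Author: aibo Time: 2009-11-9 11:50

Smaller than GT200?

Hard to believe
Posted by 66666 on 2009-11-9 08:59

GT200 is the largest to date, at 576 mm²

Author: RacingPHT Time: 2009-11-9 13:09
Notice: the author has been banned or deleted; content automatically hidden

Author: knightmaster Time: 2009-11-9 14:56
Mid-range DX11

should be around 256SP/256-bit in scale

At that scale, beating RV870 in performance is no problem

Author: melissa Time: 2009-11-9 16:32

Mid-range DX11

should be around 256SP/256-bit in scale

At that scale, beating RV870 in performance is no problem
Posted by knightmaster on 2009-11-9 14:56

= =......well.....I have my reservations, unless efficiency improves over GT200. The GTX 285 can't even beat the 5850 = =~

Author: knightmaster Time: 2009-11-9 18:35
GT200's tragedy lies in its clock speed

In terms of scale, it is no smaller than its rival

Author: Edison Time: 2009-11-13 12:45

The documents all seem to be for CUDA 2.3.

Author: denev2004 Time: 2009-11-14 21:29

GT200's tragedy lies in its clock speed

In terms of scale, it is no smaller than its rival
Posted by knightmaster on 2009-11-9 18:35

Is the clock speed poor precisely because the scale is too large?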

Author: Edison Time: 2009-11-17 00:12
The family of Tesla 20-series GPUs includes:

Tesla C2050 & C2070 GPU Computing Processors
Single GPU PCI-Express Gen-2 cards for workstation configurations
Up to 3GB and 6GB (respectively) on-board GDDR5 memory
Double precision performance in the range of 520GFlops - 630 GFlops
Tesla S2050 & S2070 GPU Computing Systems
Four Tesla GPUs in a 1U system product for cluster and datacenter deployments
Up to 12 GB and 24 GB (respectively) total system memory on board GDDR5 memory
Double precision performance in the range of 2.1 TFlops - 2.5 TFlops
The Tesla C2050 and C2070 products will retail for $2,499 and $3,999 and the Tesla S2050 and S2070 will retail for $12,995 and $18,995. Products will be available in Q2 2010. For more information about the new Tesla 20-series products, visit the Tesla product pages.

Editors’ note: As previously announced, the first Fermi-based consumer (GeForce®) products are expected to be available first quarter 2010.

http://www.nvidia.com/object/io_1258360868914.html

In addition, typical power draw is 190 W, with a maximum of 225 W.
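
As a quick consistency check on the figures quoted above, the four-GPU system numbers follow from the single-card numbers by multiplication. The variable names are just for this sketch.

```python
per_gpu_gflops = (520, 630)   # quoted double-precision range for one C2050/C2070
gpus_per_system = 4           # S2050/S2070: four Tesla GPUs in a 1U system

# Double-precision range of the 1U system, in TFlops.
system_tflops = tuple(g * gpus_per_system / 1000 for g in per_gpu_gflops)

# Total memory of the 1U systems, from 3 GB and 6 GB per GPU.
memory_gb = {"S2050": 3 * gpus_per_system, "S2070": 6 * gpus_per_system}
```

The results, 2.08 to 2.52 TFlops and 12 GB / 24 GB, match the quoted 2.1 to 2.5 TFlops and memory sizes after rounding.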

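Author: zjcr Time: 2009-11-23 17:40
C++ is just a language for communication between humans and computers; once compiled, it has no direct connection to how the processor does its work

Author: denev2004 Time: 2009-11-27 19:14
Replying to the post above: I think what they meant is that it provides an address space for C++ pointers and improves their efficiency.

Author: Asuka Time: 2009-11-28 09:24
Let me check: in the Tesla 20 architecture described on NV's official site, ECC is actually 7+1?

What is going on? Isn't it normally 8+1?

Can the data even be split up?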
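
One way to read "7+1" is that one eighth of the DRAM is set aside for ECC bits instead of adding extra memory chips. That reading is an assumption based on the description above, since NVIDIA's scheme is undisclosed; the function names and the default fraction are illustrative only.

```python
def usable_with_ecc(total_gb, reserved_fraction=1 / 8):
    # reserved_fraction is an assumption read off the "7+1" description:
    # 1 part in 8 of the existing DRAM holds ECC bits.
    return total_gb * (1 - reserved_fraction)


def sideband_overhead(data_bits=64, ecc_bits=8):
    # Conventional ECC DIMM: 8 extra bits travel alongside every 64 data bits,
    # stored in additional chips, so user-visible capacity is unchanged.
    return ecc_bits / (data_bits + ecc_bits)


c2050_usable = usable_with_ecc(3)   # 3 GB card
c2070_usable = usable_with_ecc(6)   # 6 GB card
```

Under this assumption the data is not split up; the card simply exposes less capacity (2.625 GB of 3 GB) when ECC is enabled, whereas the conventional "8+1" layout pays with extra hardware instead.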

Author: Edison Time: 2009-11-30 18:10
A report benchmarking atomic operation performance on GT200:

http://strobe.cc/articles/cuda_atomics/

Three memory access patterns will be tested. The first goes straight for the jugular: all writes across an SM go to the same address, ensuring that all atomic operations cause a conflict. Each SM gets its own address, though, because having all processors write to the same location caused several system crashes during testing. This is expected to be nearly the worst case for atomic operations, and the results do not disappoint:

Ick. Let’s not do that again.
The next access pattern is less pessimal; each memory location is separated by 128 bytes, and each thread gets its own memory location, ensuring that no conflicts occur but also preventing the chip from coalescing any memory operations.

Well, that’s… tolerable. It remains to be seen whether atomics can be used for scatters in computation threads, but this looks like it wouldn’t cause too much damage. One last access pattern: this time, all threads are neatly coalesced, each accessing a 4-byte memory location in order, such that a warp hits a single 256-byte-wide, 256-byte-aligned region of memory.

Crap. That’s quite a bit worse. Sure, the total latency for an atomic operation is better, but the ratio between an uncoalesced atomic and read-modify-write latency is much smaller than that for the coalesced pattern, so the relative cost of atomic operations in this context is much worse.
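
The three access patterns in the quoted article can be reproduced as address lists to see where conflicts arise. This is only a sketch of the addressing: the thread count of 32 and the helper names are assumptions for illustration, and nothing here measures latency.

```python
from collections import Counter

NUM_THREADS = 32  # hypothetical number of threads issuing atomics together


def same_address(base, num_threads=NUM_THREADS):
    # Pattern 1: every thread targets one address, so every atomic conflicts.
    return [base] * num_threads


def strided(base, stride_bytes=128, num_threads=NUM_THREADS):
    # Pattern 2: one location per thread, 128 bytes apart (no conflicts,
    # but nothing can be coalesced).
    return [base + i * stride_bytes for i in range(num_threads)]


def coalesced(base, word_bytes=4, num_threads=NUM_THREADS):
    # Pattern 3: consecutive 4-byte locations, one per thread.
    return [base + i * word_bytes for i in range(num_threads)]


def max_conflicts(addresses):
    # Largest number of threads that hit the same address.
    return max(Counter(addresses).values())
```

Only the first pattern forces the atomics to serialize on a single location; the other two differ in how the memory system can group the accesses, which is what the article's latency comparison is about.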

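Author: zxjike Time: 2009-12-1 16:34
Notice: the author has been banned or deleted; content automatically hidden

Author: frankexem Time: 2009-12-11 16:35
Here C++ has become managed code

Author: nonolaw Time: 2009-12-16 09:12
The technical documents seem to describe a big leap; I hope the theory translates into practice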

Author: 胡小华 Time: 2009-12-16 16:02
hd4770:
I don't know NV's actual implementation, or whether there are other optimizations, for example whether an atomic operation triggers a thread switch so that this time can be hidden.
But under heavy access, that is what I meant.
Posted by RacingPHT on 2009-10-5 09:42

Thanks for clarifying. From the published document, we can see that threads are dynamically loaded onto SPs. So yes, if tons of threads are active, the stalling of a few threads due to atomic operation conflicts can easily be hidden by other threads. I think the importance of atomics is that they theoretically allow a big task to be broken down into small pieces. It may sound odd for multiple pieces to work on the same address, but it cannot be ruled out. Today's operating systems tend to break tasks into thousands of small pieces, which could be sped up by GPUs. Anyhow, your observation is very interesting.

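Author: Edison Time: 2009-12-20 22:24

Multi-node setups should benefit to some extent.

Author: Edison Time: 2009-12-20 22:25

Author: lemonninja Time: 2009-12-21 11:50
Reply to #77 Edison

Has any info leaked yet?

Author: zhj02002 Time: 2009-12-30 09:48
Learned something~~~~~~

Author: cky3 Time: 2010-1-3 20:16
So profound

Author: disruptor Time: 2010-1-5 23:52
Is this predication feature somewhat like the weights in a neural network?

Author: xiuxiulinlin Time: 2010-1-8 13:53
OK, I'll study this carefully.

Author: max3396 Time: 2010-1-11 21:01
Looking forward to the launch...

Author: tsfbbb Time: 2010-1-12 15:44
I'll study this

Author: bosice Time: 2010-1-12 16:10
Uh, GT didn't go well, so NV has to put out a new one

Author: gradxia Time: 2010-1-14 20:41
Learned something, bump

Author: westlee Time: 2010-1-16 23:00
Notice: the author has been banned or deleted; content automatically hidden

Author: Edison Time: 2010-1-17 01:41
From the standpoint of the triangle setup engine, Fermi looks very much like a "multi"-core design.

Author: Edison Time: 2010-1-18 13:08
GF100 graphics architecture: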

In a traditional pipeline setup for GPUs the Geometry Shader, Vertex Shader, Setup/Rasterizer functions would come at the front end of the pipeline. This creates a situation where data will be stored and read from memory on the video card. This is just how things have been done for the longest time, and NVIDIA believes the traditional setup creates a bottleneck in geometry performance.

Not so simply, what NVIDIA have done is to separate the Raster Engine from the pipeline and move it down into the GPCs in four parts, and they have created a new engine they are calling the "PolyMorph Engine" which is integrated into the SMs. First a little breakup of the hierarchy, the GF100 is made up of 4 GPCs (Graphics Processing Clusters) which break down into 4 SMs (Streaming Multiprocessors) which break down into 32 CUDA cores and 4 Texture Units and some other stuff. So, 32 CUDA cores plus 4 Texture Units plus the PolyMorph Engine make up an SM, and 4 SMs make up a GPC. With this kind of parallelism you can see how the GPU can be sliced and diced to create less expensive parts.

Inside each GPC you will find the actual Raster Engine, so there are basically 4 Raster Engines inside the GF100. Inside each SM (a culmination of 32 CUDA cores and 4 Texture Units) you will find the new PolyMorph Engine. The PolyMorph Engine contains the actual Vertex Fetch, Tessellator, Viewport Transform, Attribute Setup and Stream Output functions. Again, all of these functions, including the Rasterizer use to be in one area on the GPU sitting at the front end of the entire process in the pipeline.

NVIDIA claims 8X the geometry performance of GT 200. This re-ordering of the graphics pipeline caused an increase of 10% of the die size and from our understanding of the issue, is the reason GF100 is "late." The problem with all this moving about of the pipeline is that now you have a Tessellator and Triangle Setup in each SM and GPC so all your Triangles that get setup are all setup out of order.

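In short, in GF100, 32 CUDA cores + 4 TMUs + a PolyMorph Engine make up one SM. Four SMs make up a GPC (Graphics Processing Cluster), and each GPC contains one raster engine.

The PolyMorph Engine contains the "actual" vertex fetch, tessellator, viewport transform, attribute setup and stream output functions.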
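
The hierarchy summarized above can be tallied directly. The per-unit counts come from the quoted description; the constant names are just for this sketch.

```python
GPCS = 4            # Graphics Processing Clusters in GF100
SMS_PER_GPC = 4     # Streaming Multiprocessors per GPC
CORES_PER_SM = 32   # CUDA cores per SM
TMUS_PER_SM = 4     # texture units per SM

sms = GPCS * SMS_PER_GPC
totals = {
    "SMs": sms,
    "CUDA cores": sms * CORES_PER_SM,
    "texture units": sms * TMUS_PER_SM,
    "PolyMorph Engines": sms,   # one per SM
    "Raster Engines": GPCS,     # one per GPC
}
```

This gives 16 SMs, 512 CUDA cores and 64 texture units for a full chip, and also shows how dropping whole GPCs would scale the part down.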

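NVIDIA claims GF100's geometry performance is 8 times that of GT200, and this part of the graphics pipeline adds 10% to the die area; HardOCP believes this is why GF100 is late.

Also, according to semiaccurate, GF100's PolyMorph Engine is really a set of acceleration instructions for rasterization and front-end geometry processing added to the CUDA cores, and no physical PolyMorph Engine exists. This is of course only semiaccurate's own view and has not been substantiated. semiaccurate has consistently held that GF100 is "Larrabee done wrong", and in the G80/R600 era its author published claims such as that R600 would trounce G80. :p

http://www.semiaccurate.com/2010 ... d-unmanufacturable/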

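Author: Edison Time: 2010-1-18 14:02
In terms of scale, GF100 has 4 times as many rasterization (that is, screen-space processing) engines as GT200.

This means that, with sufficient parallelization and equal per-engine specifications, GF100's geometry throughput at the same clock should reach 4 times GT200's.

However, NVIDIA states that GF100's geometry performance is 8 times GT200's, which leaves two possibilities:

1. GT200's raster engine has half the capability of a single GF100 raster engine.
2. GF100's raster engines run at the shader clock.

GF100's texture units run at 1/2 the shader clock, rather than at the core clock as on GT200.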
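
The reasoning above amounts to factoring the claimed 8x into engine count, per-engine capability and clock. In the sketch below the factor of 2 for the clock case is an assumption (a shader clock roughly twice the clock GT200's setup runs at), not a published figure.

```python
def geometry_ratio(engine_count_ratio, per_engine_ratio, clock_ratio):
    # Relative geometry throughput as a product of three independent factors.
    return engine_count_ratio * per_engine_ratio * clock_ratio


baseline = geometry_ratio(4, 1, 1)        # 4x engines, same engine, same clock
possibility_1 = geometry_ratio(4, 2, 1)   # each GF100 engine is twice as capable
possibility_2 = geometry_ratio(4, 1, 2)   # same engine at an assumed 2x clock
```

Either remaining factor of 2 is enough to turn the 4x from engine count into the claimed 8x.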

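Author: 苯苯小哥 Time: 2010-1-18 15:44
This post was last edited by 苯苯小哥 on 2010-1-18 15:47

Edison, won't you start a thread analyzing the newly released official GF100 materials and PDFs?
Is the final GF100 product a full 512 SPs? That is more than Tesla this time

Author: Edison Time: 2010-1-18 16:14
This is pcinlife's second thread on the Fermi architecture. There is no need for a separate new thread for now; all related discussion takes place here.

In fact, if you had looked, I had already posted a preliminary discussion before your post.

Author: jhg1159 Time: 2010-1-18 16:25
1/2 the shader clock is about the same as the core clock.
Would the ratio between shader and core clock exceed 1:2.5?

Author: Edison Time: 2010-1-18 16:54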
The shader clock now drives the majority of the chip, including the shaders, the texture units, and the new PolyMorph and Raster Engines. Specifically, the texture units, PolyMorph Engine, and Raster Engine all run at 1/2 shader clock (which NVIDIA is tentatively calling the "GPC Clock"), while the L1 cache and the shaders themselves run at the full shader clock.

So setup should run at 1/2 the shader clock, that is, at the GPC clock.

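Author: Edison Time: 2010-1-18 21:42
This feature is new for G80+, but apparently not for R520+.

Author: pharaohs1024 Time: 2010-1-19 01:27
Notice: the author has been banned or deleted; content automatically hidden

Author: Edison Time: 2010-1-19 01:45
Before DX11, the problem with adding triangles was that the CPU could not keep up, so games generally stayed under about 2 million triangles per frame. Crysis, for example, at 30 frames per second amounts to about 60M triangles/s. That throughput is only about 1/10 of a graphics card's geometry throughput.

With DX11, the tessellator means triangles can be processed essentially entirely on the GPU, which naturally raises the demand on rasterization performance.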
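
The arithmetic behind these figures is short enough to spell out. The last line inverts the "about 1/10" statement to get the implied hardware rate, which is therefore only as accurate as that rough fraction.

```python
triangles_per_frame = 2_000_000   # upper bound quoted for pre-DX11 games
frames_per_second = 30

# Triangle rate the game actually submits.
game_rate = triangles_per_frame * frames_per_second

# If that is about 1/10 of the card's geometry throughput, the implied
# hardware rate is ten times higher.
implied_gpu_rate = game_rate * 10
```

So a game submitting 60M triangles/s leaves roughly 600M triangles/s of geometry throughput on the table, headroom that tessellation on the GPU could start to use.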

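Author: Eji Time: 2010-1-19 10:52
What impressed me most on reading the Fermi whitepaper this time is that it really does manage to cover every aspect.
GF100 is astonishingly large; without 40nm it could not be manufactured at all on 55nm, which is why S|A calls it unmanufacturable.
But looking back at GF100's architecture, we can be quite confident that at the same scale it will be faster than both G80 and GT200.

For example, comparing 1 GPC against G92, or a 2-GPC version against GT200, Fermi has
1. Fast shader context switching, a brand-new cache architecture, and the PolyMorph Engine
2. TMUs that support several important DX11 changes
3. Greatly strengthened new ROPs, plus GDDR5

And these changes made the whole structure only 10% larger. Hey, that is scary.

The shader changes and the ROP enhancements alone would let this chip run current PhysX-enabled games far faster than GT200b, and the gap would widen further under DX11. The same goes for the 1-GPC version vs G92; don't forget that the GT2x0 family never did produce anything faster than G92....

Of course, whether a 2-GPC product can beat RV870 is an open question, but at the same size as its predecessors Fermi does give confidence that it will be faster than them, which is rare in any GPU vendor's product history.

Author: 苯苯小哥 Time: 2010-1-19 11:36
GF100 looks to have a chance of matching the 5970
GF104 is dual 256; as this mid-range part derives downward, I wonder whether there will be a native dual 128, so that the whole DX11 lineup can be rolled out quickly from high end to low end

Author: skywalker_hao Time: 2010-1-19 14:22

GF100 looks to have a chance of matching the 5970
GF104 is dual 256; as this mid-range part derives downward, I wonder whether there will be a native dual 128, so that the whole lineup can be rolled out quickly ...
Posted by 苯苯小哥 on 2010-1-19 11:36

Dual 128 definitely exists
After all, this structure looks like a quad-core, and dual 128 is exactly a single core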
