NVIDIA Fermi GF100 and GF1XX architecture discussion

cky3 · posted 2010-1-3 20:16

This is pretty deep.

disruptor · posted 2010-1-5 23:52

Isn't this predication feature a bit like the weights in a neural network?

xiuxiulinlin · posted 2010-1-8 13:53

OK, I'll study this properly.

max3396 · posted 2010-1-11 21:01

Looking forward to the launch...

tsfbbb · posted 2010-1-12 15:44

I'll read up on this.

bosice · posted 2010-1-12 16:10

Uh, the GT didn't work out, so NV has to come out with something new.

gradxia · posted 2010-1-14 20:41

Learned something. Bump.

westlee (user deleted) · posted 2010-1-16 23:00

Notice: the author has been banned or deleted; content automatically hidden.

Edison · posted 2010-1-17 01:41

From the point of view of the triangle setup engine, Fermi looks very much like a "multi"-core design.

Edison · posted 2010-1-18 13:08

GF100 graphics architecture:

In a traditional pipeline setup for GPUs the Geometry Shader, Vertex Shader, Setup/Rasterizer functions would come at the front end of the pipeline. This creates a situation where data will be stored and read from memory on the video card. This is just how things have been done for the longest time, and NVIDIA believes the traditional setup creates a bottleneck in geometry performance.

Not so simply, what NVIDIA have done is to separate the Raster Engine from the pipeline and move it down into the GPCs in four parts, and they have created a new engine they are calling the "PolyMorph Engine" which is integrated into the SMs. First a little breakup of the hierarchy, the GF100 is made up of 4 GPCs (Graphics Processing Clusters) which break down into 4 SMs (Streaming Multiprocessors) which break down into 32 CUDA cores and 4 Texture Units and some other stuff. So, 32 CUDA cores plus 4 Texture Units plus the PolyMorph Engine make up an SM, and 4 SMs make up a GPC. With this kind of parallelism you can see how the GPU can be sliced and diced to create less expensive parts.

Inside each GPC you will find the actual Raster Engine, so there are basically 4 Raster Engines inside the GF100. Inside each SM (a combination of 32 CUDA cores and 4 Texture Units) you will find the new PolyMorph Engine. The PolyMorph Engine contains the actual Vertex Fetch, Tessellator, Viewport Transform, Attribute Setup and Stream Output functions. Again, all of these functions, including the Rasterizer, used to be in one area on the GPU, sitting at the front end of the entire process in the pipeline.

NVIDIA claims 8X the geometry performance of GT200. This re-ordering of the graphics pipeline caused an increase of 10% of the die size and, from our understanding of the issue, is the reason GF100 is "late." The problem with all this moving about of the pipeline is that now you have a Tessellator and Triangle Setup in each SM and GPC, so all the triangles that get set up are set up out of order.

In short, in GF100, 32 CUDA cores + 4 TMUs + one PolyMorph Engine make up an SM. Four SMs make up a GPC (Graphics Processing Cluster), and each GPC contains one raster engine.
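As a quick sanity check on that hierarchy, the chip-wide totals can be worked out from the per-SM and per-GPC counts. This is only a sketch: the counts come from the summary above (with 4 GPCs per chip taken from the quoted article), while the constant and function names are made up for illustration.

```python
# Unit counts as stated in the thread; names are illustrative, not NVIDIA's.
GPCS_PER_CHIP = 4          # from the quoted article
SMS_PER_GPC = 4
CUDA_CORES_PER_SM = 32
TMUS_PER_SM = 4
POLYMORPH_ENGINES_PER_SM = 1
RASTER_ENGINES_PER_GPC = 1


def gf100_totals():
    """Multiply the hierarchy out into chip-wide unit counts."""
    sms = GPCS_PER_CHIP * SMS_PER_GPC
    return {
        "sms": sms,
        "cuda_cores": sms * CUDA_CORES_PER_SM,
        "tmus": sms * TMUS_PER_SM,
        "polymorph_engines": sms * POLYMORPH_ENGINES_PER_SM,
        "raster_engines": GPCS_PER_CHIP * RASTER_ENGINES_PER_GPC,
    }


if __name__ == "__main__":
    for name, count in gf100_totals().items():
        print(f"{name}: {count}")
```

For a full chip this gives 16 SMs, 512 CUDA cores, 64 TMUs, 16 PolyMorph Engines and 4 raster engines.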

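The PolyMorph Engine contains the "actual" vertex fetch, tessellator, viewport transform, attribute setup, and stream output functions.

NVIDIA claims GF100's geometry performance is 8 times that of GT200. This part of the graphics pipeline adds 10% to the die area, and HardOCP believes that is why GF100 is late.

Separately, semiaccurate argues that GF100's PolyMorph Engine is really just a set of acceleration instructions for rasterization and front-end geometry processing added to the CUDA cores, and that no physical PolyMorph Engine exists. Of course, this is only semiaccurate's own view and has not been backed up by evidence. semiaccurate has consistently held that GF100 is "Larrabee done wrong", and back in the G80/R600 era the same author published claims such as that R600 would flatten G80. :p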

http://www.semiaccurate.com/2010 ... d-unmanufacturable/

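Edison · posted 2010-1-18 14:02

In terms of scale, GF100 has 4 times as many rasterization (that is, screen-space processing) engines as GT200.

This means that, if the work parallelizes well and the engines are of equal specification, GF100 at the same clock should be able to reach 4 times GT200's geometry throughput.

However, NVIDIA states that GF100's geometry performance is 8 times that of GT200, which leaves two possibilities:

1. GT200's rasterization engine has 1/2 the capability of a single GF100 rasterization engine.
2. GF100's rasterization engines run at the shader clock.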
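The reasoning behind those two possibilities can be written out as a small calculation. This is a rough sketch, assuming relative geometry throughput is simply engine count times per-engine rate times clock; the function name is hypothetical, and the 2x clock ratio in the second case is an assumption (it takes the shader clock to be about twice the clock GT200's engine runs at).

```python
def geometry_ratio(engine_count_ratio, per_engine_rate_ratio=1.0, clock_ratio=1.0):
    """Relative geometry throughput of GF100 vs GT200 under a simple
    multiplicative model: engines x per-engine rate x clock."""
    return engine_count_ratio * per_engine_rate_ratio * clock_ratio


# Equal engines, equal clocks: only the 4x engine count matters.
baseline = geometry_ratio(4)

# Possibility 1: each GF100 engine does twice the work of GT200's engine.
case1 = geometry_ratio(4, per_engine_rate_ratio=2)

# Possibility 2: GF100 engines run at the shader clock, assumed here to be
# about 2x the clock of GT200's engine.
case2 = geometry_ratio(4, clock_ratio=2)

print(baseline, case1, case2)
```

Either factor of 2 on top of the 4x engine count is enough to reach the claimed 8x.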

GF100's texture units run at 1/2 the shader clock, rather than at the core clock as on GT200.

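苯苯小哥 · posted 2010-1-18 15:44

Last edited by 苯苯小哥 on 2010-1-18 15:47

Edison, aren't you going to start a new thread to analyze the official GF100 materials and PDFs that just came out?
Is the shipping GF100 the full 512 SPs? That is even more than the Tesla this time.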

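Edison · posted 2010-1-18 16:14

This is the second thread on pcinlife about the Fermi architecture. There is no need to open another one for now; all related discussion takes place here.

In fact, if you had been reading, you would have seen that I had already posted a preliminary discussion before your post.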

jhg1159 · posted 2010-1-18 16:25

Half the shader clock is about the same as the core clock anyway.
Would the shader clock ever exceed 2.5 times the core clock?

Edison · posted 2010-1-18 16:54

The shader clock now drives the majority of the chip, including the shaders, the texture units, and the new PolyMorph and Raster Engines. Specifically, the texture units, PolyMorph Engine, and Raster Engine all run at 1/2 shader clock (which NVIDIA is tentatively calling the "GPC Clock"), while the L1 cache and the shaders themselves run at the full shader clock.

So setup should be running at 1/2 the shader clock, in other words the GPC clock.
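The clock domains described in the quote can be laid out as a simple mapping. This is a minimal sketch: the unit-to-domain assignment follows the quoted text, but the function name is made up and the 1400 MHz shader clock is a purely hypothetical input, not a GF100 specification.

```python
def fermi_clock_domains(shader_mhz):
    """Map each unit named in the quote to its clock, given a shader clock.
    The GPC clock is half the shader clock."""
    gpc_mhz = shader_mhz / 2
    return {
        "shaders": shader_mhz,
        "l1_cache": shader_mhz,
        "texture_units": gpc_mhz,
        "polymorph_engine": gpc_mhz,
        "raster_engine": gpc_mhz,
    }


if __name__ == "__main__":
    # 1400 MHz is a hypothetical value chosen only to make the halving visible.
    for unit, mhz in fermi_clock_domains(1400).items():
        print(f"{unit}: {mhz:.0f} MHz")
```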

Edison · posted 2010-1-18 21:42

This feature is new as far as G80+ is concerned, but apparently not as far as R520+ is concerned.

pharaohs1024 (user deleted) · posted 2010-1-19 01:27

Notice: the author has been banned or deleted; content automatically hidden.

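Edison · posted 2010-1-19 01:45

Before DX11, the problem with adding more triangles was that the CPU side could not keep up, so games generally kept triangle counts to no more than about 2 million per frame. Take Crysis: at 30 frames per second, that works out to about 60M triangles/s. That throughput is only about 1/10 of the graphics card's geometry throughput.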
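The back-of-the-envelope numbers above work out as follows. This is only a sketch using the rough figures from the post (2 million triangles per frame, 30 fps, and a game load of about 1/10 of the card's geometry throughput); the function names are made up, and the implied GPU figure is derived from that estimate rather than from a measured specification.

```python
def triangles_per_second(triangles_per_frame, fps):
    """Triangle rate a game asks for at a given frame rate."""
    return triangles_per_frame * fps


def implied_gpu_rate(game_rate, gpu_to_game_ratio=10):
    """GPU geometry throughput implied by the 'about 1/10' estimate."""
    return game_rate * gpu_to_game_ratio


game_rate = triangles_per_second(2_000_000, 30)
gpu_rate = implied_gpu_rate(game_rate)

print(f"game: {game_rate / 1e6:.0f}M triangles/s")
print(f"implied GPU throughput: {gpu_rate / 1e6:.0f}M triangles/s")
```

So a pre-DX11 game at this level uses about 60M triangles/s, against an implied GPU geometry throughput on the order of 600M triangles/s.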

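With DX11, because of the tessellator, essentially all of the triangle processing can be done on the GPU, so the demand on rasterization performance naturally goes up.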

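Eji · posted 2010-1-19 10:52

What I admire most about the Fermi whitepaper this time is that it really does manage to cover every base.
GF100 is astonishingly large; without 40nm it simply could not be produced on 55nm, which is why S|A calls it unmanufacturable.
But looking back at GF100's architecture, we can be quite confident that at the same scale it will be faster than both G80 and GT200.

For example, comparing 1 GPC against G92, or a 2-GPC version against GT200, we can see that Fermi will have
1. Fast shader context switching, an all-new cache architecture, and the PolyMorph Engine
2. TMUs that support several important DX11 changes
3. Greatly strengthened new ROPs, plus GDDR5

And these changes made the whole structure only 10% larger. Hey, that is scary.

The shader changes and the ROP improvements alone would let this chip run today's PhysX-enabled games far faster than GT200b, and with DX11 the gap widens further. The same would happen with a 1-GPC version vs G92; don't forget that the GT2x0 family never did ship a product faster than G92....

Of course, whether a 2-GPC product can beat RV870 is an open question, but Fermi really does give confidence that at the same size as the previous generation it will be faster, which is rare in the product history of any GPU vendor.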

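苯苯小哥 · posted 2010-1-19 11:36

GF100 looks like it has a chance to draw level with the 5970.
With GF104 at dual 256, I wonder whether the mid-range derivatives below it will include a native dual-128 part; that way the full DX11 lineup could roll out quickly from the high end to the low end.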
