NVIDIA Fermi GF100 and GF1XX Architecture Discussion

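skywalker_hao · posted 2010-1-19 14:22

GF100 looks like it has a chance of matching the 5970.
GF104 is dual 256. I wonder whether this mid-range part will spawn a native dual-128 chip further down, so the lineup can be rolled out quickly from high end to low ...
苯苯小哥 posted 2010-1-19 11:36

A dual-128 part definitely exists.
After all, this structure looks like a quad-core design, and dual 128 is exactly a single core.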

disruptor · posted 2010-1-19 15:48

Nastran will only get a CUDA-enabled version in Q3? Then ANSYS won't get one for ages.

Edison · posted 2010-1-19 17:25

Second half of the year; it may well be Q4.

lufan · posted 2010-1-26 22:17

I read through the replies on the first page carefully. Very professional.

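panjanstoneborg · posted 2010-2-4 09:51

How much does Fermi improve over G80 at the same specs?
Take a 1-GPC version: 128 SP, 64 TF, 16 ROP.
These specs are close to G92, but G92 at 738/1836/2200 roughly ties RV770 at 625/1993 in benchmarks.
Without a real improvement, how much of an edge would a 128-SP Fermi have over the 5770, even with the shader clock at 1800?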

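skywalker_hao · posted 2010-2-4 14:34

Last edited by skywalker_hao on 2010-2-4 14:36

How much does Fermi improve over G80 at the same specs?
Take a 1-GPC version: 128 SP, 64 TF, 16 ROP.
These specs are close to G92, but G92 at 7 ...
panjanstoneborg posted 2010-2-4 09:51

With this design, raising the GPU clock should be easier than it was on G80, GT200 and the like (G92 and GT200 lost out mainly because of GPU clock).
Whether it can beat the 5770 depends on where NV sets the clocks.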

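asd1508 · posted 2010-2-7 20:28

Clock speed is the key. TSMC's current manufacturing capability has limited what the first Fermi revision can deliver, but everything it is supposed to have is still there, and we will see it soon.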

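denev2004 · posted 2010-2-8 18:20

How much does Fermi improve over G80 at the same specs?
Take a 1-GPC version: 128 SP, 64 TF, 16 ROP.
These specs are close to G92, but G92 at 7 ...
panjanstoneborg posted 2010-2-4 09:51

There is one issue, though: I feel Fermi's per-SP performance is actually going down. It is hard to say for sure. It mainly comes down to clock speed.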

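panjanstoneborg · posted 2010-2-9 17:26

Reply to 107# denev2004

Yes, clock speed is still the key.
But leaving memory aside, 738/1836 only manages to tie 625. Although the 5 series seems weaker than the 4 series clock for clock, without 1.8 GHz it will still be hard to deal with the 850 MHz 5770. (On second thought, it probably won't be far off. NV's marketing is sharper than AMD/ATI's, so it will sell regardless, but if the performance brings no surprise it will be a letdown.)
And when is this thing coming out? The 260 is out of stock. Will it stay out of stock until Q3? Is NV not doing business this year?

Dual issue probably won't help much, will it?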

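denev2004 · posted 2010-2-13 12:01

Reply to denev2004

Yes, clock speed is still the key.
But leaving memory aside, 738/1836 only manages to tie 625. Although the 5 series seems, clock for clock, not ...
panjanstoneborg posted 2010-2-9 17:26

Not sure. Dual issue should bring some efficiency gain. But the earlier design was only pseudo dual issue, so the gain may indeed be small.

What I actually want to know is whether the part NV pits against the 5770 will be the 1-GPC version. The 2-GPC version seems more likely to me.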

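panjanstoneborg · posted 2010-2-13 21:29

I recall Edison hinting that the 2-GPC version is about the same size as the 65 nm G92?
Then the 1-GPC version should be about the same size as the 5770.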

Edison · posted 2010-2-13 22:33

Larger than the 65 nm G92.

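lik · posted 2010-2-14 11:22

The 2 GPC mentioned here isn't GF100 but other chips such as GF101/102/103/104, right? If it is GF100, the die is the same size no matter how many GPCs are enabled. Every die comes out with 4 GPCs, but the dies TSMC produces vary in quality, and on some of them certain GPCs don't work, so those GPCs are disabled. As a result some dies have 1 usable GPC and some have 2, but the die area doesn't change.

I recall Edison hinting that the 2-GPC version is about the same size as the 65 nm G92?
Then the 1-GPC version should be about the same size as the 5770.
panjanstoneborg posted 2010-2-13 21:29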

Edison · posted 2010-2-14 13:15

Fermi's unit disabling is quite flexible: units can be disabled at SM granularity.

Edison · posted 2010-2-17 14:36

On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well). Unless texture fetches provide other benefits such as address calculations or texture filtering (Section 5.3.2.5), this optimization can be counter-productive on devices of compute capability 2.0, however, since global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache.

The shared memory hardware is improved on devices of compute capability 2.0 to support multiple broadcast words and to generate fewer bank conflicts for accesses of 8-bits, 16-bits, 64-bits, or 128-bits per thread (Section G.4.3).
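
The bank-conflict behaviour described in the excerpt above can be illustrated with a rough Python model. It assumes 32 shared-memory banks of 4-byte words and a 32-thread warp on compute capability 2.0, and it assumes that threads reading the same word are served by a broadcast. The helper name `conflict_degree` is made up for this sketch; it counts how many distinct words land in the busiest bank, which in this simplified model is the number of serialized passes.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: banks per SM on compute capability 2.0
WORD_BYTES = 4   # assumption: each bank is one 32-bit word wide

def conflict_degree(byte_addresses):
    """Return the number of distinct 32-bit words in the busiest bank.

    1 means conflict-free in this model; N means an N-way conflict.
    Threads touching the same word count once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)
unit_stride = [4 * t for t in warp]    # one 32-bit word per thread
stride_two = [8 * t for t in warp]     # every other word
stride_32 = [128 * t for t in warp]    # all words fall in bank 0
same_word = [0 for _ in warp]          # every thread reads word 0
byte_access = [t for t in warp]        # one byte per thread

for name, pattern in [("unit_stride", unit_stride), ("stride_two", stride_two),
                      ("stride_32", stride_32), ("same_word", same_word),
                      ("byte_access", byte_access)]:
    print(name, conflict_degree(pattern))
```

In this model the per-thread byte pattern comes out conflict-free, because four neighbouring threads share one word. That matches the excerpt's point about 8-bit accesses. Real hardware has more cases (64-bit and 128-bit accesses in particular), which this sketch does not cover.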

Edison · posted 2010-3-29 14:10

An explanation of the GTX 480 ROP performance issue:

http://www.pcgameshardware.com/aid,743526/Some-gory-guts-of-Geforce-GTX-470/480-explained/News/

What about the fillrate? Pixel fillrate, that is. Some fillrate tests from 3DMark runs indicated a pixel throughput not worthy of the mighty 48 ROP-Units a GTX 480 has to do it's engines bidding. Earlier in our conversations with Nvidia, they said a full-blown Fermi chip could have a throughput of 32 pixels per clock in the shader-engine and 256 z-samples if the data is compressible. But how does this change with actual products like GTX 480 and GTX 470? According to Nvidia this throughput can change at either the GPC or SM level. A 15 SM configuration like GTX 480 would be limited to 30 pixels per clock due to the SM count, for example.

The raw ROP throughput of 48 seems to be higher than the maximum number of pixels the shader-engine can supply (max. 32) - what's Nvidias take on this? They said, the ROP-throughput was sample based whereas the shader engines are pixel based [note that a pixel can have multiple samples of z (for depth information)]. This is important for AA rendering where complex scenes will have significant portions that are uncompressed. For example in 8xAA, the peak GPC output rate is 32*8 = 256 samples per clk, whereas the peak ROP rate is 48 samples per clk. Improved performance on AA rendering was the main objective for the increased ROP horsepower.
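
The arithmetic in the quoted explanation can be sketched in a few lines of Python. The per-SM and per-GPC rates below are assumptions inferred from the article's own numbers (15 SMs giving 30 pixels per clock, and 32 pixels per clock for a full chip, taken here to mean 4 GPCs and 16 SMs). The function names are made up for this sketch.

```python
# Assumed per-unit rates, inferred from the quoted article (not official figures).
PIXELS_PER_CLK_PER_SM = 2    # 15 SMs -> 30 pixels/clk in the article's example
PIXELS_PER_CLK_PER_GPC = 8   # assumes a full chip has 4 GPCs -> 32 pixels/clk

def shader_pixel_rate(num_gpcs, num_sms):
    """Peak pixels per clock the shader engine can supply (GPC or SM limited)."""
    return min(num_gpcs * PIXELS_PER_CLK_PER_GPC,
               num_sms * PIXELS_PER_CLK_PER_SM)

def sample_demand(pixels_per_clk, aa_samples):
    """Samples per clock produced when every pixel carries aa_samples samples."""
    return pixels_per_clk * aa_samples

def rop_limited(pixels_per_clk, aa_samples, rop_samples_per_clk):
    """True when uncompressed sample output exceeds what the ROPs can take."""
    return sample_demand(pixels_per_clk, aa_samples) > rop_samples_per_clk

full_chip = shader_pixel_rate(num_gpcs=4, num_sms=16)
gtx480 = shader_pixel_rate(num_gpcs=4, num_sms=15)

print("full chip pixels/clk:", full_chip)
print("GTX 480 pixels/clk:", gtx480)
print("8xAA samples/clk (full chip):", sample_demand(full_chip, 8))
print("GTX 480 ROP limited without AA:", rop_limited(gtx480, 1, 48))
print("GTX 480 ROP limited at 8xAA:", rop_limited(gtx480, 8, 48))
```

Under these assumptions the 48 ROPs have headroom without AA, but the sample rate at 8xAA far exceeds them. That is consistent with the article's statement that the extra ROPs are aimed at AA rendering.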

Edison · posted 2010-3-29 14:30

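Edison · posted 2010-4-5 00:15

Fermi's L2 cache now holds all data and instructions, including texture, instruction, vertex, and compute data. The L1 cache, on the other hand, is still split into texture, instruction, and data. Separating instruction from data is what almost all current CPUs do, which is understandable since the two really are quite different. Keeping texture and data separate seems somewhat wasteful. Then again, texture access works very differently from the data cache (it follows so-called templates), so keeping them separate lets the data cache be made faster.

Earlier GPUs were not very sensitive to cache, or did not place much emphasis on it, because once a GPU had fetched data it could always broadcast it "inside".

By the way, RV870 reportedly suffers badly on register spill.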

32nm · posted 2010-4-10 12:30

Fermi's low TMU count is a major weakness. Why not run the TMUs at the same clock as the shaders?

SevenEleven · posted 2010-4-12 14:45

Fermi is said to lean toward scientific computing, but there is no software support for it.
