POPPUR爱换

Thread starter: Edison

NVIDIA Next-Generation Architecture "Fermi": Speculation & Discussion Thread

181#
Posted on 2009-4-25 03:56
Indeed, from a computation standpoint, we do need more abstraction.....

182#
Posted on 2009-4-27 11:34
Deep stuff; something for me to study.........

183#
Thread starter | Posted on 2009-4-29 01:25
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&p=1&f=G&l=50&d=PG01&S1=%28%22Parallel+Array+Architecture+Graphics+Processor%22.TTL.%29&OS=TTL/"Parallel+Array+Architecture+for+a+Graphics+Processor"&RS=TTL/"Parallel+Array+Architecture+for+a+Graphics+Processor"

Timothy Farrar found some interesting things in the patent above. They may not be directly related to NVIDIA's next-generation architecture, but some of the details described there are worth thinking about:


1. Parallel Fixed Function Units - This patent refers to the optional duplication of all fixed-function hardware (setup, raster, etc.) for an "increase in the processing power or redundancy of the system".

2. Possible Writable on Chip Memory - CUDA's shared memory is referred to as shared register file in the patent. On chip memory refers to cached read-only uniforms and constants in a previous patent, but this patent adds, "On-chip memory is advantageously used to store data that is expected to be used in multiple threads, such as coefficients of attribute equations, which are usable in pixel shader programs, and/or other program data, such as results produced by executing general purpose computing program instructions". Might not mean writable, the results of GP computing could have been from a past kernel call, however does certainly leave the door open for a writable cache.

3. General Purpose Instruction Issue - Current (or perhaps older) GL|DX to CUDA required an expensive "context switch" to my knowledge. Could be a driver issue, or could be that older hardware wasn't able to interleave execution of programs beyond vert/geo/pixel shaders, I don't know. DX11 has a good number of different shader types in the rendering pipeline, so no doubt the hardware is going to have to be able to efficiently issue instructions from multiple different programs/kernels. One wouldn't want too many running at the same time on a given core (because cache pressure increases), but enough to keep the pipeline going, to avoid bubbles during draw call transitions, and possibly support interleaving a mix of ALU and MEM heavy programs to keep hardware utilization high in all ways. Seems as if this patent supports this.

4. No MIMD - Do NOT see anything direct in the patent about Multiple Instruction Multiple Data execution. Instruction issue is SIMD to PEs, "for any given processor cycle the same instruction issued to all P processing engines [PEs]". Where P is the SIMD width. However this patent adds "instruction unit may issue multiple instructions per processing cycle". Could multiple issue enable PE's to keep higher ALU efficiency when PE's are predicated? The patent says, "a SIMD group may include fewer than P threads, in which case some of the processing engines will be idle during cycles when that SIMD group is being processed", perhaps not.

5. Supergroup SIMD -> Possible Lower Overhead or Dynamic Warp Formation? - From the patent, "SIMD groups containing more than P threads (supergroups) can be defined. A supergroup is defined by associating the group index values of two (or more) of the SIMD groups with each other. When issue logic selects a supergroup, it issues the same instruction twice on two successive cycles". This can better amortize the instruction issue cost over more instructions, and can be used in combination with double clocking the PEs. I see three possible cases here. (a) That this is in the GT200 (and maybe previous generations), 8-wide SIMD doubled clocked to 16-wide half-warp, and hardware supergroup to 32-wide warp. Maybe with limitations that groups need to be sequential in order for the supergroup. (b) Or that the half-warp to 32-wide warp was done with paired instructions on current hardware, and this is just a more efficient hardware path. (c) Or that with future hardware (possibly GT3xx) this is a way to increase SIMD efficiency through dynamic warp formation. The idea being if keeping with 8-wide SIMD and a double clocked ALU, instruction issue could pick a hi/lo pair of 16-wide SIMD half-warps to issue based on common instruction. Clearly other granularities and options are possible.

6. Work Distribution - The patent leaves open many options: "shaders of a certain type may be restricted to executing in certain processing clusters or in certain cores", or "in some embodiments, each core might have its own texture pipeline or might leverage general-purpose functional units to perform texture computations". Data distribution is covered in multiple forms. Broadcast: "receive information indicating the availability of processing clusters or individual cores to handle additional threads of various types and select a destination". Ring: "input data is forwarded from one processing cluster to the next until a processing cluster with capacity to process the data accepts the data". Function of input: "processing clusters are selected based on properties of the input data".


In fact, "shared register" is also what DX11 compute shaders call this kind of memory, and the work-distribution scheme here bears distinctly DX11-like characteristics.
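To make points 4 and 5 concrete, here is a toy Python model of the issue scheme the patent language suggests: P engines receive the same instruction each cycle, a group with fewer than P active threads leaves engines idle, and a "supergroup" issues one instruction on two successive cycles. The width P, the group sizes, and the accounting are illustrative assumptions, not NVIDIA's actual hardware parameters.

```python
# Toy model of SIMD instruction issue as described in the patent text above.
# P processing engines receive the same instruction every cycle; a SIMD group
# with fewer than P active threads leaves some engines idle, while a
# "supergroup" amortizes one instruction fetch over two successive cycles.

P = 8  # SIMD width: engines per core (illustrative, not a real NVIDIA figure)

def issue(groups):
    """groups: list of active-thread counts, one per SIMD group (each <= 2*P).
    Returns (cycles, instructions_fetched, pe_utilization)."""
    cycles = fetches = busy = 0
    for active in groups:
        fetches += 1                     # one instruction fetch per (super)group
        span = 2 if active > P else 1    # a supergroup issues on two cycles
        for c in range(span):
            lanes = min(P, max(0, active - c * P))
            busy += lanes
            cycles += 1
    return cycles, fetches, busy / (cycles * P)

# A full supergroup of 16 threads vs. two half-empty groups of 4 threads:
print(issue([16]))        # (2, 1, 1.0): one fetch feeds two full cycles
print(issue([4, 4]))      # (2, 2, 0.5): half the engines sit idle
```

The second case illustrates the passage quoted in point 4: when a group has fewer than P threads, some engines are simply idle while that group is processed, which is why dynamic formation of fuller groups (point 5, case c) would raise efficiency.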

184#
Posted on 2009-4-30 00:34
How many times an 8800 Ultra is that?

185#
Posted on 2009-5-2 01:19

Quoting Edison (posted 2009-4-29 01:25):
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&p=1&f=G&l=50&d=PG01&S1=%28%22Parallel+Array+Architecture+Graphics+Processor%22.TTL.%29&OS=T ...


In DX, Register/Array/Buffer/Memory are quite blurry concepts; the architects have to draw those lines themselves when implementing the GPU design. So "shared register" and "shared memory" can be considered equivalent.

As for this document: in fact, every one of NV's published patent documents is a report that gives the game away.

potomac (user deleted)
186#
Posted on 2009-5-2 16:19
Note: the author has been banned or deleted; the content is automatically hidden.

187#
Posted on 2009-5-2 22:18
I just want to know: how many nanometers will the next generation be?

188#
Posted on 2009-5-4 21:20
If they build CUDA out even further, the prospects are boundless.

189#
Thread starter | Posted on 2009-5-8 17:32
I just went through the May 7 earnings call, in which NVIDIA CEO Jen-Hsun Huang said the following about progress on the 40-nanometer process:

Hans Mosesmann - Raymond James

Thanks. Jen-Hsun, on the 40-nanometer, since you brought it up, I know you are not going to talk about new products that are not announced but can you give us an update on how the ramp is going relative to previous process nodes in terms of the ramp, the volumes, and can you give us an update on what percentage of the mix by the end of the year could be coming through that process node? Thanks.

Jen-Hsun Huang

Let’s see -- the ramp is going fine. It’s during the -- you know, we are ramping 40-nanometer probably harder than anybody and so we have three products in line now in 40-nanometer and more going shortly. So there’s -- this is a very important node. TSMC is working very hard. We have a vast majority of their line cranking right now with new products, and so we are monitoring yields and they are improving nicely week-to-week-to-week, and so at this point, there’s really not much to report.

In terms of the mix towards the end of the year, let’s see -- I haven’t -- my rough math would suggest about 25%, is my guess. I mean, there’s still going to be a lot of 55-nanometer products. A lot of our MCP products, ION, for example, is still based on 55-nanometer and ION is going to be running pretty hard. I think you heard in David’s comments earlier that our Intel chipset product line is our fastest growing business and so my sense is that that’s going to continue to be successful and that is still in 55-nanometer. So I would say roughly 25% to 30% is my rough estimate going into the end of the year.


So by the end of this year, 40-nanometer products should make up roughly 25%-30% of NVIDIA's mix.

190#
Posted on 2009-5-13 20:02
Supporting the OP; bumping this up, haha....

191#
Thread starter | Posted on 2009-5-15 21:03
I just found a very interesting paper on gpgpu.org:

http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf

In this paper we studied the performance of twelve contemporary CUDA applications by running them on a detailed performance simulator that simulates NVIDIA’s parallel thread execution (PTX) virtual instruction set architecture. We presented performance data and detailed analysis of performance bottlenecks, which differ in type and scope from application to application.

First, we found that generally performance of these applications is more sensitive to interconnection network bisection bandwidth rather than (zero load) latency: Reducing interconnect bandwidth by 50% is even more harmful than increasing the per-router latency by 5.3× from 3 cycles to 19 cycles.

Second, we showed that caching global and local memory accesses can cause performance degradation for benchmarks where these accesses do not exhibit temporal or spatial locality.

Third, we observed that sometimes running fewer CTAs concurrently than the limit imposed by on-chip resources can improve performance by reducing contention in the memory system.

Finally, aggressive inter-warp memory coalescing can improve performance in some applications by up to 41%.

From any angle, this paper is well worth a read :)
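The paper's fourth finding, inter-warp memory coalescing, can be sketched with a toy Python model: addresses falling in the same aligned segment are served by one memory transaction, and pooling requests across warps can merge transactions that per-warp coalescing would issue separately. The segment size and the address patterns are illustrative, not figures from the paper.

```python
# Toy model of memory-coalescing granularity. One memory transaction serves
# every request that falls inside the same aligned SEG-byte segment;
# coalescing across warps can merge requests that per-warp coalescing
# would have issued as separate transactions.

SEG = 64  # bytes per memory transaction (illustrative)

def transactions(warp_requests, inter_warp=False):
    """warp_requests: list of per-warp byte-address lists.
    Returns the number of memory transactions issued."""
    if inter_warp:
        warp_requests = [sum(warp_requests, [])]   # pool all warps' requests
    total = 0
    for addrs in warp_requests:
        total += len({a // SEG for a in addrs})    # one txn per touched segment
    return total

# Two warps whose requests interleave across the same two segments:
warps = [[0, 64], [8, 72]]
print(transactions(warps))                   # 4: each warp touches both segments
print(transactions(warps, inter_warp=True))  # 2: the warps' requests merge
```

Halving the transaction count in this access pattern is the kind of effect behind the up-to-41% speedup the abstract reports for some applications.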

potomac (user deleted)
192#
Posted on 2009-5-17 12:46
Note: the author has been banned or deleted; the content is automatically hidden.

193#
Thread starter | Posted on 2009-5-17 12:58
An ISA-based NVIDIA GPU architecture will probably take a few more years to appear; counting on it in the DX11 generation is a very long shot.

194#
Posted on 2009-5-17 21:55
Last edited by ic.expert on 2009-5-17 22:13

If we treat fixed-function units such as the TMU, rasterizer, and ROPs as peripherals of the shader units, then a shader-unit-based GPU already counts as an ISA-based architecture.

Future GPUs, however, need more efficiency, so that power consumption can be kept under the necessary control; flexibility without efficiency is not worth having. In practice a reform of that kind would touch the entire graphics pipeline.

From a technical standpoint, Microsoft's DX series is actually a rather poor platform: it cuts off other vendors' ability to extend the graphics API. Take order-independent transparency, for example: if DX were not in the way, NV might long ago have implemented the feature with an improved F-buffer. And SM4 introduced a bad trait, piling on shader-program flexibility with no discipline. The result is that GPU power draw keeps climbing and designs keep getting more complex. The cost gets passed along, and in the end it is the consumer who pays.

Honest advice grates on the ear, and this is only one person's opinion; if anything I have said gives offense, I hope the honorable moderator Chen Yanchu will be forgiving.

RacingPHT (user deleted)
195#
Posted on 2009-5-18 18:27
Note: the author has been banned or deleted; the content is automatically hidden.

196#
Posted on 2009-5-18 19:18
First, comparing the GS with the F-buffer is inappropriate. For one thing, on the data-fill side the GS requires in-order writes while the F-buffer does not. For another, on the data-read side the GS has no restrictions, but neither does the F-buffer; moreover, in the second pass the F-buffer can be treated, feature-wise, as a constant buffer, which is more flexible and can be accelerated by exploiting stream-processing characteristics. None of this hurts performance. Granted, the F-buffer as written up in the original paper is not very efficient in itself, but its refinements are a different matter: techniques such as parallel-to-serial conversion can compensate for that, though I had better not say too much about the specific engineering details involved.

Second, the DX spec has always been whatever Microsoft says it is. Vendors can only negotiate with Microsoft on issues they share in common; on vendor-specific issues Microsoft will not budge. DX10 is a good example: when Microsoft drew up the spec it considered OS concerns first and graphics concerns second; otherwise that low-latency context switch requirement would not have provoked a joint protest from the vendors of the day. During the discussions with Microsoft over it, KK even walked out mid-meeting to privately lobby people from other vendors on how to force Microsoft to drop the feature.

Finally, the technical details above are not my main point. The main point is that the first consideration in designing a system should be efficiency, meaning both the productivity of programmers in the target domain and the efficiency of running programs, with flexibility or extensibility only second. Microsoft's approach closes off extensibility, and that slows the transfer of ideas from academia to industry.

From an industry standpoint, the DX11 standard is really not good enough; there is far too much that could be improved. It is only because today's entertainment applications are not on OpenGL that anyone puts up with this treatment from Microsoft.

197#
Posted on 2009-5-18 19:28
Last edited by ic.expert on 2009-5-18 21:11

It is precisely SM 4.0 that turned modern GPU design into a monstrosity. The so-called massively parallel compute units on the GPU are in fact a rather poor architecture. After SM 3.0 the vendors were basically led around by the nose by Microsoft. I often see people say that NV50 hides latency better than NV40, as in the figure below:

[attached figure: NV40 vs. NV50 latency-hiding comparison]

How to judge that? For a future stream-processing architecture, the figure only shows that NV50 is designed more flexibly than NV40; its understanding of stream computing itself is poorer, its contribution to power and area efficiency is worse, and its scalability is worse. Apart from flexibility, NV50 has nothing over NV40; it is actually a design regression, and this figure shows exactly where NV50's design fails hardest compared with NV40. Anyone who has designed a stream-processor architecture will feel this deeply.

Put it this way: the NV40 shader unit is a stream processor, while the NV50 shader unit, whatever it may be, cannot be a stream processor. If anyone doubts that, let NV dig out the spec and take it to Stanford for sharper scholars to judge. If NV really claims the NV50 shader unit is both more efficient than NV40's and a stream processor, academia would laugh its teeth out.
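The latency-hiding claim being argued over here reduces to a back-of-envelope calculation. Below is a minimal Python sketch, with wholly illustrative numbers (not NV40 or NV50 figures), of how many threads-in-flight a round-robin issue scheme needs so that one warp's memory stall is covered by the others' work.

```python
# Back-of-envelope latency hiding: with round-robin issue among W warps, a
# warp that stalls for L cycles on a memory access is hidden only if the
# remaining W-1 warps can supply at least L cycles of independent work:
#     (W - 1) * issue_cycles_per_warp >= mem_latency
# All numbers below are illustrative, not measurements of any real GPU.

def min_warps_to_hide(mem_latency, issue_cycles_per_warp):
    """Smallest warp count W such that the pipeline never starves while
    one warp waits on memory. issue_cycles_per_warp is the useful work a
    warp issues between consecutive memory requests."""
    W = 1
    while (W - 1) * issue_cycles_per_warp < mem_latency:
        W += 1
    return W

print(min_warps_to_hide(400, 20))   # 21 warps: little work between loads
print(min_warps_to_hide(400, 100))  # 5 warps: more independent work per warp
```

The model makes the trade-off in the post explicit: flexibility in scheduling many small warps buys latency tolerance, but every extra warp in flight costs register file and on-chip memory capacity, which is exactly the power/area objection raised above.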


198#
Posted on 2009-5-18 19:33
Also, setting aside the fact that ATI got the preliminary R300 design framework from a small vendor, R300 itself was built to follow DX9, not DX9 designed around R300. Back then Microsoft dreamed this up to groom a proxy and keep its own authority from slipping away. Plenty of veterans in the graphics field know the story.

199#
Posted on 2009-5-18 19:53
One more thing: whether a technology is accepted by the market depends on two points. First, whether the problem it solves is something many applications urgently need. Second, whether the underlying devices permit solving that need.

The best example is out-of-order execution (OOO) on CPUs. The first OOO implementation dates back to 1966, yet it was not until after 1995 that the technique first entered the x86 world, via AMD's K5. If the technique is so good, why not use it earlier? The reason comes down to exactly those two points. Those two factors are also among the principles that architecture design follows.

The same reasoning explains why order-independent transparency techniques such as the F-buffer (and not only the F-buffer) were not applied to graphics hardware earlier. A technique is only a trade-off: once storage capacity and storage bandwidth are large enough, we no longer need in-order draw to solve the semi-transparency problem.

That does not mean the technology became more advanced; it means the underlying devices changed, so our architecture can now produce a more efficient scheme for the current devices. Perhaps the underlying devices will change again someday, the scheme will no longer look efficient, and we will cut it; that is all. So there is never such a thing as an advanced technique, only a technique suited to its current environment.
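The in-order-draw point above lends itself to a tiny sketch: a single-pixel Python model of the F-buffer-style resolve, where fragments are captured in whatever order the application submits them and sorted by depth only at blend time. The function names, the scalar "color", and the depth convention are illustrative simplifications, not any vendor's implementation.

```python
# Sketch of order-independent transparency via deferred per-pixel sorting:
# store every semi-transparent fragment (F-buffer style) instead of
# requiring back-to-front draw order, then sort and blend at resolve time.

def blend(dst, src_color, src_alpha):
    """Standard 'over' blending of one fragment onto the destination."""
    return src_color * src_alpha + dst * (1.0 - src_alpha)

def resolve(fragments, background=0.0):
    """fragments: (depth, color, alpha) tuples in arbitrary submission order;
    larger depth means farther from the camera."""
    color = background
    for _, c, a in sorted(fragments, reverse=True):  # blend far to near
        color = blend(color, c, a)
    return color

in_order  = [(0.9, 1.0, 0.5), (0.4, 0.2, 0.5)]   # drawn back to front
out_order = [(0.4, 0.2, 0.5), (0.9, 1.0, 0.5)]   # same surfaces, wrong order
print(resolve(in_order) == resolve(out_order))   # True: order no longer matters
```

The cost of this freedom is exactly the one named in the post: enough storage and bandwidth to hold every fragment until resolve, which is why the technique waited for the underlying devices to catch up.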

200#
Posted on 2009-5-22 19:27
How did this post from the 18th end up ahead of the others? Strange...
