NVIDIA next-generation architecture "Fermi": speculation and discussion thread

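gaiban · posted 2008-9-6 10:24

Originally posted by RacingPHT on 2008-9-6 02:18

The ALU latency of G80 should be far greater than 4.
See the documentation for details: it gives the minimum number of threads needed to avoid register l/s hazards.
I think the 4-cycle design is only there to simplify the scheduler and reduce its cost.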

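As I see it, for the common compute instructions:
An arithmetic instruction has a fixed latency, and different kinds of arithmetic instructions have different latencies.

The latency of the main arithmetic instructions (fadd, fmul, madd, iadd, and so on) should be exactly 4.
Take the two instructions below: instruction 2 depends on instruction 1, and instruction 2 is guaranteed to produce the correct result if it is issued 4 cycles later.
By instruction latency I mean the issue interval between dependent instructions.
Only if the second instruction had to wait more cycles to be correct could you say the latency exceeds 4 cycles.

Instruction 1: fadd fr0=fr1+fr2
Instruction 2: fadd fr2=fr0+fr0

From the PDF: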
Arithmetic Instructions
To issue one instruction for a warp, a multiprocessor takes:
4 clock cycles for:
single-precision floating-point add, multiply, and multiply-add,
integer add,
bitwise operations, compare, min, max, type conversion instruction;
__mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication in 4
clock cycles.

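gaiban · posted 2008-9-6 10:37

The "warp" I mentioned earlier may differ a little from an actual warp. A warp seems to be a sequence of "warps": one warp can contain many "warps", and in general one "warp" can be issued every 4 cycles. What I called a "warp" is really the smallest instruction unit that makes up a warp, a sub-warp or minimal warp that contains just one instruction.

For example, below is one warp containing two "warps": "warp" fr0=fr1+fr2 and "warp" fr2=fr0+fr0
fadd fr0=fr1+fr2
fadd fr2=fr0+fr0

You get what I mean: one warp unit corresponds to one instruction.

[ Last edited by gaiban on 2008-9-6 10:40 ]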

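gaiban · posted 2008-9-6 10:50

Originally posted by RacingPHT on 2008-9-6 02:18

The ALU latency of G80 should be far greater than 4.
See the documentation for details: it gives the minimum number of threads needed to avoid register l/s hazards.
I think the 4-cycle design is only there to simplify the scheduler and reduce its cost.

To be more precise, I think the "4 threads" inside a "warp" are an illusion: 8 ALUs simply run 4 batches to implement 32-way SIMD. As for the register count, a register in one "warp" inherently has to consume 32x8 bytes of register space. Divide by 32 and see what you find: it matches exactly.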
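
The "8 ALUs in 4 batches" reading in the post above can be illustrated with a small sketch. It only assumes the numbers given in the thread (32 threads per warp, 8 SPs); the contiguous thread-to-batch mapping and the function name `warp_batches` are hypothetical and chosen for illustration, not taken from NVIDIA documentation.

```python
WARP_SIZE = 32  # threads per warp, as in the CUDA guide quoted in the thread
SP_LANES = 8    # ALUs (SPs) per multiprocessor, as stated in the post


def warp_batches(warp_size=WARP_SIZE, lanes=SP_LANES):
    """Split one warp instruction into per-cycle batches of `lanes` threads.

    Assumption: threads are taken in contiguous groups. The real
    thread-to-lane mapping is not given in the thread.
    """
    threads = list(range(warp_size))
    return [threads[i:i + lanes] for i in range(0, warp_size, lanes)]


if __name__ == "__main__":
    # One batch per SP cycle: 32 / 8 = 4 cycles to issue one warp instruction.
    for cycle, batch in enumerate(warp_batches()):
        print(f"SP cycle {cycle}: threads {batch[0]}-{batch[-1]}")
```

Under this model a warp instruction occupies the 8 SPs for 4 consecutive cycles, which is where the "4 clock cycles" in the quoted guide would come from.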

[ Last edited by gaiban on 2008-9-6 10:52 ]

RacingPHT · posted 2008-9-6 12:32

Notice: the author has been banned or deleted; the content is automatically hidden.

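gaiban · posted 2008-9-6 15:52

Originally posted by RacingPHT on 2008-9-6 12:32
"To issue one instruction for a warp, a multiprocessor takes: 4 clock cycles for:"
This describes issuing one instruction, which I think has nothing to do with latency. It can be regarded as throughput.

You only see the 4 cycles, so your view is very ...

So for a subsequent register-dependent warp instruction to execute correctly, you are saying it can only be issued after waiting 24 cycles?!
I had assumed 4 cycles would be enough. It really has to wait 24 cycles?!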

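gaiban · posted 2008-9-6 23:42

Originally posted by Edison on 2008-9-4 22:44
:charles:

I drew a SIMT execution diagram for G80/G100 in Excel; it may not be entirely accurate yet:

It is mainly based on:
“ E ...

So how does one HW thread (one fiber) run 64 GPU threads? One possible arrangement is shown in the figure below.

Even with read-after-write dependencies, only 64 GPU threads are needed to hide the latency and run at full load.

G80, by contrast, needs at least 192 active threads.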
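
The 192-thread figure lines up with the 24-cycle dependent latency and the 4-cycle warp issue discussed earlier in the thread. Here is a minimal sketch of that arithmetic; the function name `min_active_threads` and the simple round-robin model (one warp instruction issued every 4 cycles, no other stalls) are assumptions for illustration.

```python
import math


def min_active_threads(dependent_latency_cycles,
                       issue_cycles_per_warp=4,
                       warp_size=32):
    """Threads needed so another warp can always issue while one waits.

    Assumed model: each warp instruction occupies the issue slot for
    `issue_cycles_per_warp` cycles, and a dependent instruction of the
    same warp cannot issue until `dependent_latency_cycles` have passed.
    """
    warps = math.ceil(dependent_latency_cycles / issue_cycles_per_warp)
    return warps * warp_size


if __name__ == "__main__":
    # 24 cycles / 4 cycles per warp = 6 warps, and 6 * 32 = 192 threads.
    print(min_active_threads(24))
```

Under the same model, 64 threads would be enough for an 8-cycle dependent latency, though the thread does not say that this is how the 64-thread figure was derived.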

[ Last edited by gaiban on 2008-9-6 23:46 ]

jackson01 · posted 2008-9-7 02:22

Notice: the author has been banned or deleted; the content is automatically hidden.

akcadia · posted 2008-9-9 10:50

The next-generation GPU will surely be NV30 patched up a bit, reusing a scalar architecture.

panjanstoneborg · posted 2008-9-9 11:15

Why NV30?

Edison · posted 2008-9-11 11:05

Realworld Tech's report on the Tesla II architecture:

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=7

Edison · posted 2008-9-11 11:25

RWT also gives a fairly clear explanation of how "dual issue" is implemented on G100:

Dual 'Issue'

One of the interesting complexities of NVIDIA’s microarchitecture is the relationship between latency and throughput. In general, CPUs execute most operations in a single cycle - but the latency of the fastest operation for an SP core is 4 cycles. Since the SM can issue one warp instruction every 2 'fast' cycles, it should be possible to have multiple instructions in flight at once. In fact, this ability is what NVIDIA refers to as ‘dual issue’ – although in reality it is simply parallel execution across functional units. The SP cores execute one instruction over 4 cycles, and other execution units can be in use, processing a different warp instruction simultaneously.

Figure 6 – Dual 'Issue' for NVIDIA's Execution Units

As Figure 6 illustrates, the SM can issue a warp instruction every 2 clocks. In the first cycle, a MAD is issued to the FPU. Two cycles later, a MUL instruction is issued to the SFU. Two cycles after that, the FPU is free again and can execute another MAD. Two cycles after that, the SFU is free and can begin to execute a long running transcendental instruction. Using this technique, the computational throughput of the shader core is increased by 50%, while retaining the simplicity of issuing only one warp every 2 cycles, which simplifies the scoreboarding logic. Not all combinations can be executed in parallel. For instance, the double precision unit and the single precision units share logic and cannot be active simultaneously as a result.
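
The alternating issue pattern in the quoted passage can be sketched as a tiny scheduler. The 2-cycle issue interval and 4-cycle unit occupancy come from the quote; counting a MAD as 2 flops and a MUL as 1 flop per lane, the strict FPU/SFU alternation, and the names `schedule` and `flops` are assumptions made for illustration.

```python
ISSUE_INTERVAL = 2  # fast cycles between warp-instruction issues (from the quote)
UNIT_BUSY = 4       # fast cycles a unit is occupied per warp instruction (from the quote)
FLOPS = {"MAD": 2, "MUL": 1}  # flops per lane; an assumption used for counting


def schedule(ops, total_cycles):
    """Issue `ops` round-robin, one attempt per issue slot.

    `ops` is a list of (unit, opcode) pairs. An instruction issues only
    if its unit is free; otherwise the slot is left empty.
    Returns a list of (cycle, unit, opcode) tuples.
    """
    free_at = {}
    issued = []
    nxt = 0
    for cycle in range(0, total_cycles, ISSUE_INTERVAL):
        unit, op = ops[nxt % len(ops)]
        if free_at.get(unit, 0) <= cycle:
            issued.append((cycle, unit, op))
            free_at[unit] = cycle + UNIT_BUSY
            nxt += 1
    return issued


def flops(issued):
    """Total flops per lane for an issue trace."""
    return sum(FLOPS[op] for _, _, op in issued)


if __name__ == "__main__":
    mad_only = schedule([("FPU", "MAD")], 16)
    dual = schedule([("FPU", "MAD"), ("SFU", "MUL")], 16)
    for cycle, unit, op in dual:
        print(f"cycle {cycle:2d}: {op} -> {unit}")
    print(flops(dual) / flops(mad_only))  # ratio of dual-issue to MAD-only flops
```

With MAD only, every second issue slot goes unused because the FPU is still busy. Filling those slots with MULs on the SFU gives 3 flops where there were 2, which matches the 50% increase the article states.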

gaiban · posted 2008-9-11 12:11

Originally posted by Edison on 2008-9-11 11:05
Realworld Tech's report on the Tesla II architecture:

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=7

The figure makes it clear: the hardware is still SIMD-based.

gaiban · posted 2008-9-11 12:15

Originally posted by Edison on 2008-9-11 11:25
RWT also gives a fairly clear explanation of how "dual issue" is implemented on G100:

Dual 'Issue'

One of the interesting complexities of NVIDIA’s microarchitecture is the relationship between latency and throughput. In g ...

cuda.pdf says one warp is issued every 4 SP cycles, so RWT has made a slip.
Or else RWT's "fast cycle" is equivalent to 2 SP cycles.

Edison · posted 2008-9-11 12:50

As I recall, the issue logic there runs at 1/2 the SP frequency.

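gaiban · posted 2008-9-11 18:19

I thought carefully about how SIMD branching is implemented, and I believe G80 handles SIMD branches as follows.

Take the simplest case, a basic if-else branch under SIMD. Every if-else maps to assembly as concrete branch instructions: when the condition is false, execution continues; when it is true, execution jumps, either to a later instruction or to an earlier one.

Assume instructions 1-6 are ordinary compute instructions and the branch instruction branch B0 jumps to a later instruction (suppose the condition evaluates to true for three SPs, SP2/6/7, which therefore need to jump ahead to B0). The warp instructions appear in the program in this order:

…
branch B0  // branch instruction: if true, then goto B0; otherwise continue
instruction1  // compute instruction 1
instruction2  // instruction 2
instruction3  // instruction 3
instruction4  // instruction 4
B0:       // branch target B0
instruction5  // instruction 5
instruction6  // instruction 6
…

When the hardware handles this SIMD branch, instructions 1-6 all have to be executed. The execution order and masking pattern are:
…
instruction1&mask2|6|7  // execute instruction 1, but mask the computation on SP2, SP6, SP7
instruction2&mask2|6|7  // execute instruction 2, but mask the computation on SP2, SP6, SP7
instruction3&mask2|6|7  // execute instruction 3, but mask the computation on SP2, SP6, SP7
instruction4&mask2|6|7  // execute instruction 4, but mask the computation on SP2, SP6, SP7
instruction5  // execute instruction 5 normally, i.e. all 8 SPs compute
instruction6  // execute instruction 6 normally, i.e. all 8 SPs compute
…

Branches that jump to an earlier instruction are probably a bit more complicated to handle. More complex branches (for example, ones using several independent counters) or nested branches may need many iterations and run very inefficiently, but the basic idea is the same as above.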
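
The masking scheme described in the post can be sketched as a small simulation. This only models the poster's hypothesis, not confirmed G80 behavior; numbering the SPs 0-7 and the names `run_divergent` and `utilization` are assumptions for illustration.

```python
NUM_SP = 8  # SPs per multiprocessor, as in the post


def run_divergent(taken_lanes, skipped_block, common_block):
    """Return (instruction, active_lanes) pairs in execution order.

    Lanes in `taken_lanes` took the branch, so they are masked off while
    the skipped block runs. Every lane is active again after the target.
    Assumption: SPs are numbered 0..7.
    """
    all_lanes = set(range(NUM_SP))
    not_taken = sorted(all_lanes - set(taken_lanes))
    trace = [(ins, not_taken) for ins in skipped_block]
    trace += [(ins, sorted(all_lanes)) for ins in common_block]
    return trace


def utilization(trace):
    """Fraction of SP slots that did useful work over the whole trace."""
    used = sum(len(lanes) for _, lanes in trace)
    return used / (len(trace) * NUM_SP)


if __name__ == "__main__":
    trace = run_divergent(
        taken_lanes={2, 6, 7},
        skipped_block=["instruction1", "instruction2",
                       "instruction3", "instruction4"],
        common_block=["instruction5", "instruction6"],
    )
    for ins, lanes in trace:
        print(f"{ins}: active SPs {lanes}")
    print(utilization(trace))  # (4*5 + 2*8) / (6*8) = 0.75
```

In this example the three masked SPs sit idle for four instructions, so a quarter of the available SP slots is wasted even for this simple branch.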

[ Last edited by gaiban on 2008-9-11 18:27 ]

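gaiban · posted 2008-10-1 16:25

Originally posted by RacingPHT on 2008-9-6 12:32
"To issue one instruction for a warp, a multiprocessor takes: 4 clock cycles for:"
This describes issuing one instruction, which I think has nothing to do with latency. It can be regarded as throughput.

You only see the 4 cycles, so your view is very ...

NV has now confirmed that the underlying hardware is SIMD-based.
And it closely resembles 32-way SIMD.

My analysis turned out to be extremely accurate.

I was also right that instructions from multiple threads are issued as warp instructions in something like OOO-E fashion.

Of course, Eji really is impressive. Eji was right after all: G80 is an OOO-E processor at the instruction level. Multithreading is a necessary condition for OOO issue. It can also be viewed as data-level OOO. Within a single thread, execution is probably still in-order.

[ Last edited by gaiban on 2008-10-1 16:37 ]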

gaiban · posted 2008-10-1 16:28

John Erik Lindholm, G80's shader architect and CUDA's hardware architect, gives an accurate description of G80's multithreaded SIMD hardware:
System and method for processing thread groups in a SIMD architecture
United States Patent 20070130447, Kind Code: A1

Abstract:
A SIMD processor efficiently utilizes its hardware resources to achieve higher data processing throughput. The effective width of a SIMD processor is extended by clocking the instruction processing side of the SIMD processor at a fraction of the rate of the data processing side and by providing multiple execution pipelines, each with multiple data paths. As a result, higher data processing throughput is achieved while an instruction is fetched and issued once per clock. This configuration also allows a large group of threads to be clustered and executed together through the SIMD processor so that greater memory efficiency can be achieved for certain types of operations like texture memory accesses performed in connection with graphics processing.
http://www.freepatentsonline.com/y2007/0130447.html

[ Last edited by gaiban on 2008-10-1 16:59 ]

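tootoo · posted 2008-10-1 22:17

So much to read {sweat:]

vp · posted 2008-10-2 09:31

I basically agree with gaiban. Inferring this much from what NV has published is no small feat. NV's underlying hardware is most likely SIMD-based; only the marketing wording differs.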

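gaiban · posted 2008-10-3 15:35

Originally posted by predaking on 2008-10-3 12:13
Heh, come on, building an RF out of SRAM is second-year undergraduate level on the mainland ...

Fine, say whatever you like... And finally, thank you for commenting on my suggestion.

Never mind, there should be less trash talk.

In the spirit of keeping the discussion on G80 technology, I would like to discuss texture coordinate generation with you. Specifically, where is perspective-correct interpolation of texture coordinates done: in the rasterization stage, with the result stored in the pixel input buffer and passed to the PS, in the shader unit, or in the TMU?
Does the real low-level shader ISA code (the shader binary program) contain code that interpolates texture coordinates before sampling a texture?
It should be a small question, so please answer it {lol:]{lol:]
A hint: there is a little trap in it {lol:]{lol:]

[ Last edited by gaiban on 2008-10-3 15:40 ]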
