POPPUR爱换

Thread starter: Edison

NVIDIA Next-Generation Architecture "Fermi": Speculation and Discussion Thread

161#
Posted on 2008-11-17 14:37
Originally posted by gaiban on 2008-11-14 23:53:
Compute the interpolation of fragment colors entirely in the shader program. Compute the perspective-correct interpolation of texture coordinates in the shader program. Some people are going to go crazy again.

X and Y are computed by the SETUP/RASTER unit, while Z is computed in the z-culling coprocessor, which first computes 16X ...


Heh. Whether or not this turns out to be NV's product, as a commercial product this kind of improvement is understandable; after all, the company has to protect its advantage in accumulated technology.

But as research, this kind of improvement offers no real innovation and will be left behind by history...


P.S.: What this revolution will sideline is not just the CISC camp but the Princeton (von Neumann) architecture itself. Harvard architectures and out-of-order execution have already sidelined it to some degree, but this time... let's wait and see the result.

[ Last edited by predaking on 2008-11-17 16:58 ]

162#
Posted on 2008-11-17 18:20
The next generation of game consoles will probably be based on the GT300 core or the RV900 series.

163#
Posted on 2008-11-27 20:59
Originally posted by insect2006 on 2008-11-17 18:20:
The next generation of game consoles will probably be based on the GT300 core or the RV900 series.


According to rumors, the PS4 may not use an NVIDIA GPU.

164#
Posted on 2008-11-29 12:38
Will NVIDIA's next-generation GPU switch the underlying hardware to MIMD as well, becoming a thoroughly MIMD architecture, rather than the G8X/G9X/GT2XX approach, where the underlying hardware is SIMD, albeit gather-scatter SIMD?

165#
Posted on 2008-11-29 20:52
Bumping for the next generation :P

166#
Posted on 2008-11-30 03:07
Drivers too, plus a GPU-based operating system and development tools. If NV gets the software right, it can still accomplish something. The worry is that it cannot take on Intel and Microsoft.

167#
Posted on 2008-11-30 22:00
Originally posted by lffuc on 2008-11-30 03:07:
Drivers too, plus a GPU-based operating system and development tools. If NV gets the software right, it can still accomplish something. The worry is that it cannot take on Intel and Microsoft.


Unless Microsoft is dead set on helping NV, it will be very hard for NV to succeed.

168#
OP | Posted on 2008-12-1 15:22
At SC08 there is a paper from the University of California at Berkeley, "Benchmarking GPUs to Tune Dense Linear Algebra", quite a bit of which is rather interesting. Excerpts below:

On NV50's strip-mined vector execution model, the paper has this to say:

"Partitioning of long vectors into warps by the GPU environment corresponds to strip mining into independent instruction streams. This is an alternative to the more traditional strip mining into independent instructions in the same instruction stream. For example, an operation on a 512-element vector on a machine with VL = 32 is traditionally performed as 16 independent vector instructions. The GPU allows (but not requires) distributing these 16 independent instructions across 16 instruction streams. This is done to improve performance in branching — associating an individual program counter with a short subset of a longvector allows skipping branches not taken by this subset rather than masking them off.

However, strip mining into independent instruction streams is expensive as it requires replicating register data across all instruction streams in the thread. For example, a program operating on 512-element vectors consumes 2KB of register file per every pointer, temporary value or scalar value defined in the scalar thread as a 32-bit register variable.

Another associated overhead is the partitioning of the register data into private register spaces associated with different instruction streams in the thread. Accessing the data residing in the register space of another warp requires staging it via the local store, which incurs costs.

Note that the number of independent instructions supplied by a program does not depend on the kind of strip mining used.

Whether independent instructions come in the same or different streams, they hide memory and pipeline latencies. "
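
To make the two strip-mining styles concrete, here is a minimal CUDA sketch of my own (not from the paper; the kernel names and the SAXPY-style operation are only illustrative). Variant (a) strip-mines a 512-element operation into 16 iterations of a single instruction stream executed by one 32-wide warp, while variant (b) distributes the 16 slices across 16 warps, each with its own program counter:

```cuda
// Illustrative sketch: two ways to strip-mine a 512-element y = a*x + y with VL = 32.
#include <cuda_runtime.h>

// (a) Strip mining within one instruction stream: a single 32-wide warp
//     issues 16 independent iterations of the unrolled loop.
__global__ void saxpy_one_warp(float a, const float* x, float* y)
{
    int lane = threadIdx.x;          // 0..31, one warp per block
    #pragma unroll
    for (int i = 0; i < 16; ++i)     // 16 independent "vector instructions"
        y[i * 32 + lane] = a * x[i * 32 + lane] + y[i * 32 + lane];
}

// (b) Strip mining across instruction streams: 16 warps (512 threads),
//     each warp owns one 32-element slice and keeps its own program counter.
__global__ void saxpy_many_warps(float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 0..511
    y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 512;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    saxpy_one_warp<<<1, 32>>>(2.0f, x, y);     // one stream, 16 iterations
    saxpy_many_warps<<<1, 512>>>(2.0f, x, y);  // 16 streams, 1 iteration each
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

In both variants the hardware sees 16 independent multiply-adds; the difference the paper highlights is only whether they live in one instruction stream or sixteen, with the register-replication cost falling on variant (b).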

在 "Kernel Launch Overhead" 方面,文章作者测试结果如下:

"The minimum time to asynchronously invoke a GPU kernel using either the low-level or the high-level CUDA API was 3–7 μs across a variety of systems equipped with different GPUs, operating systems and CUDA versions. This was measured by asynchronously invoking the same kernel a very large number of times and synchronizing once at the end. The program used was the simplest possible, such as copying one word from one location in the GPU memory to another. This ensures that the program runtime does not contribute substantially to the overall time. The time increases to 10–14 μs when synchronizing at each kernel invocation. This shows the expense of synchronization.

To ensure that we do not sacrifice performance by choosing CUDA for programming the GPU we also measured the overheads in DirectX 9.0c, which is a mature graphics API widely used in computer games. The timings were 7–8 μs for invocation alone and 20–23 μs for invocation with synchronization (synchronization is required when computing with DirectX to ensure correctness, but not in CUDA). This indicates that CUDA is as efficient as or better than DirectX in terms of the launch overhead."
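
As a rough illustration of the methodology described above, here is a sketch I wrote (not the authors' code; the kernel and the use of std::chrono are my assumptions) that launches a trivial kernel asynchronously many times and synchronizes only once at the end, so the average per-launch time approximates the launch overhead:

```cuda
// Launch-overhead microbenchmark sketch: many asynchronous launches of a
// trivial kernel, one synchronization at the end.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Simplest possible kernel: copy one word within GPU memory.
__global__ void copy_one_word(const int* src, int* dst) { *dst = *src; }

int main()
{
    int *src, *dst;
    cudaMalloc(&src, sizeof(int));
    cudaMalloc(&dst, sizeof(int));

    const int launches = 100000;
    copy_one_word<<<1, 1>>>(src, dst);       // warm-up, excluded from timing
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        copy_one_word<<<1, 1>>>(src, dst);   // asynchronous launches
    cudaDeviceSynchronize();                 // synchronize once at the end
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average asynchronous launch overhead: %.2f us\n", us / launches);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```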

在 "CPU-GPU Data Transfers" 方面,有这样的测试结果:

"Our primary system is equipped with a PCIe 1.1 x16 interface that bounds the bandwidth of the CPU-GPU link by 4 GiB/s. We found that transferring contiguous pieces of data with sizes from 1 byte to 100 MB long across this link using pinned memory takes about:

Time = 11 μs + (bytes transferred) / (3.3 GB/s)

This fits the measured data within a few percent. Similar fitting on other systems yielded similar accuracy with different numbers, such as 10–17 μs overheads and 2.2–3.4 GB/s bandwidths.

When using two GPUs in the system, transfers to the second GPU run only at up to 1.8 GB/s, i.e. about what one may expect from PCIe 1.1 x8. This result was obtained when using various x16 slots in the nForce 680i SLI motherboard.

Operating with two GPUs concurrently poses new difficulties. CUDA requires attaching each CPU thread to a fixed GPU context, so multiple CPU threads must be created. According to our experience, pinning of memory is effective only with the GPU context that performed the memory allocation. Other GPU contexts perform at non-pinned rates when operating with this memory space. So, if two GPU contexts run transfers across the same main memory locations, at least one of the contexts will run at the non-pinned transfer rate, which is about 2x lower.

Benchmarks on a few different machines with PCIe 2.0 x16 have shown 3.9–6.1 GB/s transfer rates."
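
The linear model above can be reproduced with a simple host-side timing loop. The sketch below is mine, not the paper's; it times host-to-device copies from pinned memory at several sizes, so the intercept approximates the fixed per-transfer overhead and the slope the sustained link bandwidth:

```cuda
// Transfer-time sketch under the model: Time ≈ overhead + bytes / bandwidth.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t max_bytes = 100u << 20;      // up to 100 MB
    void *host, *dev;
    cudaMallocHost(&host, max_bytes);         // pinned (page-locked) memory
    cudaMalloc(&dev, max_bytes);

    for (size_t bytes = 1; bytes <= max_bytes; bytes *= 16) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // warm-up

        const int reps = 20;
        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        auto t1 = std::chrono::high_resolution_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
        printf("%10zu bytes: %10.1f us  (%.2f GB/s)\n",
               bytes, us, bytes / us / 1e3);   // GB/s = bytes / (us * 1000)
    }

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```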

169#
OP | Posted on 2008-12-1 15:33
[Fig. 1, attached image] Memory latency as revealed by the pointer-chasing benchmark on GeForce 8800 GTX for different kinds of memory accesses.

The data in Fig. 1 suggests a fully associative 16-entry TLB (no TLB overhead for 128MB array, 8MB stride), a 20-way set-associative L1 cache (20KB array at 1KB stride fits in L1), and a 24-way set-associative L2 cache (back to L2 hit latency for 768KB array, 32KB stride). These are the effective numbers and the real implementation might be different. Six 4-way set-associative L2 caches match this data as well.

According to this data, L1 cache has 160 cache lines only (in 8 fully associative sets). This promises a 100% miss rate in every cached access unless scalar threads are sufficiently coordinated to share cache lines.

Fig. 1 also reveals a 470-720 cycle latency non-cached memory access that roughly matches the official 400-600 cycle figure [NVIDIA 2008a, Ch. 5.1.1.3]. To find the total amount of the partitioned cache memory, we run a multithreaded test that utilizes all cores. We run one thread per core (this is enforced by holding a large amount of shared memory per thread), each traversing through a private array so that their working sets do not overlap. The results match the official data, with the effective size of L1 cache scaling with the number of cores. Effective L2 cache size did not scale.



Fig. 2 summarizes the parameters of the memory system of the 8800 GTX, including the findings cited above. A preliminary study shows that the TLB also scales with the number of cores.

Latency to the shared memory is an order of magnitude less than to the cache: 36 cycles. We'll see shortly that it is close to the pipeline latency.
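
For reference, a pointer-chasing probe of the kind used for Fig. 1 can be sketched as follows (my own code; the 8 MB array and 4 KB stride are only example parameters). Each load depends on the previous one, so the time per step approximates the latency of whichever level of the memory hierarchy the chosen array size and stride exercise:

```cuda
// Pointer-chasing latency probe sketch: serialized, dependent loads.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const unsigned* next, unsigned steps, unsigned* sink)
{
    unsigned j = 0;
    for (unsigned i = 0; i < steps; ++i)
        j = next[j];            // each load depends on the previous result
    *sink = j;                  // keep the chain from being optimized away
}

int main()
{
    const unsigned array_bytes = 8u << 20;          // 8 MB working set (example)
    const unsigned stride      = 1024;              // stride in elements = 4 KB
    const unsigned n           = array_bytes / sizeof(unsigned);

    unsigned* h = new unsigned[n];
    for (unsigned i = 0; i < n; ++i)
        h[i] = (i + stride) % n;                    // ring with a fixed stride

    unsigned *d_next, *d_sink;
    cudaMalloc(&d_next, array_bytes);
    cudaMalloc(&d_sink, sizeof(unsigned));
    cudaMemcpy(d_next, h, array_bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const unsigned steps = 1u << 20;
    cudaEventRecord(start);
    chase<<<1, 1>>>(d_next, steps, d_sink);         // one scalar thread
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f ns per dependent load\n", ms * 1e6 / steps);

    delete[] h;
    cudaFree(d_next);
    cudaFree(d_sink);
    return 0;
}
```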


170#
OP | Posted on 2008-12-1 15:52
On "Pipeline Latency":

To measure pipeline latency we execute dependent operations such as a = a * b + c or a = log2 |a| many times in an aggressively unrolled loop, one scalar thread per entire GPU. (We assume that, as on AMD GPUs [AMD 2006], taking the absolute value of an argument does not require a separate instruction.)

We used decuda to ensure that this operation maps to a single native instruction for all but the double-precision tests, which are not supported by this tool. We made sure that arithmetic does not overflow, but assume that execution units are not optimized for special values of operands, such as 0 or 1. The following table lists the average time per instruction in cycles for the GPUs in Table 1 (decimal fractions are not shown but are about 0.1):

For example, the register-to-register multiply-and-add instruction runs at a throughput of 24 cycles per instruction. This number is 6x larger than at peak throughput and is an estimate of the pipeline latency. A 24-cycle latency may be hidden by running 6 warps, or 192 scalar threads, simultaneously per vector core, which explains the number cited in the CUDA guide [NVIDIA 2008a, Ch. 5.1.2.6]. Note that 6 instruction streams is the largest number that may be required to hide this latency.

A smaller number may also be sufficient if instruction-level parallelism is present within the streams. This is an example where strip mining into the same or independent warps makes no difference.

The latencies of the SP and SFU pipelines are similar. The latency of the double-precision pipeline is substantially larger than in single precision. However, less parallelism is needed to hide it when overlapping with other double-precision instructions, as they run at low throughput.
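
A sketch of that dependent-chain measurement, written by me under the assumptions stated in the excerpt (single scalar thread, aggressively unrolled loop of a = a * b + c, operand values chosen so the chain converges without overflow and avoids the special values 0 and 1):

```cuda
// Pipeline-latency probe sketch: a long chain of dependent multiply-adds,
// so the time per operation approaches latency rather than throughput.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dependent_mad(float a, float b, float c, int iters, float* out)
{
    for (int i = 0; i < iters; ++i) {
        #pragma unroll 64
        for (int j = 0; j < 64; ++j)
            a = a * b + c;       // each op depends on the previous result
    }
    *out = a;                    // prevent dead-code elimination
}

int main()
{
    float* d_out;
    cudaMalloc(&d_out, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 1 << 16;                          // 64 * iters dependent MADs
    dependent_mad<<<1, 1>>>(1.5f, 0.7f, 0.4f, iters, d_out);   // warm-up
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    dependent_mad<<<1, 1>>>(1.5f, 0.7f, 0.4f, iters, d_out);   // converges to ~1.33
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f ns per dependent multiply-add\n", ms * 1e6 / (64.0 * iters));

    cudaFree(d_out);
    return 0;
}
```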


171#
Posted on 2008-12-15 02:26
Thanks to Edison for pointing out this paper; it really is very interesting.


Originally posted by Edison on 2008-12-1 15:22:
At SC08 there is a paper from the University of California at Berkeley, "Benchmarking GPUs to Tune Dense Linear Algebra", quite a bit of which is rather interesting. Excerpts below:

On NV50's strip-mined vector execution model, the paper has ...

[ Last edited by bessel on 2008-12-15 02:44 ]

172#
Posted on 2008-12-22 17:31
CUDA fans take note: CUDA 3.0 is due at the end of next year.

173#
Posted on 2009-1-30 07:21
I wonder what the power consumption of the next generation will be?

174#
OP | Posted on 2009-4-23 12:59
At a recent CUDA lecture, NVIDIA for the first time publicly raised the issue of partition camping; see page 46:

http://www.cse.unsw.edu.au/%7Epl ... mizingCUDA_full.pdf

Effective Bandwidth (GB/s), 2048x2048, GTX 280
Simple Copy : 96.9
Shared Memory Copy : 80.9
Naïve Transpose : 2.2
Coalesced Transpose : 16.5
Bank Conflict Free Transpose : 16.6
Diagonal : 69.5
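
For context, the "coalesced" and "bank-conflict-free" variants on that slide stage each tile through shared memory so that both the global load and the global store are contiguous. The sketch below is my own reconstruction of that idea (tile size and kernel name are assumptions), with a +1 padding column to avoid shared-memory bank conflicts; the "diagonal" variant would additionally remap block indices so that concurrently active blocks hit different memory partitions, which is what avoids partition camping.

```cuda
// Tiled transpose sketch: shared-memory staging makes both the load and the
// store coalesced; the +1 padding avoids shared-memory bank conflicts.
#include <cuda_runtime.h>

#define TILE 16

__global__ void transpose_coalesced(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    // Swap block coordinates so the store is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}

int main()
{
    const int width = 2048;                  // 2048 x 2048, as on the slide
    const size_t bytes = size_t(width) * width * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    dim3 block(TILE, TILE);
    dim3 grid(width / TILE, width / TILE);
    transpose_coalesced<<<grid, block>>>(d_in, d_out, width);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In the numbers above, this shared-memory pattern accounts for the jump from 2.2 GB/s (naïve) to about 16.5 GB/s; the remaining gap to the 69.5 GB/s diagonal variant is attributed to partition camping.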

175#
Posted on 2009-4-23 18:05
Deep stuff; something to learn from.

176#
Posted on 2009-4-23 20:47
This amounts to nothing more than exposing the memory controller to the user, letting the user exploit the multi-channel memory controller to lay out their data structures and thus make better use of the memory access bandwidth. For high-performance computing this is actually a step backwards... a good mechanism should be as transparent to the user as possible.

177#
OP | Posted on 2009-4-23 23:55
There is probably no way around this. If you wrote a program for the CPU that needed large-scale data transfers all mapped to the same memory controller, you would also see very poor performance.

178#
Posted on 2009-4-24 00:38
There are quite a few experts here.

179#
Posted on 2009-4-24 01:06
Back in the mainframe era we were already studying the relevant techniques, from UMA to NUMA, CC-NUMA, SMP, MPP and so on. The load-balancing issue you mention does need coordination at the top level, but the bottom level should not be exposed to the programmer. Future high-level parallel languages should be able to tune parallelism statically according to the language specification and dynamically with some hardware support; the one thing they should not do is let the programmer see too much. Programmers should get back to the algorithms themselves.

180#
Posted on 2009-4-24 01:10
Today's so-called high-level programming languages are in fact quite "low-level". To take just one example: when we solve a math problem, do we worry about where to put the "variables"? That something with no bearing on the computation itself has become a prerequisite for being considered a senior programmer is absurd in itself.
