NVIDIA Fermi Next Generation GPU Architecture Overview

1#
Posted on 2009-10-1 08:45
GT300? Yes please

Today we are able to reveal some of the more interesting features of NVIDIA's next-generation GPU architecture, known internally as "Fermi". NVIDIA refers to Fermi as the "most significant leap forward in GPU architecture since the original G80", and after reading through the documentation it is hard to argue against that case. The GT200 architecture that powers today's GTX 285 and GTX 295 was a big improvement over G80, though it was fundamentally based on the same design principles.


The massive 3.0 billion-transistor, 512 SP Fermi core.  I think I can see my house from here.


NVIDIA Fermi takes GPU computing another step forward, and that is clearly the primary goal of the new architecture.  We will see that NVIDIA has focused on items like double precision floating point, memory technologies such as ECC support and caches, and context switching between GPU applications to directly target its CUDA architecture and what NVIDIA believes is the future of parallel computing.

The Fermi Architecture

At a high level, the new Fermi architecture was designed to map directly to NVIDIA’s interpretation of CUDA computing going forward.  In this program execution model there are threads, thread blocks, and grids of thread blocks that all differentiate themselves based on memory access and kernel execution.




A thread block is a group of threads that can cooperate with each other and communicate via per-block shared memory.  Each block supports as many as 1536 concurrent threads, each of which has separate access to individual memory, counters, registers, etc.  Each grid is then an array of thread blocks that run the same kernel and can read from and write to global memory (but only after kernel-wide global synchronization).
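To make those abstractions concrete, here is a minimal CUDA sketch (a hypothetical kernel, not anything NVIDIA has shown): a grid of thread blocks is launched, each block cooperates through its per-block shared memory and a barrier, and every thread computes its own index from its block and thread IDs.

// Hypothetical example: one partial sum per thread block, using per-block
// shared memory and block-wide synchronization.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[256];              // per-block shared memory

    int tid = threadIdx.x;                      // index within the thread block
    int gid = blockIdx.x * blockDim.x + tid;    // global index within the grid

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                            // block-wide barrier

    // Tree reduction carried out cooperatively by the threads of this block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResults[blockIdx.x] = partial[0];  // one result per block
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    blockSum<<<blocks, threads>>>(in, out, n);  // launch a grid of 'blocks' thread blocks
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

In hardware terms, each SM would pick up one or more of these blocks and run their threads in warps of 32, which is the mapping described next.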

These software abstractions map onto NVIDIA's hardware in the form of the GPU, streaming multiprocessors and CUDA cores.  The GPU itself operates on the grids of thread blocks, each streaming multiprocessor (SM) executes one or more thread blocks, and the individual CUDA cores (as NVIDIA is calling them now) execute the threads.  The SMs execute threads in groups of 32 called "warps", which helps improve the efficiency of the GPU.



The first implementation of this architecture, which we are tentatively calling GT300, will have some impressive raw specifications.  The GPU is made up of 3.0 billion transistors and features 512 CUDA processing cores organized into 16 streaming multiprocessors of 32 cores each.  The memory architecture is built around a new GDDR5 implementation and has six 64-bit channels for a total memory bus width of 384 bits.  The memory system can technically support up to 6GB of memory as well – something that is key for HPC applications.



Each SM includes 32 CUDA processing cores (4x the previous GT200 design), as you can see above, but also introduces other new features to help improve performance.  Each processor includes a fully pipelined integer and floating point unit that implements the newer IEEE 754-2008 standard – another important move for GPU computing.  The new Evergreen core from AMD also implements this standard, as it adds support for the fused multiply-add instruction.
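As a quick, hypothetical illustration of what that standard buys you in code, CUDA exposes a fused multiply-add that performs a*b + c with a single rounding step:

// fma(a, b, c) computes a*b + c with one rounding, per IEEE 754-2008.
__global__ void fmaExample(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fma(a[i], b[i], c[i]);   // double precision fused multiply-add
}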

Also included in each SM are 16 load/store units and 4 special function units to handle calculations like sine and cosine.

NVIDIA is claiming that the double precision performance of the Fermi architecture will be greatly improved over the existing GT200 design.



With NVIDIA claiming to be 4.25x faster than GT200, that puts the GT300 at about 330 GFLOPS of double precision performance (based on the roughly 78 GFLOPS the GT200 achieves).  (UPDATE: During Jen-Hsun's keynote at the NVIDIA GPU Tech Conference, they stated the peak DP performance increase was "8x".  If that's the case, GT300 could reach as high as 624 GFLOPS.  We will find out the final answer soon.)  While definitely an impressive improvement, AMD's new Evergreen family reaches a theoretical peak of 544 GFLOPS of double precision performance, so we definitely need to keep an eye on these numbers as actual hardware from NVIDIA hits the streets.
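For reference, the arithmetic behind those estimates, starting from the roughly 78 GFLOPS double precision figure quoted above for GT200, is simply:

78 \times 4.25 \approx 331.5\ \text{GFLOPS}, \qquad 78 \times 8 = 624\ \text{GFLOPS}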





Fermi Architecture continued

You heard me previously mention the "warps" – the groupings of 32 threads that a single SM will process.



Each SM features a pair of warp schedulers and instruction dispatch units that allow two warps to be executed concurrently on the CUDA cores.  Each warp dispatches instructions to 16 of the cores, 16 of the load/store units or half of the special function units – the warps then execute independently without scheduler assistance.  This dual-issue model will apparently allow Fermi to come close to its theoretical performance limits.

In the SM diagram above you can also see a block of 64KB of shared memory and L1 cache.  This memory is unique in that it can be configured either as 48KB of shared memory with 16KB of L1 cache or as 16KB of shared memory with 48KB of L1 cache.  This option was required to guarantee 100% backwards compatibility with existing GPU-based applications, but it also gives developers flexibility based on their programs' needs.
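As a sketch of how a developer might express that choice, one way is the CUDA runtime's per-kernel cache preference hint (the exact mechanism NVIDIA will document for Fermi is an assumption here):

// Hypothetical kernel whose shared-memory/L1 split we want to influence.
__global__ void tileKernel(float *data) { data[threadIdx.x] *= 2.0f; }

void configureCache()
{
    // Prefer 48KB shared memory / 16KB L1 for kernels that lean on shared memory...
    cudaFuncSetCacheConfig(tileKernel, cudaFuncCachePreferShared);

    // ...or prefer 48KB L1 / 16KB shared memory when register spills dominate:
    // cudaFuncSetCacheConfig(tileKernel, cudaFuncCachePreferL1);
}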




Here you can see a specification breakdown of the new GPU architecture compared to G80 and GT200.  At this time NVIDIA is not making any claims against current or upcoming AMD designs, though whether that is because NVIDIA would not compare favorably or because the company is simply taking the high ground has yet to be seen.

Besides these raw compute capabilities, there are some new features that NVIDIA is hoping will help Fermi differentiate itself from the competition.  The first is a new ISA (instruction set architecture) that is updated to support the most popular programming language today: C++.  By including support for a unified address space, NVIDIA's architecture can now support object-oriented programming models with unlimited and unrestricted pointer locations.  This feature alone could draw a lot of developers into the world of CUDA and GPU computing.
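To give a hedged sense of what that enables, here is a small, hypothetical sketch of object-oriented device code: an ordinary struct with a member function is used directly inside a kernel through a plain pointer, the sort of pattern a unified address space makes natural.

// Hypothetical example of C++-style device code with generic pointers.
struct Particle {
    float x, y, z;
    __device__ void advance(float dx, float dy, float dz) { x += dx; y += dy; z += dz; }
};

__global__ void stepParticles(Particle *p, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].advance(dt, dt, dt);   // member-function call through a plain pointer
}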

NVIDIA was quick to point out that this new ISA, and the architecture in general, is completely ready for OpenCL and DirectCompute.  The sharing of key abstractions like threads, blocks and grids is key to optimization for these upcoming compute languages.

The new parallel thread execution model implements improved branching support through predication.  By essentially looking ahead into the branching code (if-else), Fermi is able to improve the performance of both gaming and GPU computing code.  This feature sounds very similar to the branch prediction units that AMD implemented on its GPUs a couple of generations ago.
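For illustration only (this is not NVIDIA's example), the short, data-dependent if/else below is the kind of code predication targets: both sides are cheap enough that the compiler can issue them as predicated instructions rather than an actual branch.

// Hypothetical kernel with a short, data-dependent branch.
__global__ void clampAndScale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f)      // short branch body: a predication candidate
            data[i] = 0.0f;
        else
            data[i] *= 2.0f;
    }
}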

Memory Subsystem Innovations

While we have already discussed the benefits of the 64KB of shared memory/L1 cache, there are other changes that NVIDIA has made with Fermi to improve computing performance.




Applications that benefit from additional shared memory will have that option, up to 48KB, but will still have access to the L1 cache that is unique to this design.  The L1 will store temporary register spills and thus can improve overall memory access times.

NVIDIA has also included a new 768KB L2 cache that is shared and coherent across all 16 SMs in the GPU.  The L2 cache improves communication between the various SMs for applications that span more than a single set of 32 CUDA cores.

NVIDIA has also taken the step of implementing all major internal memories with support for ECC.  While not a consumer-level concern, ECC is a key component of a stable environment for very large server processing farms that have to worry about single bit-flips due to random radiation.  The GDDR5 memory controller supports ECC, as do the internal registers and the L1 and L2 caches.
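A small sketch of how an application could check that ECC is actually enabled on a device, using the CUDA runtime's device properties as they exist in the current toolkit:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    std::printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}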

GigaThread Scheduler

The updated thread scheduler offers two new features with Fermi that are worth discussing.  The first is vastly improved context switching performance – down to as little as 10-20 microseconds.  Context switching is used when the GPU needs to swap between applications, for example switching between graphics rendering and PhysX processing.  This could allow developers to use more of the GPU's compute power for non-graphics purposes if the performance penalty for doing so is reduced.

The second major update is with concurrent kernel execution which I like to think of as HyperThreading for the GPU.




This allows a program that only uses a small number of kernels (and thus SMs and CUDA cores) to better utilize the entire GPU by running multiple kernels simultaneously.  For this to work, the kernels need to belong to the same GPU context, so you would not be able to run both graphics and PhysX processing concurrently in this way.
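A minimal sketch of the idea, assuming the stream API that CUDA already exposes: two small kernels from the same context are launched into separate streams, leaving the hardware free to run them side by side when neither fills the whole GPU.

__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }   // hypothetical small kernel
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }   // hypothetical small kernel

int main()
{
    float *a, *b;
    cudaMalloc(&a, 64 * sizeof(float));
    cudaMalloc(&b, 64 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<1, 64, 0, s1>>>(a);   // each kernel occupies only one block...
    kernelB<<<1, 64, 0, s2>>>(b);   // ...so a Fermi-class GPU could overlap them

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}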

Final Thoughts

NVIDIA has shown us only the first taste of its new Fermi architecture today, and it claims to have radically adjusted the GPU's role, purpose and capability.  NVIDIA did not just add new execution units to the core (though it did do that) but also took the route of improving performance with a newer memory hierarchy: a configurable L1 cache, a global L2 cache and ECC support.  Double precision performance gets a big boost over the GT200 design, though we have yet to see how well it will compete with AMD's Evergreen in raw compute.



NVIDIA CEO holds up the first Fermi reference card


NVIDIA also continues to push its CUDA architecture and support for other programming models besides DirectCompute and OpenCL.  It would be hard to deny that NVIDIA has had success with its proprietary CUDA architecture in the professional and academic worlds, if not with the consumer.  Adding support for the C++ programming model will only further drive the NVIDIA architecture into this market.



Since this is a Tesla card, having only one video output is not a big deal.


From a gaming angle, which is obviously one of our primary focuses at PC Perspective, we don't yet know how the Fermi architecture will play out.  While I am doubtful that NVIDIA will be sharing any information about new products, frequencies, etc. during the GPU Tech Conference today, if we find anything out we will be sure to share it.  But even if clock rates remain the same as those of the current GT200, the architecture should perform damn well – after all, we have moved from 240 SPs to 512 SPs and gained a new 384-bit-wide GDDR5 memory bus.  Everything else at this point is up in the air.




We also don’t know how soon anyone, gamers or professionals, will actually get hardware based on the Fermi architecture.  If the persistent rumors are correct we are still looking at early 2010 for hardware – does that make this new design a “paper launch”?  More or less, but as a journalist and fan of technology I would rather have this type of information earlier rather than later.
2#
Posted on 2009-10-1 08:53

Is there a demo?


4#
Posted on 2009-10-1 08:56

AMD continues to sink.....

5#
Posted on 2009-10-1 08:59

Any benchmark results?

6#
发表于 2009-10-1 09:11 | 只看该作者
本帖最后由 HuaErZ 于 2009-10-1 09:26 编辑

看来GT300是主要针对intel larrbee设计了,在专业领域,AMD的产品可以不用出来混了.
回复 支持 反对

使用道具 举报

7#
Posted on 2009-10-1 09:22

Finally we get to see what the card looks like – count me in.
My GTX 295 purchase plan can be put on hold; I'll just wait for the GTX 380.

8#
Posted on 2009-10-1 09:32

A small show of support here – I'm honestly speechless about AMD's drivers.


10#
Posted on 2009-10-1 12:13

That 384-bit bus looks like it has been cut down; I'm seriously not happy about it.


12#
Posted on 2009-10-1 17:49

"Since this is a Tesla card, having only one video output is not a big deal."
Does that mean Fermi cards can't natively support HDMI?!

13#
Posted on 2009-10-1 18:14

What good is HDMI support anyway?


15#
Posted on 2009-10-2 10:45

"Since this is a Tesla card, having only one video output is not a big deal."
Does that mean Fermi cards can't natively support HDMI?!
只爱美女 posted on 2009-10-1 17:49

What does "native HDMI support" even mean?
Does having an HDMI port count as native support?

16#
Posted on 2009-10-2 11:08

The card doesn't look short. Getting it out quickly is what really matters.
