POPPUR爱换

Title: NVIDIA Tesla GPU computing solutions for HPC will be available August, 2007.

Author: Edison Time: 2007-6-21 10:17
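NVIDIA announces "Tesla", a G80-based GPU for HPC
From a PCI Express card to a 1U rack

David Kirk, the company's chief scientist, introducing the Tesla GPU Server

Announced June 20 (local time)

On June 20 (local time), NVIDIA of the US announced "Tesla", a GPU for HPC (High Performance Computing).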

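Tesla, the third GPU brand after GeForce and Quadro

It is the company's third GPU brand, following the consumer "GeForce" and the professional "Quadro". Its purpose, however, is not graphics: it is a coprocessor-style product for large-scale parallel computation.

The lineup consists of three models: the PCI Express card "Tesla GPU (C870)", the external box "Tesla GPU Deskside Supercomputer (D870)" containing two of these cards, and the 1U rack-mount "Tesla GPU Server (S870)" containing four cards. Prices are $1,499, $7,500 and $12,000 respectively.

The C870 hardware is almost identical to the GeForce 8800 GTX: the GPU contains 128 streaming processors (SPs) and connects to the PC over PCI Express x16. It differs in having no display interfaces such as DVI and in carrying 1.5GB of memory. Clock speeds have not been disclosed, but peak performance is said to reach 518 GFLOPS.

The D870 and S870 contain a PCI Express Gen 2 switch; a PCI Express Gen 2 adapter is installed in the host PC and the two are connected by an external cable. Versions of both products with twice as many GPUs are also planned.

Power supply capacity is up to 550W for the D870 and up to 800W for the S870. The S870 has only fanless heatsinks on its GPUs, which are cooled by fans mounted at the front of the chassis.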


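The Tesla logo	Tesla GPU (C870)	The traces remain on the board, but there are no display connectors. There is only one SLI connector, and it is unclear whether it can be used

Tesla GPU Deskside Supercomputer (D870)	Connects to the host PC by cable via a PCI Express Gen2 adapter	The Tesla GPU Server, with four C870s inside

Tesla's software platform uses the company's general-purpose programming model, "CUDA (Compute Unified Device Architecture)". CUDA includes a C compiler for the GPU; with only minor modifications to a C program, the CUDA compiler can divide the processing between the CPU and the GPU.

Andy Keane, general manager of the company's GPU Computing business, described the background to Tesla's development: "In the history of HPC, processing power has so far been raised with the CPU at the center: SIMD, multi-CPU, parallel CPUs, clustering. But floating-point performance is far higher on the GPU than on the CPU, and the GPU also excels at parallel data processing. So by giving access to the GPU through a common language, we can broaden the GPU's role."

The company's CEO, Jensen Huang, described Tesla as "the personal supercomputer that scientists have been waiting for".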

http://pc.watch.impress.co.jp/docs/2007/0621/nvidia.htm

Author: Edison Time: 2007-6-21 10:21

NVIDIA® Tesla™ C870 GPU computing processor is the first to bring a massively multi-threaded architecture to high performance computing (HPC) applications for scientists, engineers, and other technical professionals.
The Tesla C870 GPU computing processor transforms a standard system into a personal supercomputer with over 500 gigaflops of peak floating point performance.

With a 128-processor computing core, a C-language development environment for the GPU, a suite of developer tools, and the world’s largest ISV development community for GPU computing, the Tesla C870 GPU computing processor enables professionals to develop applications faster and to deploy them across multiple generations of processors.

The Tesla C870 GPU computing processor can be used in tandem with multi-core CPU systems to create a flexible solution for personal supercomputing.

Product	Tesla C870
Form Factor	ATX, 4.38" x 12.28"
# of Tesla GPUs	1
Total Dedicated Memory	1.5 GB GDDR3
Peak Flops	Over 500 gigaflops
Floating Point Precision	IEEE 754 single-precision floating point
Memory Interface	384-bit
Memory Bandwidth	76.8 GB/sec.
Max Power Consumption	170W
System Interface	PCI Express x16
Auxiliary Power Connectors	Yes (2)
Number of Slots	2
Thermal Solution	Active Fansink
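
The memory figures in the table above can be cross-checked against each other. The sketch below is only an illustration: it takes the 384-bit interface and the 76.8 GB/sec. bandwidth from the table and works out the effective transfer rate they imply. The function name is made up for this example, and decimal gigabytes are assumed.

```python
# Figures taken from the Tesla C870 spec table above.
BUS_WIDTH_BITS = 384
PEAK_BANDWIDTH_GB_S = 76.8  # assumption: decimal gigabytes (10^9 bytes)


def implied_transfer_rate_gt_s(bus_width_bits: int, bandwidth_gb_s: float) -> float:
    """Return the effective transfer rate (GT/s) implied by bus width and bandwidth."""
    bytes_per_transfer = bus_width_bits / 8  # 384 bits -> 48 bytes per transfer
    return bandwidth_gb_s / bytes_per_transfer


rate = implied_transfer_rate_gt_s(BUS_WIDTH_BITS, PEAK_BANDWIDTH_GB_S)
print(f"{rate:.2f} GT/s effective")
```

If the GDDR3 transfers data twice per clock, this would correspond to a memory clock of about 800 MHz. That is an inference from the table; the spec sheet does not state the memory clock.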

Author: Edison Time: 2007-6-21 10:22
Supporting Platforms

NVIDIA® Tesla™ certified system*
Microsoft® Windows® XP (32-bit)
Linux® (64-bit and 32-bit)
- Red Hat Enterprise Linux 3, 4 and 5
- SUSE 10.1, 10.2 and 10.3

NVIDIA Tesla Architecture

Massively-parallel computing architecture with 128 multi-threaded processors per GPU
Scalar thread processor with full integer and floating point operations
Thread Execution Manager enables thousands of concurrent threads per GPU
Parallel Data Cache enables processors to collaborate on shared information at local cache performance
Ultra-fast memory access with 76.8 GB/sec. peak bandwidth per GPU
IEEE 754 single-precision floating point

Scalable Solutions

Scalable from one to thousands of GPUs
Available in GPU computing processor, deskside supercomputer and 1U rack-mount GPU computing server

Software Development Tools

C language compiler, profiler and emulation mode for debugging
Standard numerical libraries for FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines)

Product Details

Tesla C870 GPU Computing Processor
- One GPU (128 thread processors)
- Over 500 gigaflops
- 1.5 GB dedicated memory
- Fits in one full-length, dual slot with one open PCI Express x16 slot
Tesla D870 Deskside Supercomputer
- Two GPUs (128 thread processors per GPU)
- Over 500 gigaflops per GPU
- 3 GB system memory (1.5 GB dedicated memory per GPU)
- Quiet operation (40dB) suitable for office environment
- Connects to host via cabling to a low power PCI Express x8 or x16 adapter card
- Optional rack mount kit
Tesla S870 GPU Computing Server
- Four GPUs (128 thread processors per GPU)
- Over 500 gigaflops per GPU
- 6 GB of system memory (1.5 GB dedicated memory per GPU)
- Standard 19”, 1U rack-mount chassis
- Connects to host via cabling to a low power PCI Express x8 or x16 adapter card
- Standard configuration: 1 PCI Express connector driving 4 GPUs
- Optional configuration: 2 PCI Express connectors driving 2 GPUs each

*for deskside system and server
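
As a rough cross-check of the product details above, the following sketch multiplies the per-GPU figures by the GPU count of each model. It is illustrative only: the 518 GFLOPS peak and the list prices come from the PC Watch article quoted in the first post, the class and field names are invented for this example, and the aggregate peak assumes perfect scaling across GPUs, which real workloads would not necessarily reach.

```python
from dataclasses import dataclass

PEAK_GFLOPS_PER_GPU = 518.0  # peak quoted in the PC Watch article
MEMORY_GB_PER_GPU = 1.5      # dedicated memory per GPU from the spec sheet


@dataclass(frozen=True)
class TeslaProduct:
    name: str
    gpus: int
    price_usd: int  # list price quoted in the PC Watch article

    @property
    def peak_gflops(self) -> float:
        # Assumes the per-GPU peak simply adds up across GPUs.
        return self.gpus * PEAK_GFLOPS_PER_GPU

    @property
    def memory_gb(self) -> float:
        return self.gpus * MEMORY_GB_PER_GPU

    @property
    def usd_per_gflops(self) -> float:
        return self.price_usd / self.peak_gflops


LINEUP = [
    TeslaProduct("C870", 1, 1499),
    TeslaProduct("D870", 2, 7500),
    TeslaProduct("S870", 4, 12000),
]

for p in LINEUP:
    print(f"{p.name}: {p.gpus} GPU(s), {p.memory_gb:.1f} GB, "
          f"{p.peak_gflops:.0f} GFLOPS peak, ${p.usd_per_gflops:.2f} per GFLOPS")
```

The memory totals reproduce the 3 GB and 6 GB listed for the D870 and S870.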

Author: ikinari Time: 2007-6-21 10:55
Notice: The author has been banned or deleted; the content is hidden automatically

Author: fineday Time: 2007-6-21 12:06
500+ GFLOPS?
Where does that come from?

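Author: Eji Time: 2007-6-21 12:21
One interesting thing about the C870: its PCB is full-length.
Because the box itself is PCI Express Gen2, the 1U server can hold four cards.

518 GFLOPS with everything enabled.... Looks like the MUL was already sorted out in A3 and is just locked away by the driver.... orz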

Author: iiiiuuuu Time: 2007-6-21 12:25
No DVI port, so it can't be used as a graphics card.

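Author: fineday Time: 2007-6-21 12:31

Originally posted by Eji at 2007-6-21 12:21
One interesting thing about the C870: its PCB is full-length.
Because the box itself is PCI Express Gen2, the 1U server can hold four cards.

518 GFLOPS with everything enabled.... Looks like the MUL was already sorted out in A3 and is just locked away by the driver.... orz

:mad: I strongly object to that.

At the very least, it should be enabled on the Ultra.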

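Author: Eji Time: 2007-6-21 12:43

Originally posted by fineday at 2007-6-21 12:31

:mad: I strongly object to that.

At the very least, it should be enabled on the Ultra.

Even if it's enabled, it can't necessarily be put to good use.... It's the same situation as R600.
So they simply restrict it to being enabled only on the Stream Processor products....
G86 can enable it while G84 can't, possibly because G86 is so weak that it had to be enabled.

It could also lead to early and late G80s performing differently (this really should be something changed only in A3), so they might as well lock it.... although every Ultra should be A3. orz

Let's see whether they change this along the way when Tesla officially launches in August....

[ Last edited by Eji at 2007-6-21 12:46 ]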

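Author: cool_exorcist Time: 2007-6-21 12:59
Everyone above is an expert; I basically can't follow. All I know is that this card can't be used as a graphics card because it has no output ports. This is the so-called general-purpose computing card, right?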

Author: Edison Time: 2007-6-21 13:08
:charles: Large pictures:
[attach]757900[/attach]
[attach]757901[/attach]
[attach]757902[/attach]

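Author: Eji Time: 2007-6-21 13:43
That said, the GPGPU programming style has long been constrained by the structure of the GPU, so GPGPU developers are usually forced to learn that structure, whereas CELL and Larrabee developers probably won't have this trouble.... Whether AMD's CPU/GPU abstraction layer can improve the problem remains to be seen, but in principle AMD can give up GPU efficiency to strengthen CPU capability, while NVIDIA cannot abandon the GPU structure, because that is their main product.

So personally I think NVIDIA is at the biggest disadvantage in the HPC market, followed by AMD.
(It depends on whether AMD is willing to let GPU performance drop to accommodate the CPU side.... that is getting harder and harder to tell; build two architectures? The company would need the resources for that.)

Under normal circumstances, the HPC market will come down to just CELL versus Larrabee....

[ Last edited by Eji at 2007-6-21 13:55 ]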

Author: 来不及思考 Time: 2007-6-21 15:37
Notice: The author has been banned or deleted; the content is hidden automatically

Author: lqf3dnow Time: 2007-6-21 15:43
How is this different from CUDA?

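Author: Eji Time: 2007-6-21 16:01

Originally posted by lqf3dnow at 2007-6-21 15:43
How is this different from CUDA?

The outputs are removed (the NVIO is left unsoldered), so it has no display capability, and it doesn't use SLI; structurally it is just a plain external box, like the Quadro Plex. It is the same as the AMD Stream Processor being pretty much an R580 with the display connectors pulled. 1.5GB of memory, full-length PCB. So it's not that G80 has no full-length PCB....

Worth noting: NVIDIA staff at the event mentioned that there will be a dual-GPU version; no idea whether it is Tesla-only or whether there will really be an 8950GX2.

----
By the way, am I the only one who sees a full-length PCB and wants a full-length G80?

[ Last edited by Eji at 2007-6-21 16:23 ]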

Author: Edison Time: 2007-6-21 16:45

w00t)

Tesla. Nikola Tesla (1856-1943) was born in Croatia and immigrated to America. He contributed to the development of electrical technology. Here he is displayed on a black and white scan of a 10 Billion Dinar note from the period of the great inflation just before the breakup of Yugoslavia. (The Europeans call it 10 Milliard. In any language that's 10^10! A good reason for using scientific notation.)

Author: zzhang Time: 2007-6-21 17:03
Stunning. But I wonder whether there is any scenario that can directly make use of 4-way C870s. Are there any applications for it?

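Author: Eji Time: 2007-6-21 17:16

Originally posted by zzhang at 2007-6-21 17:03
Stunning. But I wonder whether there is any scenario that can directly make use of 4-way C870s. Are there any applications for it?

Write it yourself.... XD

Because CUDA is set up so that each GPU gets its own host thread, the UIUC course pairs a quad-core with three G80s, with each GPU reading its own data stream. So I expect 4-way G80 would need a similar design.
By the way, the external enclosure is probably just for stability... it uses an adapter card plugged into the PC's PCI Express x16 slot, so to the system there isn't much difference.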
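
The one-host-thread-per-GPU arrangement described above can be sketched roughly as follows. This is a plain Python simulation, not CUDA code: no GPU is touched, and `load_stream`, `run_on_device` and `run_all` are hypothetical names invented for the illustration. In a real program each worker would bind to its device and launch kernels there.

```python
import threading

NUM_GPUS = 3  # the UIUC setup mentioned above: three G80s on a quad-core host


def load_stream(device_id: int) -> list[float]:
    """Hypothetical stand-in for the data stream that belongs to one device."""
    return [float(device_id * 10 + i) for i in range(4)]


def run_on_device(device_id: int, results: dict[int, float]) -> None:
    """Hypothetical per-GPU worker running in its own host thread."""
    data = load_stream(device_id)
    # Placeholder for the work a kernel would do on this device's data.
    results[device_id] = sum(x * x for x in data)


def run_all(num_gpus: int) -> dict[int, float]:
    """Start one host thread per device and wait for all of them to finish."""
    results: dict[int, float] = {}
    threads = [
        threading.Thread(target=run_on_device, args=(device_id, results))
        for device_id in range(num_gpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_all(NUM_GPUS)
print(results)
```

Each thread writes only to its own key, so the workers do not need to coordinate with each other; a 4-way setup would presumably just raise `NUM_GPUS` to 4.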

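Author: zzhang Time: 2007-6-21 17:35
But I think one fatal flaw of this approach is that power consumption is too high and the compute-per-watt ratio is too poor. The integrated many-core approach of AMD and Intel is more suitable: building GPUs the way CPUs are built. :p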

Author: Edison Time: 2007-6-21 17:39
G80 is itself a multi-core design; the 128 SPs can be regarded as 128 complete cores.

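Author: zzhang Time: 2007-6-21 17:47

Originally posted by Edison at 2007-6-21 17:39
G80 is itself a multi-core design; the 128 SPs can be regarded as 128 complete cores.

That isn't the same level of concept as Intel's multi-core GPU design or Cell's multiple cores, is it? My feeling is that in Intel's approach each core is something like a miniature G80.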

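Author: fineday Time: 2007-6-21 17:58

Originally posted by Eji at 2007-6-21 12:43

Even if it's enabled, it can't necessarily be put to good use.... It's the same situation as R600.
So they simply restrict it to being enabled only on the Stream Processor products....
G86 can enable it while G84 can't, possibly because G86 is so weak that it had to be enabled.

It could also lead to early and late G80s ...

:p Of course, that's true.
But I think that if the MUL problem was only discovered after manufacturing, there should still be good reason to enable it.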

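Author: aibo Time: 2007-6-21 19:15

Originally posted by fineday at 2007-6-21 17:58

:p Of course, that's true.
But I think that if the MUL problem was only discovered after manufacturing, there should still be good reason to enable it.

I think the biggest reason would be the threat from AMD, but right now that reason doesn't hold. And if they enabled it now, it would pose something of a threat to their own next-generation products.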

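Author: Eji Time: 2007-6-21 21:42

Originally posted by zzhang at 2007-6-21 17:35
But I think one fatal flaw of this approach is that power consumption is too high and the compute-per-watt ratio is too poor. The integrated many-core approach of AMD and Intel is more suitable: building GPUs the way CPUs are built. :p

Shall we work it out?

The G80 spec above excludes the MUL. If the MUL is included, SP FLOPS per watt becomes 2.9 GFLOPS and per mm^2 becomes 1.08 GFLOPS, so while its die-size efficiency isn't necessarily the highest, its performance per watt is.
As you can see, although the die grew substantially to support DX10 and the unified shader architecture, G80's TDP control is actually remarkable; its ratio of performance to power consumption is currently the best.
R600's theoretical numbers are really not much different from G80 at Tesla specs (both around 500 GFLOPS). R600 has the per-mm^2 advantage that 80nm brings (about 1.14 GFLOPS), but it loses on performance per watt.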
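
The efficiency figures in this post can be turned around to see what power and die area they imply. The sketch below is only a back-of-the-envelope check: the peak formulas (128 SPs x 3 flops x 1.35 GHz for the 8800 GTX, 64 x 5 x 2 x 750 MHz for R600) are the ones quoted later in this thread, the variable names are invented, and the resulting watt and mm^2 values are derived from the quoted ratios rather than taken from any spec sheet.

```python
# Peak figures, using the formulas quoted later in this thread.
G80_PEAK_WITH_MUL = 128 * 3 * 1.35  # GFLOPS, MAD + MUL counted (8800 GTX clocks)
G80_PEAK_MAD_ONLY = 128 * 2 * 1.35  # GFLOPS, MUL not counted
R600_PEAK = 64 * 5 * 2 * 0.75       # GFLOPS

# Efficiency figures quoted in this post.
G80_GFLOPS_PER_WATT = 2.9
G80_GFLOPS_PER_MM2 = 1.08
R600_GFLOPS_PER_MM2 = 1.14

# Power and area implied by those ratios (derived values, not official specs).
g80_implied_power_w = G80_PEAK_WITH_MUL / G80_GFLOPS_PER_WATT
g80_implied_area_mm2 = G80_PEAK_WITH_MUL / G80_GFLOPS_PER_MM2
r600_implied_area_mm2 = R600_PEAK / R600_GFLOPS_PER_MM2

# Efficiency at the same implied power if the MUL is not counted.
g80_mad_only_per_watt = G80_PEAK_MAD_ONLY / g80_implied_power_w

print(f"G80 implied power: {g80_implied_power_w:.1f} W")
print(f"G80 implied die area: {g80_implied_area_mm2:.1f} mm^2")
print(f"R600 implied die area: {r600_implied_area_mm2:.1f} mm^2")
print(f"G80 without MUL: {g80_mad_only_per_watt:.2f} GFLOPS per watt")
```

On these numbers, counting the MUL raises G80's per-watt figure by half, since the peak goes from 2 to 3 flops per SP per clock at the same power.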

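How much extra power Larrabee and Fusion will burn because of x86 is still unknown. For now, the only thing that looks like it can really beat G8x is probably the CELL SPE, which likewise uses a dedicated ISA.

[ Last edited by Eji at 2007-6-22 04:35 ]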

Author: Edison Time: 2007-6-22 10:53
We're basically just fishing to see if you'll support GPU computing equally across the entire business, not just with the big guys using Tesla. Are you going to look after everybody?
Absolutely, yes, and that decision is made by business unit and how they want to invest in their customer base and their ISV base to allow that to happen and opportunities to happen.
So let me answer part of the underlying question directly. The single-precision capability for example is available across all of the CUDA-supporting product lines, and will continue to be available everywhere, and that's the model that goes forward. So take the G84 and G86, we support CUDA there too. Now there's a separate roadmap that the computing products will follow, so double precision is a feature that clearly maps into HPC and GPU computing, but we can't see much use for it in the consumer space, so it's something that'll be available with Tesla and the high-end of the Quadro product line, but below that it'll only be single precision and DP will not be on the die.

So you must be happy to split things up like that, where DP is only available on Tesla or certain Quadro and GeForce doesn't get that. You're happy with that?
Yeah, that maps to our product stack and business and markets that we're in. Professional customers don't buy mid-range boards.

http://www.beyond3d.com/content/interviews/41/3

Looks like DP and GeForce are not meant to be.

But NVIDIA didn't say whether this DP is implemented on G92.

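Author: rongronglulu Time: 2007-6-22 19:32

Originally posted by Eji at 2007-6-21 12:43

Even if it's enabled, it can't necessarily be put to good use.... It's the same situation as R600.
So they simply restrict it to being enabled only on the Stream Processor products....
G86 can enable it while G84 can't, possibly because G86 is so weak that it had to be enabled.

It could also lead to early and late G80s ...

No. Someone said this a long time ago: that MUL can only be used for general-purpose computing; it simply doesn't work when the chip is used as a GPU.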

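Author: clawhammer Time: 2007-6-22 19:38

Originally posted by rongronglulu at 2007-6-22 19:32

No. Someone said this a long time ago: that MUL can only be used for general-purpose computing; it simply doesn't work when the chip is used as a GPU.

Who says so..... It's just that the performance gain is fairly small.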

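Author: XXR600 Time: 2007-6-23 00:06

Originally posted by rongronglulu at 2007-6-22 19:32

No. Someone said this a long time ago: that MUL can only be used for general-purpose computing; it simply doesn't work when the chip is used as a GPU.

As a GPU it gains at least 10%, just not as much as theory suggests.
C has tested it: an 8500 with the MUL enabled shows a gain in Far Cry.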

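Author: XXR600 Time: 2007-6-23 00:20

Originally posted by zzhang at 2007-6-21 17:47

That isn't the same level of concept as Intel's multi-core GPU design or Cell's multiple cores, is it? My feeling is that in Intel's approach each core is something like a miniature G80.

Are you saying each of Larrabee's 48 cores also has a large number of ALUs running in parallel? Doesn't that seem over the top?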

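Author: Eji Time: 2007-6-23 01:14

Originally posted by XXR600 at 2007-6-23 00:20
Are you saying each of Larrabee's 48 cores also has a large number of ALUs running in parallel? Doesn't that seem over the top?

That is really just the GPU organization of the past, built in units of quads (2x2).
G80's ALU count per quad has hardly changed since the G70 era, and Larrabee's 16-way vector for 4 threads may well give the same impression.
So how to count Larrabee's cores is open to the same kind of dispute: are the 48 cores a collection of 6 4-way 4D units, or truly 48 16-way ones?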

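Author: Eji Time: 2007-6-23 01:17

Originally posted by rongronglulu at 2007-6-22 19:32
No. Someone said this a long time ago: that MUL can only be used for general-purpose computing; it simply doesn't work when the chip is used as a GPU.

It probably isn't that clear-cut. From the UIUC course materials you can see that in GPGPU there are many cases that reach their best result only after several rounds of optimization.
Games aren't written that way these days (everyone is fighting for time to market).

So under GPGPU, R600 and G80 really shouldn't differ much; both sides should have plenty of cases that can get almost to the theoretical figure.
But in graphics, co-issue really hasn't been much use lately...._A_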

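Author: XXR600 Time: 2007-6-23 02:31

Originally posted by Eji at 2007-6-23 01:17

It probably isn't that clear-cut. From the UIUC course materials you can see that in GPGPU there are many cases that reach their best result only after several rounds of optimization.
Games aren't written that way these days (everyone is fighting for time to market).

So under GPGPU, R600 and G80 really shouldn't differ much; both ...

If both R600 and G80 hit their theoretical figures, G80 has 518GF? R600 475GF?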

Author: Eji Time: 2007-6-24 16:03
Theoretical, huh....
R600: 64 shaders x 5 x 2 (MAD) x 750MHz = 480 GFLOPS
G80: 128 SPs x 3 (MAD+MUL) x 1350MHz = 518.4 GFLOPS (8800GTX); the 88U runs at 1.5GHz, so it has 576 GFLOPS
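
The arithmetic above can be reproduced with a small helper. This is a minimal sketch with made-up function names: it counts a MAD as 2 flops per clock and MAD+MUL as 3, as the post does, and also works backwards from the 475 GFLOPS asked about earlier in the thread to the R600 clock that figure would imply.

```python
def peak_gflops(units: int, flops_per_unit_per_clock: int, clock_mhz: float) -> float:
    """Theoretical peak in GFLOPS: units x flops per clock x clock rate."""
    return units * flops_per_unit_per_clock * clock_mhz / 1000.0


def implied_clock_mhz(gflops: float, units: int, flops_per_unit_per_clock: int) -> float:
    """Clock (MHz) that a given peak figure would imply for the same unit count."""
    return gflops * 1000.0 / (units * flops_per_unit_per_clock)


r600 = peak_gflops(64 * 5, 2, 750)       # 64 shaders x 5 ALUs, MAD = 2 flops
g80_gtx = peak_gflops(128, 3, 1350)      # 128 SPs, MAD + MUL = 3 flops
g80_ultra = peak_gflops(128, 3, 1500)    # 8800 Ultra shader clock

print(f"R600: {r600:.1f} GFLOPS")
print(f"G80 (8800 GTX): {g80_gtx:.1f} GFLOPS")
print(f"G80 (8800 Ultra): {g80_ultra:.1f} GFLOPS")
print(f"Clock implied by 475 GFLOPS on R600: {implied_clock_mhz(475, 64 * 5, 2):.1f} MHz")
```

The 475 GFLOPS figure therefore corresponds to a clock slightly below the 750MHz used in the post, which would explain the small gap between the two R600 numbers.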

Welcome to POPPUR爱换 (https://we.poppur.com/)