POPPUR爱换

Views: 4395 | Replies: 9

Larrabee vs. Nvidia, MIMD vs. SIMD by Greg Pfister

1#
Posted 2009-7-7 14:28
http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html

I'd guess everyone has heard something of the large, public, flame war that erupted between Intel and Nvidia about whose product is or will be superior: Intel Larrabee, or Nvidia's CUDA platforms. There have been many detailed analyses posted about details of these, such as who has (or will have) how many FLOPS, how much bandwidth per cycle, and how many nanoseconds latency when everything lines up right. Of course, all this is “peak values,” which still means “the values the vendor guarantees you cannot exceed” (Jack Dongarra’s definition), and one can argue forever about how much programming genius or which miraculous compiler is needed to get what fraction of those values.

Such discussion, it seems to me, ignores the elephant in the room. I think a key point, if not the key point, is that this is an issue of MIMD (Intel Larrabee) vs. SIMD (Nvidia CUDA).

I’d like to point to a Wikipedia article on those terms, from Flynn’s taxonomy, but their article on SIMD has been corrupted by Intel and others’ redefinition of SIMD to “vector.” I mean the original. So this post becomes much longer.

MIMD (Multiple Instruction, Multiple Data) refers to a parallel computer that runs an independent, separate program – that’s the “multiple instruction” part – on each of its simultaneously-executing parallel units. SMPs and clusters are MIMD systems. You have multiple, independent programs barging along, doing things that may or may not have anything to do with each other. When they are related, they barge into each other at least occasionally, hopefully as intended by the programmer, to exchange data or to synchronize their otherwise totally separate operation. Quite regularly the barging is unintended, leading to a wide variety of insidious data- and time-dependent bugs.
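To make the MIMD model concrete, here is a minimal C++ host-code sketch (the names worker and shared_total are mine, purely for illustration): two threads run their own instruction streams on their own data, and meet only at one deliberately synchronized point.

    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Two independent instruction streams ("multiple instruction") on their own
    // data ("multiple data"); the mutex-protected total is the one intentional
    // place where they barge into each other.
    static long shared_total = 0;
    static std::mutex total_lock;

    void worker(const std::vector<int>& my_data) {
        long local = 0;
        for (int v : my_data) local += v;            // private, unsynchronized work
        std::lock_guard<std::mutex> g(total_lock);   // the one deliberate sync point
        shared_total += local;
    }

    int main() {
        std::vector<int> a(1000, 1), b(500, 2);
        std::thread t1(worker, std::cref(a));        // runs its own program flow
        std::thread t2(worker, std::cref(b));        // entirely independent of t1
        t1.join(); t2.join();
        std::printf("shared_total = %ld\n", shared_total);  // 1000 + 1000 = 2000
    }

Drop the lock_guard and it will still print 2000 most of the time, which is precisely the insidious, timing-dependent kind of bug the paragraph above warns about.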

SIMD (Single Instruction, Multiple Data) refers to a parallel computer that runs the EXACT SAME program – that’s the “single instruction” part – on each of its simultaneously-executing parallel units. When ILLIAC IV, the original 1960s canonical SIMD system, basically a rectangular array of ALUs, was originally explained to me late in grad school (I think possibly by Bob Metcalfe), it was put this way:

Some guy sits in the middle of the room, shouts ADD!, and everybody adds.


I was a programming language hacker at the time (LISP derivatives), and I was horrified. How could anybody conceivably use such a thing? Well, first, it helps that when you say ADD! you really say something like “ADD Register 3 to Register 4 and put the result in Register 5,” and everybody has their own set of registers. That at least lets everybody have a different answer, which helps. Then you have to bend your head so all the world is linear algebra: Add matrix 1 to matrix 2, with each matrix element in a different parallel unit. Aha. Makes sense. For that. I guess. (Later I wrote about 150 KLOC of APL, which bent my head adequately.)
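In today's terms, a CUDA kernel is probably the closest thing to that mental model: one instruction stream, and every parallel unit applies it to its own registers and its own element. A minimal sketch (kernel and variable names are mine, for illustration only):

    #include <cstdio>
    #include <cuda_runtime.h>

    // One broadcast instruction stream ("ADD!"); each thread has its own
    // registers and its own element, so everybody gets a different answer.
    __global__ void everybody_adds(const float* r3, const float* r4, float* r5, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) r5[i] = r3[i] + r4[i];   // the same ADD, different data per unit
    }

    int main() {
        const int n = 1024;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
        everybody_adds<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        std::printf("c[10] = %.1f\n", c[10]);   // 10 + 20 = 30
        cudaFree(a); cudaFree(b); cudaFree(c);
    }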

Unfortunately, the pure version doesn’t make quite enough sense, so Burroughs, Cray, and others developed a close relative called vector processing: You have a couple of lists of values, and say ADD ALL THOSE, producing another list whose elements are the pairwise sums of the originals. The lists can be in memory, but dedicated registers (“vector registers”) are more common. Rather than pure parallel execution, vectors lend themselves to pipelining of the operations. That doesn’t do it all in the same amount of time – longer vectors take longer – but it’s a lot more parsimonious of hardware. Vectors also provide a lot more programming flexibility, since rows, columns, diagonals, and other structures can all be vectors. However, you still spend a lot of thought lining up all those operations so you can do them in large batches. Notice, however, that it’s a lot harder (but not impossible) for one parallel unit (or pipelined unit) to unintentionally barge into another’s business. SIMD and vector, when you can use them, are a whole lot easier to debug than MIMD because SIMD simply can’t exhibit a whole range of behaviors (bugs) possible with MIMD.

Intel’s SSE and variants, as well as AMD and IBM’s equivalents, are vector operations. But the marketers apparently decided “SIMD” was a cooler name, so this is what is now often called SIMD.
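For concreteness, here is roughly what those vector operations look like from C++ with SSE intrinsics: one ADDPS instruction produces four pairwise sums at once. Function and variable names are mine, and the sketch assumes the length is a multiple of 4.

    #include <cstdio>
    #include <xmmintrin.h>   // SSE intrinsics

    // Pairwise-add two lists four floats at a time: "vector" in the original
    // sense, even though the instruction sets are marketed as SIMD.
    void vector_add(const float* x, const float* y, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(x + i);             // load 4 floats
            __m128 b = _mm_loadu_ps(y + i);
            _mm_storeu_ps(out + i, _mm_add_ps(a, b));   // one ADDPS = 4 adds
        }
    }

    int main() {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float out[8];
        vector_add(x, y, out, 8);
        std::printf("%.1f %.1f\n", out[0], out[7]);   // 9.0 9.0
    }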

Bah, humbug. This exercises one of my pet peeves: Polluting the language for perceived gain, or just from ignorance, by needlessly redefining words. It damages our ability to communicate, causing people to have arguments about nothing.

Anyway, ILLIAC IV, the CM-1 Connection Machine (which, bizarrely, worked on lists – elements distributed among the units), and a variety of image processing and hard-wired graphics processors have been rather pure SIMD. Clearspeed’s accelerator products for HPC are a current example.

Graphics, by the way, is flat-out crazy mad for linear algebra. Graphics multiplies matrices multiple times for each endpoint of each of thousands or millions of triangles; then, in rasterizing, for each scanline across each triangle it interpolates a texture or color value, with additional illumination calculations involving normals to the approximated surface, doing the same operations for each pixel. There’s an utterly astonishing amount of repetitive arithmetic going on.
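As a rough sketch of the per-vertex half of that arithmetic (layout and names are mine, and real pipelines do far more): the same 4x4 transform applied to every vertex, one thread per vertex.

    #include <cuda_runtime.h>

    // Same 4x4 matrix (column-major, in m) applied to every vertex position.
    // This is exactly the shape of work that SIMD/vector hardware eats up.
    __global__ void transform_vertices(const float* m, const float4* in,
                                       float4* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 v = in[i];
        out[i] = make_float4(
            m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12]*v.w,
            m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13]*v.w,
            m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w,
            m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w);
    }
    // Launch sketch: transform_vertices<<<(n + 255) / 256, 256>>>(d_m, d_in, d_out, n);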

Now that we’ve got SIMD and MIMD terms defined, let’s get back to Larrabee and CUDA, or, strictly speaking, the Larrabee architecture and CUDA. (I’m strictly speaking in a state of sin when I say “Larrabee or CUDA,” since one’s an implementation and the other’s an architecture. What the heck, I’ll do penance later.)

Larrabee is a traditional cache-coherent SMP, programmed as a shared-memory MIMD system. Each independent processor does have its own vector unit (SSE stuff), but all 8, 16, 24, 32, or however many cores it has are independent executors of programs, as are the threads within those cores. You program it like MIMD, working in each program to batch together operations for each program’s vector (SIMD) unit.
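A hedged sketch of that programming model in ordinary C++ (function names and the even-split assumption are mine): MIMD on the outside, with each thread batching its own chunk of work for the core's SSE unit on the inside.

    #include <thread>
    #include <vector>
    #include <xmmintrin.h>

    // Inner loop: work batched for one core's vector (SSE) unit.
    // Assumes the chunk length is a multiple of 4 to keep the sketch short.
    void chunk_add(const float* x, const float* y, float* out, int n) {
        for (int i = 0; i < n; i += 4)
            _mm_storeu_ps(out + i,
                          _mm_add_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
    }

    // Outer structure: one independent thread of control per core (MIMD),
    // each handed its own slice of the data. Assumes ncores divides n evenly.
    void parallel_add(const float* x, const float* y, float* out, int n, int ncores) {
        std::vector<std::thread> pool;
        int chunk = n / ncores;
        for (int c = 0; c < ncores; ++c)
            pool.emplace_back(chunk_add, x + c * chunk, y + c * chunk,
                              out + c * chunk, chunk);
        for (auto& t : pool) t.join();
    }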

CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls this SIMT, where the “T” stands for “thread”; I refuse to look up the rest, because there is already a perfectly good term for it: MSIMD, for Multiple SIMD. (See pet peeve above.) The instructions it can do are organized around a graphics pipeline, which adds its own set of issues that I won’t get into here.
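To see the "basically SIMD" point in code: in a CUDA kernel every thread nominally has its own control flow, but within a warp the hardware issues one instruction at a time, so a data-dependent branch runs both sides with lanes masked off. The independently scheduled thread blocks are the "separate collections." A minimal sketch (the kernel name is mine):

    #include <cuda_runtime.h>

    // Launched over many blocks; blocks are scheduled independently, but within
    // a warp execution is lockstep, so this if/else executes as both paths with
    // lanes masked (divergence), not as two truly independent instruction streams.
    __global__ void clamp_or_scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (data[i] < 0.0f)
            data[i] = 0.0f;             // these lanes run while the others idle
        else
            data[i] = 2.0f * data[i];   // ...then the roles swap
    }
    // Launch sketch: clamp_or_scale<<<(n + 255) / 256, 256>>>(d_data, n);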

Which is better? Here are basic arguments:
For a given technology, SIMD always has the advantage in raw peak operations per second. After all, it mainly consists of as many adders, floating-point units, shaders, or what have you, as you can pack into a given area. There’s little other overhead. All the instruction fetching, decoding, sequencing, etc., are done once, and shouted out, um, I mean broadcast. The silicon is mainly used for function, the business end of what you want to do. If Nvidia doesn’t have gobs of peak performance over Larrabee, they’re doing something really wrong. Engineers who have never programmed don’t understand why SIMD isn’t absolutely the cat’s pajamas.

On the other hand, there’s the problem of batching all those operations. If you really have only one ADD to do, on just two values, and you really have to do it before you do a batch (like, it’s testing for whether you should do the whole batch), then you’re slowed to the speed of one single unit. This is not good. Average speeds get really screwed up when you average with a zero. Also not good is the basic need to batch everything. My own experience in writing a ton of APL, a language where everything is a vector or matrix, is that a whole lot of APL code is written that is basically serial: One thing is done at a time.
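The "average with a zero" effect is just Amdahl's law wearing a different hat. A tiny sketch with purely illustrative numbers (nothing here is measured):

    #include <cstdio>

    // Effective speedup when a fraction of the work is stuck at the speed of
    // one unit and the rest runs across `width` parallel lanes.
    double effective_speedup(double serial_fraction, double width) {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / width);
    }

    int main() {
        // Even 5% of "should I run this batch at all?" serial work caps a
        // 256-wide engine at under 19x. Numbers are illustrative only.
        std::printf("%.1fx\n", effective_speedup(0.05, 256.0));   // ~18.6x
    }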

So Larrabee should have a big advantage in flexibility, and also familiarity. You can write code for it just like SMP code, in C++ or whatever your favorite language is. You are potentially subject to a pile of nasty bugs that aren’t there under SIMD, but if you religiously stick to using only parallel primitives pre-programmed by some genius chained in the basement, you’ll be OK.

[Here’s some free advice. Do not ever even program a simple lock for yourself. You’ll regret it. Case in point: A friend of mine is CTO of an Austin company that writes multithreaded parallel device drivers. He’s told me that they regularly hire people who are really good, highly experienced programmers, only to let them go because they can’t handle that kind of work. Granted, device drivers are probably a worst-case scenario among worst cases, but nevertheless this shows that doing it right takes a very special skill set. That’s why they can bill about $190 per hour.]
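On the "never write your own lock" advice, a cautionary sketch of why: the hand-rolled flag below looks like a lock, but its test and its set are separate, non-atomic, unordered operations. The std::mutex version is what the genius in the basement gives you. (The broken version is deliberately broken; do not use it.)

    #include <mutex>

    // Looks like a lock, is not one: two threads can both observe locked == false
    // and both "acquire" it, and nothing stops the compiler or CPU from reordering
    // around the plain bool. Shown only as a warning.
    bool locked = false;
    void broken_lock()   { while (locked) { /* spin */ } locked = true; }
    void broken_unlock() { locked = false; }

    // The pre-programmed primitive instead: correct acquire/release semantics,
    // debugged by someone else.
    std::mutex m;
    void safe_increment(long& counter) {
        std::lock_guard<std::mutex> g(m);
        ++counter;
    }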

But what about the experience with these architectures in HPC? We should be able to say something useful about that, since MIMD vs. SIMD has been a topic forever in HPC, where forever really means back to ILLIAC days in the late 60s.
It seems to me that the way Intel's headed corresponds to how that topic finally shook out: A MIMD system with, effectively, vectors. This is reminiscent of the original, much beloved, Cray SMPs. (Well, probably except for Cray’s even more beloved memory bandwidth.) So by the lesson of history, Larrabee wins.

However, that history played out over a time when Moore’s Law was producing a 45% CAGR in performance. So if you start from basically serial code, which is the usual place, you just wait. It will go faster than the current best SIMD/vector/offload/whatever thingy in a short time and all you have to do is sit there on your dumb butt. Under those circumstances, the very large peak advantage of SIMD just dissipates, and doing the work to exploit it simply isn’t worth the effort.

Yo ho. Excuse me. We’re not in that world any more. Clock rates aren’t improving like that any more; they’re virtually flat. But density improvement is still going strong, so those SIMD guys can keep packing more and more units onto chips.

Ha, right back at ‘cha: MIMD can pack more of their stuff onto chips, too, using the same density. But… It’s not sit on your butt time any more. Making 100s of processors scale up performance is not like making 4 or 8 or even 64 scale up. Providing the same old SMP model can be done, but will be expensive and add ever-increasing overhead, so it won’t be done. Things will trend towards the same kinds of synch done in SIMD.

Furthermore, I've seen game developer interviews where they strongly state that Larrabee is not what they want; they like GPUs. They said the same when IBM had a meeting telling them about Cell, but then they just wanted higher clock rates; presumably everybody's beyond that now.

Pure graphics processing isn’t the end point of all of this, though. For game physics, well, maybe my head just isn't built for SIMD; I don't understand how it can possibly work well. But that may just be me.
If either doesn't win in that game market, the volumes won't exist, and how well it does elsewhere won't matter very much. I'm not at all certain Intel's market position matters; see Itanium. And, of course, execution matters. There Intel at least has a (potential?) process advantage.

I doubt Intel gives two hoots about this issue, since a major part of their motivation is to ensure that the x86 architecture rules the world everywhere.

But, on the gripping hand, does this all really matter in the long run? Can Nvidia survive as an independent graphics and HPC vendor? More density inevitably will lead to really significant graphics hardware integrated onto silicon with the processors, so it will be “free,” in the same sense that Microsoft made Internet Explorer free, which killed Netscape. AMD sucked up ATI for exactly this reason. Intel has decided to build the expertise in house, instead, hoping to rise above their prior less-than-stellar graphics heritage.

My take for now is that CUDA will at least not get wiped out by Larrabee for the foreseeable future, just because Intel no longer has Moore’s 45% CAGR on its side. Whether Nvidia survives as a company depends on many things not relevant here, and on how soon embedded graphics becomes “good enough” for nearly everybody, and "good enough" for HPC.


Another interesting weblog post of his: Twilight of the GPU?
2#
Posted 2009-7-7 15:27
It's all in English...

3#
Posted 2009-7-7 21:18
Blocked by the Great Firewall / Green Dam; I can't see anything.

4#
Posted 2009-7-7 21:36
1# Edison
blogspot is blocked, I can't view it.

5#
Posted 2009-7-7 22:20
I can see it; the views are pretty much the same as what's been discussed here.... That said, when it comes to software I'd still rate Intel's capability as half-crippled. If Intel is pinning its hopes on game developers, it will probably have to shell out a lot of money.

RacingPHT (this user has been deleted)
6#
Posted 2009-7-9 10:06
Notice: the author has been banned or deleted; the content has been automatically hidden.

7#
Thread starter | Posted 2009-7-9 12:53
Strictly speaking, CUDA is not MIMD; at least for now it cannot run multiple kernels concurrently, though that is also partly a matter of current driver constraints.

RacingPHT (this user has been deleted)
8#
Posted 2009-7-9 14:47
Notice: the author has been banned or deleted; the content has been automatically hidden.

9#
Posted 2009-7-9 16:55
This post was last edited by samsung on 2009-7-9 16:56

4# Prescott


Twilight of the GPU?
If a new offering really works, it may substantially shrink the market for gaming consoles and high-end gamer PCs. Their demo runs Crysis – a game known to melt down extreme PCs – on a Dell Studio 15 with Intel's minimalist integrated graphics. Or on a Mac. Or a TV, with their little box for the controls. The markets for Nvidia CUDA, Intel Larrabee, IBM Cell, and AMD Fusion are all impacted. And so much for cheap MFLOPS for HPC, riding on huge gamer GPU volumes.
This game-changer – pun absolutely intended – is Onlive. It came out of stealth mode a few days ago with an announcement and a demo presentation (more here and here) at the 2009 Game Developers' Conference. It's in private beta now, and scheduled for public beta in the summer followed by full availability in winter 2009. Here's a shot from the demo showing some Crysis player-versus-player action on that Dell:

What Onlive does seems kind of obvious: Run the game on a farm/cloud that hosts all the hairy graphics hardware, and stream compressed images back to the players' systems. Then the clients don't have to do any lifting heavier than displaying 720p streaming video, at 60 frames per second, in a browser plugin. If you're thinking "Cloud computing meets gaming," well, so are a lot of other people. It's true, for some definitions of cloud computing. (Of course, there are so many definitions that almost anything is true for some definition of cloud computing.) Also, it's the brainchild of Steve Perlman, creator of Web TV, now known as MSN TV.
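A conceptual sketch of the client side of that split, in plain C++: the three helpers are stand-ins for a real input/network/codec stack and are stubbed out here only so the sketch compiles; nothing about Onlive's actual protocol is known from the post.

    #include <cstdint>
    #include <vector>

    struct Input { int key = 0; int mouse_dx = 0, mouse_dy = 0; };
    using Frame = std::vector<std::uint8_t>;   // one compressed video frame

    // Stand-ins for the real stack (network, codec, display); no-ops here.
    Input poll_input()                       { return {}; }
    Frame exchange_with_server(const Input&) { return {}; }  // send input, receive frame
    void  decode_and_display(const Frame&)   {}

    int main() {
        // The whole client: ship inputs up, pull a compressed frame down, decode,
        // display. All the game simulation and rendering happens on the farm.
        for (int f = 0; f < 60 * 10; ++f) {        // ten seconds at 60 fps
            Input in = poll_input();
            Frame frame = exchange_with_server(in);
            decode_and_display(frame);
        }
    }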

Now, I'm sure some of you are saying "Been there, done that, doesn't work" because I left out something crucial: The client also must send user inputs back to the server, and the server must respond. This cannot just be a typical render cloud, like those used to produce production cinematic animation.

That's where a huge potential hitch lies: Lag. How quickly you get round-trip response to inputs. Onlive must target First Person Shooters (FPS), a.k.a. twitch games; they're collectively the largest sellers. How well a player does in such games depends on the player's reaction time – well, among other things, but reaction time is key. If there's perceived lag between your twitch and the movement of your weapon, dirt bike, or whatever, Onlive will be shunned like the plague because it will have missed the point: Lag destroys your immersion in the game. Keeping lag imperceptible, while displaying a realistic image, is the real reason for high-end graphics.

Of course Onlive claims to have solved that problem. But, while crucial, lag isn't the only issue; here's a list which I'll elaborate on below: Lag, angry ISPs, and game selection & pricing.

Lag

The golden number is 150 msec. If the displayed response to user inputs is even a few msec. longer than that, games feel rubbery. That's 150 msec. to get the input, package it for transport, ship it out through the Internet, get it back in the server, do the game simulation to figure out what objects to change, change them (often, a lot: BOOM!), update the resulting image, compress it, package that, send it back to the client, decompress it, and get it on the screen.
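To get a feel for how tight that budget is, here is a back-of-envelope tally of those stages. Every number below is a placeholder made up for illustration; none comes from Onlive.

    #include <cstdio>

    int main() {
        const double budget_ms = 150.0;   // the "golden number" above
        // Placeholder per-stage estimates, in milliseconds (illustrative only).
        double input_capture = 5, uplink = 20, simulate = 16, render = 16,
               compress = 10, downlink = 25, decompress = 8, display = 16;
        double total = input_capture + uplink + simulate + render +
                       compress + downlink + decompress + display;
        std::printf("round trip ~%.0f ms of a %.0f ms budget\n", total, budget_ms);
        // With numbers like these there is almost no slack left for a slow ISP
        // hop or a dropped frame.
    }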

There is robust skepticism that this is possible.

The Onlive folks are of course exquisitely aware that this is a key problem, and spent significant time talking vaguely about how they solved it. Naturally such talk involves mention of roughly a gazillion patents in areas like compression.

They also said their servers were "like nothing else." I don't doubt that in the slightest. Not very many servers have high-end GPUs attached, nor do they have the client chipsets that provide 16-lane PCIe attachment for those GPUs. Interestingly, there's a workstation chipset for Intel's just-announced Xeon 5500 (Nehalem) with four 16-lane PCIe ports shown.

I wouldn't be surprised to find some TCP/IP protocol acceleration involved, too, and who knows – maybe some FPGAs on memory busses to do the compression gruntwork?

Those must be pretty toasty servers. Check out the fan on this typical high-end GPU (Diamond Multimedia using ATI 4870):

The comparable Nvidia GeForce GTX 295 is rated as consuming 289 Watts. (I used the Diamond/ATI picture because it's sexier-looking than the Nvidia card.) Since games like Crysis can soak up two or more of these cards – there are proprietary inter-card interconnects specifically for that purpose – "toasty" may be a gross understatement. Incendiary, more likely.

In addition, the Onlive presenters made a big deal out of users having to be within a 1000-mile radius of the servers, since beyond that the speed of light starts messing things up. So if you're not that close to their initial three server farms in California, Texas, Virginia, and possibly elsewhere, you're out of luck. I think at least part of the time spent on this was just geek coolness: Wow, we get to talk about the speed of light for real!

Well, can they make it work? Adrian Covert reported on Gizmodo about his hands-on experience trying the beta, playing Bioshock at Onlive's booth at the GDC (the server farm was about 50 miles away). He saw lag "just enough to not feel natural, but hardly enough to really detract from gameplay." So you won't mind unless you're a "competitive" gamer. There were, though, a number of compression artifacts, particularly when water and fire effects dominated the screen. Indoor scenes were good, and of course whatever they did with the demo beach scene in Crysis worked wonderfully.

So it sounds like this can be made to work, if you have a few extra HVAC units around to handle the heat. It's not perfect, but it sounds adequate.


10#
Posted 2009-7-10 00:32
9# samsung
Thanks a lot
