Jeremy Sugerman on the Cell's LS program.

Posted on 2006-6-3 19:27
Both indirectly and first hand I've encountered a lot of people trying to pierce the mystique of the Cell development environment. These tend to be people who are interested in high performance computing, who've seen a lot of the Cell popular press, but who don't know anyone with personal experience porting to and writing for Cell. As a result, they've all heard horrible rumours and speculations about trying to run real applications in 256K of code + stack + data, awkward toolkits, etc.
I'd like to say that all of those rumours are massively overblown. They're spun around kernels of reality, but in many ways developing for Cell is just like developing for a chip multiprocessor (or any other SMP). If you don't want to involve the PPE core in your compute kernels (and you certainly don't need to) then there's write-once support code to spin up the 8 SPEs (or however many you want) and launch your apps. You write it once based on either the sample code or the tutorial and never look back. Occasionally, if your workload needs it, you add some very simple message passing for SPEs to signal the PPE when they need to be sent more work and the PPE to respond when more work is sent. Anyone who has ever written a work queue or used a socket for signalling can do this three days dead. It's no different for Cell than via Win32 events or BSD sockets. The APIs just have their own function names. Hooray.
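The post doesn't show that write-once scaffolding itself, so here is a minimal sketch of what the PPE-side launch code can look like. It assumes the libspe2-style API; the embedded program handle name spe_kernel, the per-SPE argument struct, and the omission of error checking are illustrative choices, not anything from the original.

/* Sketch of PPE-side scaffolding, assuming the libspe2 API.
 * "spe_kernel" is a hypothetical embedded SPE ELF image; the real name
 * comes from however you embed or load your SPE binary. */
#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 8

extern spe_program_handle_t spe_kernel;   /* embedded SPE ELF image (assumed) */

struct spe_arg { int id; };               /* per-SPE work descriptor (assumed) */

static void *run_spe(void *p)
{
    struct spe_arg *arg = p;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Create a context, load the SPE ELF into it, and run it.
     * spe_context_run blocks until the SPE program stops, so each
     * context gets its own PPE-side pthread. */
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &spe_kernel);
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_SPES];
    struct spe_arg args[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++) {
        args[i].id = i;
        pthread_create(&threads[i], NULL, run_spe, &args[i]);
    }
    for (int i = 0; i < NUM_SPES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

The "very simple message passing" he mentions maps onto the mailbox calls (for example spe_out_mbox_read on the PPE and spu_read_in_mbox on the SPU), which behave much like the Win32 events or sockets he compares them to.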

Okay, so now you've gotten your SPE up and running. To be precise, just like starting a thread on any other OS, you've issued a library / system call that took a chunk of code (the ELF binary compiled for the SPE), an entry point (thread main), a void * with initial arguments, and some unimportant optional flags. What about the SPE code? You take gcc and hand it vanilla C code. Or, you take xlc and hand it vanilla C code if you think xlc is more elite than gcc (we don't. Other people around here do. It seems to vary according to personal taste and application). Okay, so it's not quite that easy. You can transparently use stack allocated memory and static / global arrays or objects that are small enough to fit in local store. You cannot transparently malloc huge chunks of memory or dereference pointers to large, system memory backed regions. However, you can mechanically convert every dereference of your big data structures into synchronous DMA and you'll come out with working code. If you're writing from scratch or have anything resembling an accessor function, this is near trivial. The DMA builtins actually take system memory pointers as their argument without any translation or anything. We hit this point with the ray tracer after a few days' messing around. And a major portion of that time was browsing reference material and one-time cobbling together of a Makefile with the correct include paths and what-not to run the cross-compile toolchain.
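As a concrete illustration of "mechanically convert every dereference into synchronous DMA", a sketch along these lines, assuming the spu_mfcio.h DMA intrinsics; struct node, load_node, and the 16-byte record size are made up for the example and are not the ray tracer's actual code.

/* SPE-side accessor: a dereference of a system-memory record becomes a
 * synchronous DMA get into local store. */
#include <spu_mfcio.h>
#include <stdint.h>

struct node { float x, y, z, pad; };      /* example 16-byte record (assumed) */

#define TAG 0

/* Fetch one record from system memory (effective address ea, assumed
 * 16-byte aligned) into a local-store buffer and wait for completion. */
static void load_node(struct node *ls_buf, uint64_t ea)
{
    mfc_get(ls_buf, ea, sizeof(struct node), TAG, 0, 0);  /* start the DMA  */
    mfc_write_tag_mask(1 << TAG);                         /* select our tag */
    mfc_read_tag_status_all();                            /* block until done */
}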

That's it. Porting in a nutshell (I know, why is it in a nutshell and how do you get it out? Don't ask). But, but, but, you splutter, what about having to restructure your whole application? What about fitting your code + stack + data in 256KB? You don't have to restructure anything. You always can run on the PPE. If you want performance from the SPEs, you will have to multithread at least part of your application into at least 8 threads (or however many SPEs you want to use). However, if you want performance with any chip including current conventional CPUs, you have to multithread the computationally dense portion of your application. Multithreading for Cell isn't intrinsically different from multithreading for a "normal" CPU. So, while I'm happy to grant that having to multithread your code to get performance can be a pain, it's no special barrier unique to Cell. Now, as for fitting in 256KB, well, that's a ton of space. I'm clearly a child of the wrong era and it's about to be very clear, but once all your application data is excluded (it lives in system memory, not local store), 256KB is great. With four byte instructions (I have no idea what Cell's instruction width is, but 4 bytes is a fine proxy), splitting LS half and half between instructions and data gets you 32K static instructions, which is huge for a computational kernel. Moreover, dynamic linking just plain isn't hard and it naturally combines with overlays to allow arbitrarily large code executed in a fixed amount of space. That's too hard for you? Luckily only one guy needs to write it for Cell and we can all use it. However, that's really more for the future. The whole raycasting portion of our code compiles down to 60KB or so. Similarly, the stack size limitation just doesn't seem interesting to me. I've probably spent too long writing kernel code and other specialized code, but if you need more than 4KB, or maybe a whole 16KB, of stack space then you're not writing for performance. And if you don't want performance, get that code back on the PPE where it won't bother us. All told, in a fairly pessimistic scenario, you're left with 128KB or 96KB of space for data. If you're just replacing pointer dereferences with DMA that's tons of space. Actually, there's 2KB of register file (128 x 4-word wide) and that's probably enough.

What's that? Replacing pointer dereferences with DMA is unusably slow? So it is. If you grant my conservative assessment that there's 96KB of LS available for data, then just use it as a simple cache. It takes a day or so to code a simple direct mapped cache (remember, we're discussing how hard it is to get code up and running on Cell. By this point, the naive code was working in the last paragraph and we're just looking for any cheap extra boost) and it's a tiny amount of extra instructions. That 96KB of cache is 6x the L1 on a Pentium 4 (and you've got 2KB of registers where the Pentium 4 has to use its L1 to compensate for its tiny handful of registers). So, in exchange for another day of porting (we're almost up to 1 week for 1 grad student who's easily distracted and a little lazy) not only have you ported your app, you've smoothed out the most unreasonable shortcut you used to get the port working.
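A sketch of the kind of simple direct mapped cache he describes might look like the following; the 128-byte line size, the 768-line (96KB) capacity, and all the names are assumptions for illustration, not the group's actual code.

/* Direct-mapped software cache over local store, backed by DMA. */
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_SIZE   128                      /* bytes per cache line (assumed) */
#define NUM_LINES   768                      /* 768 * 128B = 96KB of LS        */
#define DMA_TAG     1

static uint8_t  cache_data[NUM_LINES][LINE_SIZE] __attribute__((aligned(128)));
static uint64_t cache_tag[NUM_LINES];        /* effective address of cached line */
static int      cache_valid[NUM_LINES];

/* Return a local-store pointer to the byte at effective address ea,
 * fetching the enclosing 128-byte line on a miss. */
static void *cache_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_SIZE - 1);
    unsigned idx     = (unsigned)((line_ea / LINE_SIZE) % NUM_LINES);

    if (!cache_valid[idx] || cache_tag[idx] != line_ea) {
        mfc_get(cache_data[idx], line_ea, LINE_SIZE, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();           /* synchronous: wait for the line */
        cache_tag[idx]   = line_ea;
        cache_valid[idx] = 1;
    }
    return &cache_data[idx][ea - line_ea];
}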

If your application doesn't benefit from caching then the bad news is implementing the simple cache won't help you on Cell. However, the worse news is that your execution time on normal CPUs is already as bad as the synchronous DMA version on Cell. Or else, you have some a priori knowledge of your algorithm that lets you prefetch or do some other contortions on normal CPUs to get performance up. In that case, there's some good news-- there are no games on the SPE-- you just tell the DMA controller what you want it to prefetch and it goes and does it. No funny "we may or may not honour the hint" prefetch or non-temporal write instructions and no irritating hardware memory hierarchy working to thwart you. Programmer say, Cell do. Seriously, if you are lucky enough to have one of the workloads whose access pattern is structured then Cell is just great. Rather than hinting (or tricking) a CPU into doing the right thing to get bandwidth to memory, you lay it out explicitly for the DMA controller and it happens.
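For a structured access pattern, "telling the DMA controller what to prefetch" usually comes down to issuing the next mfc_get before you need the data, i.e. classic double buffering. Below is a sketch assuming the spu_mfcio.h intrinsics, a 16KB chunk size, and a hypothetical process_chunk() kernel; none of this is from the original post.

/* Double-buffered streaming: overlap the DMA for chunk i+1 with the
 * computation on chunk i. Assumes nbytes is a nonzero multiple of CHUNK. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                           /* 16KB, the max single DMA size */

static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(uint8_t *data, int n);   /* hypothetical kernel */

static void stream(uint64_t ea, uint64_t nbytes)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* prefetch first chunk */

    for (uint64_t off = 0; off < nbytes; off += CHUNK) {
        int next = cur ^ 1;
        if (off + CHUNK < nbytes)                      /* prefetch the next chunk */
            mfc_get(buf[next], ea + off + CHUNK, CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                  /* wait only for current */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);
        cur = next;
    }
}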

Anyhow, bottom line, we had a working port of the ray tracer in a couple of days and a reasonable starting point to begin analysis and optimization within about a week. The development environment / tool chain is fine (it's not *awesome*, but it's gcc). And the code is pretty much normal multithreaded C code with some funny Cell specific calls instead of pthread or Win32 calls for the scaffolding. So stop the fear mongering. Thanks.

Jeremy Sugerman works on visual computing, for example ray tracing. The text above is reposted from:
http://graphics.stanford.edu/~yoel/

Last year his group got hold of a Cell blade, and since then they have been writing and porting their code to Cell to tap its enormous GFLOPS horsepower.