IBM CTO on CELL Processor

1#
Posted on 2006-5-21 13:15
The Sum Is Greater Than the Whole

Bernard Meyerson, chief technologist for IBM’s Systems & Technology Group, sat down with Electronic News to discuss the Cell processor, the importance of Moore’s Law and the future of electronics design. What follows are excerpts of that conversation.

Electronic News: What markets do you envision the Cell processor playing in and why?
Meyerson: The Cell processor -- or more correctly, the architecture around it -- leverages a combination of strategies that IBM brought to the market going back four or five years ago. Cell is more of a holistic or system view of what a processor has to be. Typically, in the past, people have constrained their thinking to megahertz and number crunching without actually going beyond that into the ultimate application of the device. In our case, what we’ve done is architect a solution as opposed to just a processor. What that means is there will be areas where it will be very powerful, and areas that will be inappropriate because you’ve chosen an architecture whose optimization is for a different space of applications.

Electronic News: Such as?
Meyerson: Cell, given its intended utilization, has tremendous capabilities in terms of handling acceleration of video and other imaging. As a consequence, it is designed to have tremendous processing capabilities through a multitude of accelerators working in concert with a core processor. That doesn’t make it a general purpose computer or a supercomputer, but for the application for which it’s intended, it provides as much as a 10x uplift relative to other processors today. It’s similar to the strategy we’ve been on in our systems unit. In the mid-1990s, we recognized the end of the road map for driving system performance, which is all the customer cares about, through pure megahertz.

Electronic News: So what about clock frequency?
Meyerson: Clock frequency is not the driver of system performance. That’s one of the great fallacies. And it becomes, to some extent, a red herring because clock frequency doesn’t guarantee system performance. It simply measures one metric. In the mid-1990s, we began looking at the alternative. That’s the total processing capability of a system, beginning at the chip level and working outward through a number of auxiliary processors, memory, cache, etc. If you take the broader view, you recognize that you can get a better result by making tradeoffs to balance numerous aspects of performance -- clock frequency, power utilization, actual processing or data throughput, integration of the processor elements with the rest of the system in terms of communication buses, the bus architecture, the software that implements each of those processor attributes at a different time in a process -- all of these come together and give a better result. As an example, in 2001 IBM brought to market the first multicore processors, our Power 4 architecture.

Electronic News: They were on the high-end servers, correct?
Meyerson: Yes. We did this quietly, too. But it was a revolutionary thought in the marketplace, and it’s something you can leverage for quite some time until the necessity of going in that direction becomes evident to all. The necessity is driven by the fact that when you reduce the clock frequency you get a highly non-linear benefit -- what’s known as a superlinear benefit -- in power reduction. That’s another way of saying if you take a processor capable of running at 4 gigahertz and turn down the operating voltage by perhaps 20 percent, the actual speed might only decline a very much smaller amount -- perhaps 10 percent. Put another way, if you run a processor at half the frequency it’s capable of, you could conceivably cut its power consumption by a factor of five. That kind of superlinear benefit to power means it may be more beneficial to put multiple slower processors on one die versus attempting to build one processor that runs at tremendous speed. Now we all nod our heads, but in 1996 that was revolutionary.
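
A back-of-the-envelope way to see the superlinearity is the standard dynamic-power approximation for CMOS logic, where power is roughly proportional to C·V²·f. The sketch below is a minimal illustration of that relationship; the voltage and frequency figures are assumptions for illustration, not IBM's numbers.

```python
# Minimal sketch of the superlinear power benefit, assuming the usual
# dynamic-power approximation P ~ C * V^2 * f for CMOS logic.
# The specific voltage/frequency pairs below are illustrative only.

def dynamic_power(capacitance, voltage, frequency):
    """Switching power of CMOS logic, proportional to C * V^2 * f."""
    return capacitance * voltage**2 * frequency

C = 1.0  # arbitrary units; it cancels out of the ratios below
baseline = dynamic_power(C, voltage=1.0, frequency=4.0e9)   # 4 GHz at nominal voltage

# Turn the voltage down ~20%; suppose the achievable clock drops only ~10%.
reduced = dynamic_power(C, voltage=0.8, frequency=3.6e9)
print(f"-20% V, -10% f: {reduced / baseline:.2f}x of baseline power")   # ~0.58x

# Run at half the frequency and let the voltage come down with it.
half_speed = dynamic_power(C, voltage=0.63, frequency=2.0e9)
print(f"half frequency: {half_speed / baseline:.2f}x of baseline power")  # ~0.20x, i.e. ~5x less

# Two such half-speed cores on one die can deliver comparable throughput
# to the original core for well under its power budget.
```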

Electronic News: Do you foresee systems then with multiple chips doing one function rather than a single chip performing multiple functions?
Meyerson: The answer is yes, in general, but not just restricted to Cell. Cell is certainly capable of that. But as an architectural statement, look at a system like Blue Gene (IBM’s newest supercomputer): it is a multitude of Power-based cores running two per chip, and the chip itself is far more than a processor unit; it has an entire architecture represented upon it. It involves a remarkable set of networks to coordinate the activities of a multitude of Blue Gene processors in a system.

Electronic News: But isn’t that the harder part, setting up the networks that all work together?
Meyerson: Yes. That’s why systems such as Blue Gene, and ideas such as Cell, involve a tremendous depth of communication amongst the chips and coordination capabilities across processors -- in addition to raw processing horsepower. Let me give you an example of why this is critical: If you’re going to architect a large system, the control or networking functions play a crucial role in that they have to be able to perform a variety of control tasks that prevent you from using a single network type. In the extreme case where you simply want to issue a global command to the entire system, the bandwidth you need is extraordinarily low because the command may take the form of literally issuing a stop bit where you momentarily halt execution to perform some critical function. That means the bandwidth can be very low for that network, but the latency of issuing that command has to approach zero to make sure you can synchronize the system. That’s one extreme. In another extreme, when you’re simply storing data from the result of each processor’s endeavors, you need tremendous bandwidth to store lots and lots of data, but you don’t care about the latency -- the delay in the data arriving at a storage unit -- because you’re not recovering data. You’re simply depositing a result. So you have these wildly different networks, one with low bandwidth but no latency, the other with extraordinary bandwidth but where you don’t care about delay. This drives you to have multiple networks and multiple capabilities. In Blue Gene, there are five different network types. That’s a holistic approach.
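
As a hypothetical sketch of that tradeoff (the network names and figures below are illustrative, not Blue Gene's actual five networks): once you choose a fabric by latency deadline and transfer time, a global control broadcast and a bulk result deposit naturally land on very different networks.

```python
# Illustrative sketch: different traffic classes want different fabrics,
# so one network type cannot serve them all. Names and numbers are
# assumptions for illustration, not Blue Gene's real networks.
from dataclasses import dataclass

@dataclass
class Network:
    name: str
    latency_us: float       # time for a message to start arriving
    bandwidth_gbps: float   # sustained transfer rate

networks = [
    Network("global-control", latency_us=1,   bandwidth_gbps=0.001),  # near-zero latency, tiny messages
    Network("bulk-storage",   latency_us=500, bandwidth_gbps=100.0),  # latency-tolerant, huge transfers
]

def pick_network(payload_bytes: int, deadline_us: float) -> Network:
    """Pick the fastest-delivering network among those meeting the latency deadline."""
    candidates = [n for n in networks if n.latency_us <= deadline_us]
    def delivery_time_us(n: Network) -> float:
        transfer_us = payload_bytes * 8 / (n.bandwidth_gbps * 1000)  # Gb/s -> bits per microsecond
        return n.latency_us + transfer_us
    return min(candidates, key=delivery_time_us)

# A global "stop" bit: one byte, but it must land almost immediately.
print(pick_network(payload_bytes=1, deadline_us=2).name)             # global-control
# Depositing a gigabyte of results: latency is irrelevant, bandwidth is everything.
print(pick_network(payload_bytes=10**9, deadline_us=100_000).name)   # bulk-storage
```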

Electronic News: So let’s tie this back to the Cell processor.
Meyerson: In Cell, you have an endless amount of video capability, which is very data intensive. You have multiple parallel units capable of handling the processing of that data. Again, what we’re doing now is engineering solutions. This is a bit like the maturation of the semiconductor industry.

Electronic News: What does that mean for the Moore’s Law road map?
Meyerson: Moore’s Law and its continuance is an economic rather than a technical statement. The continuance of Moore’s Law is actually not relevant except as a cost statement for future chips. It is not a performance metric. It has been associated incorrectly with additional performance. There are a series of laws called Classical Scaling that are the glue that allowed you to make that extension. For example, if you say it’s going to shrink by 2x, you would assume that there would be a resulting improvement in performance. The fact that the area of the chip shrank by 2x had nothing to do with why the chip was faster or smaller. There are far more elements in a transistor that had to be shrunk that had nothing to do with the area. The key thing that’s happened is Classical Scaling -- the glue that connected Moore’s Law to performance -- terminated about three years ago. That was because some elements in devices no longer scaled.

Electronic News: At what node?
Meyerson: At about 130 nanometers we started to see the breakage. What occurred was that if you were unaware of this disconnect and extrapolated to 90 nanometers, you ran into serious problems with the power density of chips. The reason classical scaling failed was that key elements of the device, such as gate oxides, simply stopped scaling. You reached the point that to scale the oxide thickness was impossible because of reliability issues, current leakage through thin oxides, and other issues. Failing to scale the oxides also meant you couldn’t scale the operating voltage of a processor, because if you scaled the voltage downward without scaling the oxide thinner you lost performance. What happened was people were forced to maintain too high an operating voltage to meet their performance commitments and the consequence of maintaining that higher voltage was unacceptable power density. That was because scaling failed. Moore’s Law is not relevant. It’s strictly an economic statement about the size of a chip and the number of elements on that chip over time. That’s a cost statement, not a performance statement. Going forward, you can say Moore’s Law continues because you will continue to make each generation smaller. However, you cannot make that statement any longer without simultaneously pointing to the innovations you will be introducing to compensate for the fact that further shrinking of the chip does not ensure higher performance. That’s the key.
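
One way to see the arithmetic behind that breakage (a minimal sketch with illustrative scaling factors, not data from any real process): under classical scaling the voltage shrinks along with the dimensions and power density stays flat, but once the oxide, and with it the voltage, stops scaling, each further shrink pushes power density up.

```python
# Minimal sketch of classical (Dennard-style) scaling vs. the broken case,
# assuming dynamic power per unit area ~ (C/area) * V^2 * f.
# The shrink factor and ratios are illustrative, not process data.

def power_density(cap_per_area, voltage, frequency):
    """Dynamic power per unit area, proportional to (C/area) * V^2 * f."""
    return cap_per_area * voltage**2 * frequency

k = 0.7  # illustrative linear shrink factor from one node to the next
base = power_density(cap_per_area=1.0, voltage=1.0, frequency=1.0)

# Classical scaling: oxide, voltage, and dimensions all shrink together.
classical = power_density(cap_per_area=1.0 / k, voltage=k, frequency=1.0 / k)

# Scaling breaks: the oxide (and hence the voltage) stays put while you
# still shrink the dimensions and raise the clock.
broken = power_density(cap_per_area=1.0 / k, voltage=1.0, frequency=1.0 / k)

print(f"classical scaling: {classical / base:.2f}x power density")  # ~1.0x, flat
print(f"voltage stuck:     {broken / base:.2f}x power density")     # ~2.0x per generation
```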

Electronic News: So what’s going to improve performance in the future?
Meyerson: Innovation. That will be the driver of performance, rather than scaling.

Electronic News: You’re talking about building solutions rather than chips. Do your metrics now become the system rather than the chip?
Meyerson: In truth, yes. In smaller devices, you will attempt to introduce the system on the chip and therefore they become one and the same, so you need a greater diversity of content to make that chip truly appealing as a product. That is an extreme where you integrate communications capability, memory buffering, all of the necessary hooks to enable power control, and essentially the elements you might find in a larger system all compressed on a single die. Looking at the higher level of systems, then, your true performance differentiation will depend upon how inclusive you’ve been in your system design. By holistic design, we mean designing a chip that supports virtualization. In addition to the implementation of physical chip partitioning -- multiple cores that can support multiple computing threads -- you can at the system level go one better. You can design a chip that supports micropartitioning by software, because the horsepower of even half of one core supporting one thread may far exceed the need for compute power in some instances. Therefore, in order to give the customer the best cost/performance possible, we micropartition that one thread’s capability an additional 10 ways. When you virtualize the asset, you can have a hypervisor looking at the workload, determining the total compute power required by that workload, and assigning as little as one-tenth of one half of the chip’s capability to service it.
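
A hypothetical sketch of that micropartitioning idea follows (not IBM's hypervisor, just an illustration of granting a workload as little as one-tenth of a thread's capacity and returning it to a shared pool afterward).

```python
# Illustrative sketch of micropartitioning: a hypervisor carves one hardware
# thread into tenths and grants each workload only the slices it needs.
# The tenth-of-a-thread granularity follows the interview; the classes and
# numbers here are assumptions for illustration, not IBM's implementation.
import math
from dataclasses import dataclass, field

@dataclass
class HardwareThread:
    total_slices: int = 10                       # one thread carved into tenths
    free_slices: int = 10
    assignments: dict = field(default_factory=dict)

class Hypervisor:
    """Watches workload demand and hands out fractional slices of one thread."""

    def __init__(self, thread: HardwareThread):
        self.thread = thread

    def assign(self, workload: str, demand: float) -> float:
        """Grant the fewest tenth-of-a-thread slices that cover `demand` (0.0-1.0)."""
        t = self.thread
        wanted = math.ceil(demand * t.total_slices)
        granted = min(wanted, t.free_slices)
        t.free_slices -= granted
        t.assignments[workload] = t.assignments.get(workload, 0) + granted
        return granted / t.total_slices

    def release(self, workload: str) -> None:
        """Return a finished workload's slices to the shared pool."""
        t = self.thread
        t.free_slices += t.assignments.pop(workload, 0)

hv = Hypervisor(HardwareThread())
print(hv.assign("credit-card-verify", demand=0.03))  # 0.1 -- one tenth of the thread is plenty
print(hv.assign("video-transcode",    demand=0.55))  # 0.6 -- rounded up to six tenths
hv.release("credit-card-verify")                     # slices return to the pool for the next job
```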

Electronic News: Is this done architecturally or dynamically?
Meyerson: It’s dynamic. We have the ability to dynamically reassign the system’s capabilities as required, on the fly, 24 by 7. For example, you have an incoming workload that’s a series of transactions. Transactions occur on a human timescale, meaning you punch in your credit card number and it’s verified by the system. You punch in the amount you’re going to pay and it’s verified by the system. It’s very slow. You want to be able to assign a minuscule fraction of the system’s capabilities to compute that. Virtualization enables you to do that and then put it back into a pool that can be used as it’s needed. To tie up an entire processor or one thread of a processor to service a simple addition would be a huge waste of capability.

Electronic News: Let’s take this up a few levels. What does all of this mean to the average person?
Meyerson: Despite the clear discontinuity in the trajectory of clock frequency, I do not see a discontinuity in the performance of IT. We have become accustomed to enjoying a 60 [percent] to 90 percent benefit in performance at the customer level each year, which will continue uninterrupted, if not accelerated by a new focus on holistic design.

Electronic News: What’s the next bottleneck, then? Is it still going to be outside bandwidth?
Meyerson: If you practice holistic design correctly, you achieve a balance where no one element becomes the laggard. That’s not to say there isn’t more work to be done to improve performance. But it becomes an issue of the cost/benefit of that improvement. You can add more parallel paths, but is it necessary for the particular application you have envisioned or is the application served well enough? Different markets will now behave differently. In communications and communicating devices, where the standards are set by governmental and industry groups, there is to some realistic level an external limiter that defines what’s good enough. The other extreme is that there will be enormous progress in the high end of computing due to the arrival of fundamentally new architectures. Blue Gene is the ultimate example of that. People have missed the significance of Blue Gene. It’s not that it’s the world’s fastest supercomputer. That’s ancillary. It’s a proof point, but not the discontinuity. The discontinuity comes when you compare Blue Gene with the machines it can outperform. It occupies about 1/100th of the floor space. It’s 1/100th the size. It’s roughly 1/28th the power for better performance. That’s the discontinuity. That’s a seminal shift in how you get the job done.

Electronic News: Doesn’t that mean you’re no longer confined to computers as we’ve known them, mainly a single box?
Meyerson: Absolutely. The space is opened up to explore new paradigms in both the scale-up dimension and the scale-out dimension in terms of system architectures. The world of differentiation has moved to the levels of systems and integration from the raw horsepower of the microprocessor. The microprocessor remains a key element, but it must be linked to the architecture that one is going to support at the system level. That’s why you don’t have one size fits all.

Electronic News: Is the system distributed, too?
Meyerson: It can be distributed, or it can be local, depending upon what you’re attempting to achieve. The systems have become immensely more powerful within their own footprint. They can become even more powerful by sharing resources through virtualization with remote locations across the world. We thought a long time about autonomic computing, where the computer is self-maintaining, self-optimizing. The key point is that the reality is here. We are now at the point where your machine is, in fact, self-protecting. Once in a while the message that appears on your screen is not that a virus ate your system, but that your system ate a virus. This is an incredibly powerful thing we’ve done. As you network more freely, you obviously have exposure to those who would attempt to break the network. The good news is you have tremendous software and security tools available to mitigate the threat. The benefit, though, is enormous, because you’re moving into the autonomic world, which allows you to take the next step.

Electronic News: And the next step is?
Meyerson: The inordinate and disproportionate increase in the efficiency of the enterprise. Because you’re able to access these capabilities, they have become fully on-demand. On-demand is not a buzzword. It’s a description of how the ideal system behaves, where you don’t have the equivalent of inventory. Inventory is anathema to the corporate balance sheet. Yet you don’t have the other anathema, which is a dramatic shortfall of resources when you need them.

Electronic News: This sounds a lot like the model for lean manufacturing.
Meyerson: This is a longstanding promise. Invention without execution is a sin. Innovation is the magic, where you not only invent, you reduce it to practice and then you take it to the world with some benefit to all those involved. Practicing innovation that matters -- innovation that gets out into the commercial world and the public domain and really enables a new capability and a new business -- is remarkable.
2#
Posted on 2006-5-21 21:19
Could someone post a translation???

