The Era of Tera: Intel Reveals more about 80-core CPU [ISSCC2007]

1#
Posted on 2007-2-12 18:29
The Era of Tera: Intel Reveals more about 80-core CPU
Date: Feb 11, 2007
Type: CPU & Chipset
Manufacturer: Intel
Author: Anand Lal Shimpi

Page 1
With no Spring Intel Developer Forum happening this year in the US, we turn to the International Solid-State Circuits Conference (ISSCC) for an update on Intel's ongoing R&D projects. Normally we'd hear about these sorts of research projects on the final day of IDF, these days presented by Justin Rattner, but this year things are a bit different. The main topic at hand today is one of Intel's Tera-scale computing projects, but before we get to the chip in particular we should revisit the pieces of the puzzle that led us here to begin with.

Recapping Tera-Problems

At the Spring 2005 Intel Developer Forum, Justin Rattner outlined a very serious problem for multi-core chips of the future: memory bandwidth. We're already seeing these problems today, as x86 single, dual and quad core CPUs currently all have the same amount of memory bandwidth. The problem becomes even more severe when you have 8, 16, 32 and more cores on a single chip.


The obvious solution to this problem is to use wider front side and memory buses that run at higher frequencies, but that solution is only temporary. Intel's slide above shows that a 6-channel memory controller would require approximately 1800 pins, and at that point you get into serious routing and packaging constraints. Simply widening the memory bus and relying on faster memory to keep up with the scaling of cores on CPUs isn't sufficient for the future of microprocessors.
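To put some numbers on the "wider and faster buses" approach: bandwidth scales linearly with channel count, and so do pins. The per-channel figure below is the standard DDR2-800 number; the sketch is illustrative, not taken from Intel's slide.

    # Rough sketch: aggregate bandwidth from adding DDR2-800 memory channels.
    # 64-bit data bus * 800 MT/s = 6.4 GB/s per channel (standard DDR2-800 figure).
    PER_CHANNEL_GBPS = 6.4

    for channels in (1, 2, 4, 6):
        print(f"{channels} channel(s): ~{channels * PER_CHANNEL_GBPS:.1f} GB/s")
    # Bandwidth grows linearly, but so do signal pins and routing complexity,
    # which is why Intel pegs a 6-channel controller at roughly 1800 pins.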

So what do you do when a CPU's access to data gets slower and more constrained? You introduce another level in the memory hierarchy, of course. Each level of the memory hierarchy (register file, L1/L2/L3 cache, main memory, hard disk) is designed to help mask the latency of accessing data at the level below it. The clear solution to keeping massively multi-core systems fed with data, then, is to simply put more memory on die, perhaps an L4 cache?
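The reason each extra level helps is captured by the standard average memory access time relationship; the latencies and miss rates below are assumed for illustration, not measured figures.

    # Average memory access time (AMAT): each cache level masks the latency of
    # the level below it.  AMAT = hit_time + miss_rate * next_level_AMAT.
    def amat(levels, dram_cycles=300.0):
        """levels: list of (hit_time_cycles, miss_rate), fastest level first."""
        total = dram_cycles                      # assumed DRAM latency in cycles
        for hit_time, miss_rate in reversed(levels):
            total = hit_time + miss_rate * total
        return total

    # Assumed numbers: L1 hits in 3 cycles with a 10% miss rate, L2 in 14 with 30%.
    print(amat([(3, 0.10), (14, 0.30)]))         # ~13.4 cycles vs. ~300 straight to DRAM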

The issue you run into here is that CPU die space is expensive, and the amount of memory we'd need to keep tens of cores fed is more than a few megabytes of cache can provide. Instead of making the CPU die wider, Intel proposed stacking multiple die on top of each other. A CPU die, composed of many cores, would simply be one layer in a chip that has integrated DRAM or Flash or both. Since the per-die area doesn't increase, the number of defects per die doesn't go up either.


Memory bandwidth improves tremendously, as your DRAM die can have an extremely wide bus to connect directly to your CPU cores. Latency is also much improved as the CPU doesn't have to leave the package to get data stored in any of the memory layers.

Obviously there will still be a need for main memory, as Intel is currently estimating that a single layer could house 256MB of memory. With a handful of layers, and a reasonably wide external memory bus, keeping a CPU with tens of cores fed with data now enters the realm of possibility.
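A quick sketch of why a stacked die helps so much on the bandwidth side: on-package wires can be made very wide. The bus width and clock below are assumptions chosen for illustration, not figures Intel has disclosed.

    # Bandwidth = bus width (bytes) * transfer rate.
    def bandwidth_gbs(bus_bits, mhz, transfers_per_clock=1):
        return bus_bits / 8 * mhz * transfers_per_clock / 1000   # GB/s

    print(bandwidth_gbs(64, 800))    # ~6.4 GB/s: a conventional 64-bit DDR2-800 channel
    print(bandwidth_gbs(1024, 400))  # ~51 GB/s: hypothetical 1024-bit on-package bus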

A year and a half later, Rattner was back but this time he was tackling another aspect of the era of tera - bus bandwidth. Although 3D die stacking will help keep many cores on a single die fed with data, the CPU still needs to communicate with the outside world. FSB technology, especially from Intel, has remained relatively stagnant over the past several years. If we're talking about building CPUs with tens of cores, not only will they need tons of memory bandwidth but they'll also need a very fast connection to the outside world.

Intel's research into Silicon Photonics has produced a functional hybrid silicon laser demonstrated at the Intel Developer Forum late last year. The idea is that optical buses can offer much better signaling speed and power efficiency than their electrical equivalents, resulting in the ideal bus for future massively multi-core CPUs.

Justin Rattner's keynotes talked about some of Intel's Tera-scale projects, with 3D die stacking delivering the terabytes of bandwidth needed for the next decade of CPUs and silicon photonics enabling terabits of I/O for connecting these CPUs to the rest of the system. The final vector that Rattner spoke about was delivering a teraflop of performance. The CPU Rattner spoke of was a custom design by Intel that featured 80 cores on a single die, and today Intel revealed a lot more about its Teraflop CPU, the architecture behind it and where it fits in with the future of Intel CPUs.





Page 2: The Chip

As its name implies, the Teraflops Research Chip is a research vehicle and not a product. Intel has no intention of ever selling the chip, but technology used within the CPU will definitely see the light of day in future Intel chip designs.

The Teraflops chip is built on Intel's 65nm process and features a modest, by today's standards, 100M transistors on a 275mm^2 die. As a reference point, Intel's Core 2 Duo, also built on a 65nm process, features 291M transistors on a 143mm^2 die. The reason the Teraflops chip is so large given its relatively low transistor count is that there's very little memory on the chip itself, whereas around half of Intel's Core 2 die is made up of L2 cache. Besides being predominantly logic, the Teraflops chip also carries a lot of I/O circuitry, which can't be miniaturized as well as most other circuits, resulting in a larger overall die. The chip features 8 metal layers with copper interconnects.
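For a sense of scale, the transistor density works out as follows (simple arithmetic on the figures quoted above):

    # Transistor density comparison from the figures above.
    teraflops_chip = 100e6 / 275   # ~0.36M transistors per mm^2 (mostly logic and I/O)
    core_2_duo     = 291e6 / 143   # ~2.0M transistors per mm^2 (roughly half L2 cache)
    print(f"Teraflops chip: {teraflops_chip / 1e6:.2f}M/mm^2")
    print(f"Core 2 Duo:     {core_2_duo / 1e6:.2f}M/mm^2")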

The Teraflops chip is built on a single die composed of 80 independent processor cores, or tiles as Intel is calling them. The tiles are arranged in a rectangle 8 tiles across and 10 tiles down; each tile has a surface area of 3mm^2.

The chip uses an LGA package like Intel's Core 2 and Pentium 4 processors, but features 1248 pins. Of the 1248 pins on the package, 343 are used for signaling while the rest are predominantly power and ground.

The chip can operate at a number of speeds depending on its operating voltage, but the minimum clock speed necessary to live up to its teraflop name is 3.13GHz at 1V. At that speed and voltage, the peak performance of the chip with all 80 cores active is 1 teraflop while drawing 98W of power. At 4GHz, the chip can deliver a peak performance of 1.28 TFLOPS, pulling 181W at 1.2V. On the low end of the spectrum, the chip can run at 1GHz, consuming 11W and executing a maximum of 310 billion floating point operations per second.
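Dividing the quoted throughput by the quoted power gives a feel for how efficiency scales with voltage and frequency; this is straight arithmetic on the numbers above.

    # Performance per watt at the three quoted operating points.
    points = [("1.0 TFLOPS @ 3.13GHz / 1.0V", 1000, 98),
              ("1.28 TFLOPS @ 4GHz / 1.2V",   1280, 181),
              ("0.31 TFLOPS @ 1GHz",           310, 11)]
    for label, gflops, watts in points:
        print(f"{label}: {gflops / watts:.1f} GFLOPS/W")
    # Roughly 10, 7 and 28 GFLOPS/W: dropping voltage and clock buys a
    # disproportionate amount of efficiency.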




Page 3: The Architecture

Despite being built on a large die, the individual tiles in the Teraflops chip are extremely simple cores. These aren't x86 cores, although Intel indicated that one of the next steps for the project is to integrate x86 cores. At a high level, each tile is composed of a Processing Engine (PE) that handles all computations and a 5-port router that passes data from one tile to the next.

In order to keep the tile hardware as simple as possible, the tiles are based on a 96-bit Very Long Instruction Word (VLIW) architecture. Intel's other famous VLIW architecture is of course Itanium, but the two designs have very little else in common. In short, a VLIW architecture simplifies the hardware by relying on the compiler to schedule instructions for execution rather than having the CPU figure out how to dynamically parallelize and schedule operations. VLIW isn't common for desktop architectures, but for specialized applications it's not far-fetched: the number of applications you have to run on a chip like this is limited, so adding complexity on the compiler side isn't such a bad tradeoff.
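To make the idea concrete, here is a purely hypothetical sketch of how a compiler statically packs independent operations into one wide instruction word. The slot layout and mnemonics are invented for illustration and are not Intel's actual encoding.

    # Hypothetical VLIW bundle: the compiler, not the hardware, decides which
    # operations are independent enough to issue together in one cycle.
    bundle = {
        "slot0": "FPMAC f2, f0, f1",    # multiply-accumulate on the first FP unit
        "slot1": "FPMAC f5, f3, f4",    # independent operands, second FP unit
        "slot2": "LOAD  f6, [dmem+8]",  # fetch the next operand from local data memory
        "slot3": "SEND  north, f2",     # hand a result to the on-tile router
    }
    # The hardware simply issues every slot in parallel; if the compiler cannot
    # find enough independent work, it has to pad the remaining slots with NOPs.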

There are obvious drawbacks to going with a VLIW architecture, but it appears that Intel's fundamental goal with the teraflops chip was to work out how to implement a many-core CPU, not necessarily to deliver a high-performance one.

The processing engine is composed of a 3KB single-cycle instruction memory, a 2KB data memory, a 10-port register file, and two single-precision floating point multiply-accumulate units with single-cycle throughput.

A maximum of 8 operations can be encoded in a single VLIW instruction on the teraflops chip. Those operations can be FPMACs, loads/stores, or instructions to the tile's router, since each tile can pass data and instructions on to any adjacent tile.

Although the chip itself is capable of processing over one trillion floating point operations per second, don't be fooled by the numbers; these aren't 128-bit FP operations but rather single-precision FP operations. Each tile features two fully pipelined 32-bit floating point multiply-accumulate (FPMAC) units. There are no other execution units on each tile, so all arithmetic operations must be carried out through these FPMACs. This obviously limits the applications the teraflops chip can be used in, but it also supports the idea that the point of this chip isn't to break speed barriers, but rather to develop a framework for introducing other, more capable processors with many cores. The real focus here isn't the floating point throughput of the array of tiles; the primary objective is the network that connects the tiles together.
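The headline number falls straight out of these specs, since each multiply-accumulate counts as two floating point operations:

    # Peak throughput = tiles * FPMAC units per tile * 2 ops per MAC * clock (GHz).
    def peak_gflops(tiles, fpmacs_per_tile, ghz):
        return tiles * fpmacs_per_tile * 2 * ghz

    print(peak_gflops(80, 2, 3.13))  # ~1002 GFLOPS, hence the 1 teraflop claim
    print(peak_gflops(80, 2, 4.0))   # 1280 GFLOPS, the 1.28 TFLOPS figure at 4GHz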




Page 4: The Network

Each of the 80 tiles in the teraflops chip is identical, which helps simplify design and manufacturing. As we mentioned earlier, each tile has two primary components: the Processing Engine (PE) and a 5-port router.

The router on each tile is used to pass data and instructions between tiles in the network. A tile passing data along doesn't even have to work on that data; it can simply be used for its router and not its PE. As such, the PE on a tile can be powered down independently of its router to save power.

The router on each tile features five 39-bit ports that together offer a total bandwidth of 80GB/s when the chip operates at 4GHz; the data buses are double pumped. Of the 3mm^2 tile die area, only 0.34mm^2 is used for the router, making it reasonable to have 80 of them on a single chip.
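As a rough consistency check on that figure, assuming roughly 32 of the 39 bits per port carry payload and the rest are control/sideband (an assumption on our part, not something Intel spelled out):

    # Back-of-the-envelope check on the 80GB/s router figure.
    ports, payload_bits, ghz = 5, 32, 4.0       # payload width is an assumption
    total_gbs = ports * payload_bits / 8 * ghz
    print(total_gbs)                            # 80.0 GB/s across all five ports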

Four of the five ports are used for connecting to other tiles, as you can see in the slide below:

The fifth port is used for connecting to stacked memory, which Intel tells us we'll be hearing about in another quarter or so. For heat reasons the stacked memory will actually be mounted below the teraflop processor die.

The main attraction of the router and the network layout of the chip in general is that the PE can be replaced by anything, including an x86 core or a special purpose core (e.g. a DSP or a hardware encryption engine). Instead of a network of 80 tiles, you can imagine one with maybe 12, six of which are general purpose x86 cores while the rest are specialized cores to handle things like 3D rendering, TCP/IP offload, encoding and so on. The router network works for 80 cores, and making it work for any other number of cores, whether fewer or more, is trivial.
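Because the tiles sit on a plain 2D mesh, the routing decision at each hop can stay very simple. Below is a minimal dimension-order (XY) routing sketch, one common way such a mesh can be routed; the coordinates and port names are invented for illustration, and this isn't necessarily the scheme Intel's router uses.

    # Minimal XY (dimension-order) routing on a 2D mesh: travel along X to the
    # destination column first, then along Y to the destination row.
    def next_port(cur, dst):
        (cx, cy), (tx, ty) = cur, dst
        if tx > cx: return "east"
        if tx < cx: return "west"
        if ty > cy: return "south"
        if ty < cy: return "north"
        return "local"                      # fifth port: deliver at this tile

    # Route a packet from tile (0, 0) to tile (7, 9) on the 8x10 grid.
    hop, path = (0, 0), []
    while True:
        port = next_port(hop, (7, 9))
        path.append(port)
        if port == "local":
            break
        hop = (hop[0] + {"east": 1, "west": -1}.get(port, 0),
               hop[1] + {"south": 1, "north": -1}.get(port, 0))
    print(path)                             # seven hops east, nine hops south, then 'local'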




Page 5: Clocks and Power Management

In a modern-day microprocessor, making sure the clock signal arrives at the same time across all parts of the chip can be a difficult task for a designer, especially as CPU frequencies and chip area both increase. However, it's a necessary part of chip design, as the clock needs to arrive at all parts of the chip within tight parameters in order for the CPU to behave normally. Intel tells us that in modern-day microprocessors, clock distribution is responsible for approximately 30% of a chip's total power consumption, so any power savings you can make here will be significant.

The teraflop chip, however, isn't a conventional chip; because each tile is independent, the clock only needs to arrive at all parts of a 3mm^2 tile at the same time, not across the entire 275mm^2 chip. With this in mind, Intel designed the teraflop chip to allow the clock to arrive at individual tiles out of phase. This approach means tile-to-tile communication may end up being a bit slower than it could be, but the power savings are tremendous. Intel estimates that the power required to distribute the clock to all of the tiles on the teraflop chip at 4GHz is 2.2W, or 1.2% of total power consumption under load.
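That 1.2% figure checks out against the 4GHz/181W operating point quoted earlier:

    # Clock distribution power as a share of total power at the 4GHz / 1.2V point.
    clock_w, total_w = 2.2, 181
    print(f"{clock_w / total_w * 100:.1f}%")   # ~1.2%, versus ~30% in a conventional CPU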

Obviously, if we had a network of more complicated cores, distributing the clock within a larger, more complex tile would require more power than this, but the takeaway is that in a network like this you can simplify overall chip clock distribution by only worrying about the clock within a tile.

Clock management isn't the only area where Intel looked to save power, as the teraflop chip architecture is very power conscious in its design. Each tile is divided up into 21 individual sleep regions that can be powered down independently depending on the type of instruction being executed, not to mention that individual tiles can be powered down independently of one another. And as we mentioned before, the PE and router on each tile can be powered down independently.


Within the router itself, each one of the five ports can be powered down independently as well. With 80 cores, the teraflop chip can also redistribute load according to thermal needs. If a handful of tiles are getting too hot, the chip can dynamically wake up a different set of tiles to begin working in order to avoid creating hotspots.
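Conceptually, that kind of hotspot avoidance boils down to parking the hottest tiles and waking the coolest idle ones. The toy policy below is our own illustration; the threshold and selection logic are invented, not Intel's.

    # Toy hotspot-avoidance policy: move work off tiles above a thermal threshold
    # onto the coolest sleeping tiles.  Threshold and policy are illustrative only.
    HOT_C = 85.0

    def rebalance(tiles):
        """tiles: dict of tile_id -> {'temp': degrees C, 'active': bool}."""
        hot    = [t for t, s in tiles.items() if s["active"] and s["temp"] > HOT_C]
        spares = sorted((t for t, s in tiles.items() if not s["active"]),
                        key=lambda t: tiles[t]["temp"])
        for overheated, spare in zip(hot, spares):
            tiles[overheated]["active"] = False   # put the hot tile's PE to sleep
            tiles[spare]["active"] = True         # wake a cool tile to take over the work
        return tiles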


The FPMACs remain in sleep mode until they are needed, so there's an additional latency penalty when waking them up, but it prevents power consumption from spiking as soon as there's load, which can help simplify power delivery and other elements of the chip as well. Approximately 90% of the FPMAC logic and 74% of each PE overall use sleep transistors to help reduce power consumption, as described above. Intel states that sleep transistors take up, on average, 5.4% more die area than regular transistors and come with a 4% frequency penalty, but the power savings are worth it. Sleep transistors are used in other Intel processors, including the Core 2 family.




Page 6: Final Words

As Cell-like as Intel's Teraflop Research Chip may be, it's never going to be a product. In fact, there's no rhyme or reason to choosing 80 cores as the magic number; it was simply the number of cores Intel was able to cram into the design given the target die size. At the same time, the teraflop chip isn't here to set any performance records either; its sole purpose is as a test vehicle for the tiled network approach, for power management with many cores, for 3D stacked memory, and, as a whole, for the future of multi-core processors.


Intel stated that the two next steps for this research chip are to introduce a 3D stacked die, on which we suspect Intel has already made significant progress, and to introduce some more capable cores.
Intel's goal with this chip isn't to build an FP monster, but rather a fast general purpose chip. Chances are that what this chip evolves into won't be an array of 80 general purpose processors, but maybe an 8 or 16 core chip complete with x86 cores, specialized hardware, and 3D stacked memory.


What's the time to fruition for these technologies? Intel estimates 5 to 10 years before we see the tangible benefits of this research in products. By the end of the decade we can expect quad-core to be mainstream on the desktop, so by 2015 many of the many-core problems we recapped here will be a reality. The Tera-scale research we've seen today will hopefully mitigate those problems, leading to a much smoother transition to the era of tera.

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2925
2#
Posted on 2007-2-12 20:07
There's no flame war to be had here, so...

3#
Posted on 2007-2-12 20:16
Too long; even in Chinese it would be too long...

4#
Posted on 2007-2-12 20:17
Such a long wall of foreign text, I can't make sense of it!

5#
Posted on 2007-2-12 20:29
