Heterogeneous Integration
The graphics integration in Sandy Bridge is particularly novel as Intel is sharing the LLC with the GPU. The driver allocates regions of the cache at way granularity (128KB) – and can actually request the whole cache. Each thread can spill 32KB of data back to the LLC, for a total of nearly 2MB in the larger 12 shader core variants. Almost any GPU data can be held in the LLC, including vertices, textures and many other types of state.
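The "nearly 2MB" figure can be checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a description of the driver: the 32KB per thread, the 12 shader cores and the 128KB way size come from the text, while the count of 5 hardware threads per shader core is an assumption that this section does not state.

```python
# Back-of-the-envelope check of the LLC spill capacity described above.
SPILL_PER_THREAD_KB = 32   # from the text
SHADER_CORES = 12          # from the text (larger variant)
WAY_SIZE_KB = 128          # from the text (allocation granularity)
THREADS_PER_CORE = 5       # ASSUMPTION: not stated in this section


def total_spill_kb(cores=SHADER_CORES,
                   threads_per_core=THREADS_PER_CORE,
                   spill_kb=SPILL_PER_THREAD_KB):
    """Total data that all GPU threads together could spill to the LLC, in KB."""
    return cores * threads_per_core * spill_kb


def ways_needed(total_kb, way_kb=WAY_SIZE_KB):
    """Number of whole cache ways required to hold total_kb (rounded up)."""
    return -(-total_kb // way_kb)


if __name__ == "__main__":
    total = total_spill_kb()
    print(f"total spill: {total} KB ({total / 1024:.3f} MB)")
    print(f"ways needed at {WAY_SIZE_KB} KB each: {ways_needed(total)}")
```

Under that assumption the total is 1920KB, which is consistent with "nearly 2MB" and would occupy 15 ways at 128KB each.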
The Sandy Bridge LLC and ring interconnect can rapidly pass data from the GPU back to the CPU. AMD’s Fusion has a far higher performance GPU, but on that platform this particular style of communication is discouraged. Since the GPU has a weaker ordering model, a flush command is needed to force data to be written back to the LLC prior to the CPU reading it. The driver can also allocate a portion of the LLC as a non-coherent cache for display data and other uses. For example, the results of transcoding might be written out to the non-coherent region.
While this excellent system integration promises many benefits, at present it is restricted mainly to multimedia workloads. For graphics, it is largely an academic advantage to any but Intel’s driver team. The GPU is exposed through graphics APIs; yet neither OpenGL nor DirectX programs can interact with coherent memory and bypass I/O copies (let alone use the LLC). AMD has introduced an OpenCL extension for a zero copy mechanism on Windows systems already, and presumably Intel will follow once they have OpenCL and DirectCompute capable hardware. Intel’s graphics driver can take advantage of fast CPU/GPU communication, but that is only because it has raw access to the GPU hardware. These advances pave the way for Ivy Bridge and certainly promise good things in the future, but also serve to point out some of the deficiencies in the current generation.
The power controller (PCU) resides in the system agent (along with the DDR3 memory controller) and manages the frequency and voltage for different regions on the chip. The cores, ring interconnect and LLC are all on a single power and frequency plane, although each can be individually power gated. The GPU resides on a different power and frequency domain, as does the system agent. However, the power budget for Sandy Bridge is managed in a unified manner. So if the CPU cores are idle, the power and thermal headroom can be used by the GPU for turbo mode (and vice versa). As a result, the GPU base frequency is a relatively normal 850MHz for a high-end part. But the peak clock speed is an impressive 1.35GHz for the entire GPU, including the command streamer, setup engine and ROPs.
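The idea of a unified power budget can be illustrated with a toy model. This is a sketch only: the linear scaling policy and every wattage figure are hypothetical and are not Intel’s actual PCU algorithm. Only the 850MHz base and 1.35GHz peak clocks come from the text.

```python
# Toy model of a shared package power budget (NOT Intel's PCU algorithm).
GPU_BASE_MHZ = 850    # from the text
GPU_PEAK_MHZ = 1350   # from the text


def gpu_clock_mhz(package_budget_w, cpu_draw_w, gpu_base_w, gpu_peak_w):
    """Hypothetical policy: whatever power the CPU leaves unused is offered
    to the GPU, and the GPU clock scales linearly between base and peak as
    the available power rises from gpu_base_w to gpu_peak_w."""
    headroom_w = package_budget_w - cpu_draw_w
    if headroom_w <= gpu_base_w:
        return float(GPU_BASE_MHZ)
    fraction = min(1.0, (headroom_w - gpu_base_w) / (gpu_peak_w - gpu_base_w))
    return GPU_BASE_MHZ + fraction * (GPU_PEAK_MHZ - GPU_BASE_MHZ)


if __name__ == "__main__":
    # All wattages below are made-up illustration values.
    for cpu_draw in (85, 80, 20):
        clock = gpu_clock_mhz(package_budget_w=95, cpu_draw_w=cpu_draw,
                              gpu_base_w=10, gpu_peak_w=20)
        print(f"CPU draws {cpu_draw} W -> GPU clock {clock:.0f} MHz")
```

In this model a busy CPU pins the GPU at its base clock, while an idle CPU lets the GPU reach its peak; the same reasoning applies in reverse for CPU turbo.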
There is a downside to this arrangement, though. The GPU is so tightly integrated into the system that it relies on the ring interconnect, LLC and memory controller for operation. But the frequency for the ring and LLC is determined by the cores, which may be running at a lower P-state than the GPU. For example, if the CPU cores are lightly loaded and the PCU switches them to a reduced voltage and frequency, then the bandwidth across the ring decreases as well, which hurts the performance of the GPU. This will almost certainly be fixed in Ivy Bridge by separating out the clock trees for the ring.
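Because the ring is clocked with the cores, its peak bandwidth scales directly with the core P-state. A minimal sketch of that relationship follows; the 32 bytes per cycle per ring stop and both clock frequencies are assumptions chosen for illustration and are not taken from this section.

```python
# Peak ring bandwidth as a function of the shared core/ring clock.
RING_BYTES_PER_CYCLE = 32  # ASSUMPTION: width per ring stop, not stated here


def ring_bandwidth_gbs(ring_mhz, bytes_per_cycle=RING_BYTES_PER_CYCLE):
    """Peak bandwidth per ring stop in GB/s (10^9 bytes/s) at a given clock."""
    return bytes_per_cycle * ring_mhz / 1000.0


def bandwidth_retained(low_mhz, high_mhz):
    """Fraction of peak ring bandwidth left after dropping to a lower P-state."""
    return ring_bandwidth_gbs(low_mhz) / ring_bandwidth_gbs(high_mhz)


if __name__ == "__main__":
    high_mhz, low_mhz = 3400, 1600  # hypothetical P-state frequencies
    print(f"{high_mhz} MHz: {ring_bandwidth_gbs(high_mhz):.1f} GB/s per stop")
    print(f"{low_mhz} MHz: {ring_bandwidth_gbs(low_mhz):.1f} GB/s per stop")
    print(f"retained: {bandwidth_retained(low_mhz, high_mhz):.0%}")
```

With these illustrative numbers, a GPU running at full turbo would see less than half of the ring bandwidth it would get with the cores at their top frequency, which is the effect described above.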
Conclusions
The Sandy Bridge Gen 6 graphics is a huge improvement for Intel and the PC ecosystem. It is the first graphics product that has taken advantage of Intel’s core competency in semiconductor manufacturing, using their cutting edge 32nm process technology. The system integration in Sandy Bridge is quite advanced and is a roadmap for the rest of the industry, namely AMD. In particular, it is clear that sharing the last level cache and unified power management are hugely beneficial to performance and power efficiency.
The graphics performance is good and overall seems to be about a 2X improvement over the Ironlake generation, which puts many games above the 30 frames/second mark that is key for playability. As earlier reviews showed, the performance is actually better than some entry level discrete graphics cards and with better power consumption. In practice the performance depends on factors such as the number of shader cores, size of the LLC and frequencies, all of which vary considerably from model to model. While the hardware is impressive, there are still glaring software deficiencies. The texture filtering seems to be lower quality than AMD’s and Nvidia’s implementations. Gen 6 also does not support OpenCL or DirectX 11; this is understandable given the development schedule, but still a weakness.
Competitively, Sandy Bridge’s graphics was quite impressive at introduction in January 2011, but lags behind AMD’s Llano, which was launched in the middle of 2011. The Llano GPU is essentially twice as fast, which is not surprising given that it is also twice the die area. Moreover, Llano has full support for OpenCL 1.1 and DX11.
However, the multimedia capabilities of Sandy Bridge are industry leading. The video encoding performance is 7-9X higher than Ironlake or Llano, largely due to the fixed function hardware in the GPU. Additionally, these capabilities are accessible to third-party software developers through Intel’s Media SDK and have been adopted in quite a few applications.
Overall, Sandy Bridge’s GPU is a welcome step forward for Intel, but a mixed bag. It is the best solution for multimedia, which is arguably the most common GPU workload. However, it does not truly match AMD’s graphics; rather it narrows the gap significantly. It will fall to future generations, such as the 22nm Ivy Bridge and Haswell, to further close the gap both in terms of hardware and software.
The overall state of the industry and some of the deficiencies in Sandy Bridge hint at improvements for Ivy Bridge. Ivy Bridge will have a totally redesigned GPU with OpenCL, DirectCompute and DX11 support and possibly extensions for task level parallelism. The number of shader cores will increase, and the cores will probably be redesigned with some sort of shared memory or cache and greater execution resources. The floating point support will hopefully improve to full IEEE 754 compliance. The clocking of the ring interconnect will likely be independent of the CPU cores, so that the last level cache can run at peak performance for the GPU. Video encoding will probably be improved as well, perhaps with more fixed function hardware and broader codec support.
However, the most important area for improvement is Intel’s software environment, including both drivers and the overall programming model. A good software ecosystem is critical to efficiently leveraging heterogeneous resources. In particular, approaches that empower developers to easily share data (rather than explicitly copying) are ideal for performance and power efficiency. Ivy Bridge will undoubtedly be disclosed at IDF later this year, giving everyone plenty of time to ponder the changes before the first products arrive in 2012.