
An In-Depth Introduction to the Intel Sandy Bridge (SNB) Graphics Core (GT2) Architecture

Posted 2011-8-10 02:01
http://realworldtech.com/page.cfm?ArticleID=RWT080811195102

Dedicated Hardware

One of the largest improvements in Sandy Bridge is the move towards fixed function hardware, in contrast to the previous generation’s reliance on software.  Dedicated hardware has substantially better area and power efficiency than a software implementation that executes on programmable hardware.  For example, a benchmark at the Tech Report showed that Intel’s AES hardware boosts performance by nearly a factor of 8X on largely similar CPUs.

Intel’s architects also claim that moving algorithms and workloads from software to hardware removes considerable amounts of code from their graphics drivers.  This means a marginally faster and more efficient driver, but more importantly, it simplifies software development and testing by eliminating complexity.  Of course, the hardware must still be tested and validated, but that is frankly a more tractable and familiar problem to Intel than writing high performance drivers.

Dedicated hardware is not without costs and trade-offs though.  One subtle drawback is that on-chip buffering is necessary – and must be sized based on the performance of the overall pipeline and fixed function blocks.  Intel’s graphics architecture stores data for the fixed function units in the Unified Return Buffer (URB), which is shared across the whole chip for efficiency.


Figure 3 – Sandy Bridge and Ironlake GPU Overview


As shown in Figure 3, the previous generation uses software threads to assist with the clipping and setup stages of the 3D pipeline.  Ironlake’s clipping test is done in hardware for common cases, but more complex testing requires software.  To actually clip a vertex, the GPU will spawn a clipping thread that is dispatched to the shader array. The Sandy Bridge clipper is much more powerful and handles both testing and actual vertex clipping.  This dedicated hardware replaces and eliminates any software clipping.

The setup stage is responsible for assembling clipped vertices into 3D objects for rasterization.  Ironlake spawns setup threads that calculate attribute interpolation for Z and 1/W, while hardware interpolates X & Y.  Sandy Bridge moves the setup phase entirely into fixed functions, again reducing the burden on the shader array and drivers.

Sandy Bridge can set up and rasterize a triangle every 4 clock cycles.  This is fairly slow compared to AMD and Nvidia GPUs, which typically rasterize 1 triangle per clock, although newer models can achieve 2-4 triangles/cycle.  In reality, the Gen 6 rasterizer runs at roughly twice the clock frequency of discrete GPUs, so the performance gap is smaller than it appears.
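
The effective setup rate is just clock frequency divided by cycles per triangle. A small sketch of that arithmetic, using the 1.35GHz Gen 6 turbo clock from later in the article and a purely illustrative 700MHz clock for a hypothetical 1 triangle/clock discrete GPU:

```python
def triangles_per_second(clock_hz, cycles_per_triangle):
    # Effective setup/rasterization rate in triangles per second.
    return clock_hz / cycles_per_triangle

# Gen 6 at its 1.35GHz turbo clock, one triangle every 4 cycles:
gen6 = triangles_per_second(1.35e9, 4)       # 337.5M triangles/s
# Hypothetical discrete GPU: 700MHz, one triangle per cycle:
discrete = triangles_per_second(700e6, 1)    # 700M triangles/s

print(gen6, discrete)
```

Even with the roughly 2X clock advantage, the 4-cycle setup rate leaves Gen 6 behind a 1 triangle/clock design, just by a smaller margin than the cycle counts alone suggest.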

Intel also added substantial fixed function hardware for media decoding and encoding.  The Multi-Format Codec (MFX) in Sandy Bridge has full hardware decoding for MPEG2, VC1 and AVC to reduce power consumption, whereas Ironlake performed motion compensation and deblocking filtering in software.  Encoding of AVC (H.264) is fully accelerated, although this uses a combination of fixed function hardware and the programmable shader cores.

GPU Cores

The Sandy Bridge shader array has 12 cores for the high-end GT2, and 6 cores for the GT1 variant.  The shader array is organized into rows of 3 cores, and each row shares an L1 instruction cache.  The older Gen 5 design also shares a transcendental math unit between each row.  Collectively the entire GPU shares an L2 instruction cache, the URB, a texture sampling pipeline and a raster output pipeline.  As previously described, the shader cores are very flexible and can execute in either a SIMD or scalar fashion.

While Sandy Bridge has the same number of cores as Ironlake (12), the microarchitectures are substantially different.  The newer cores have a much more powerful instruction set, more resources and better access to special purpose hardware.  Overall the performance per core is roughly double.  This is yet another example of how the number of cores in a GPU (or CPU) is a misleading and nearly irrelevant metric.

The thread dispatcher is responsible for sending various types of threads (vertex, geometry, pixel, media) to the programmable graphics cores for execution.  As with all GPUs, the cores are multi-threaded to hide latency.  Sandy Bridge’s GT2 cores can have 5 threads in-flight at once, for a total of 60 threads across the GPU.  The lower-end GT1 cores are limited to 4 threads each, with half the number of cores overall.  The older Ironlake design actually supports 6 threads, but Intel reduced this for Sandy Bridge because they moved all clipping and setup from programmable threads into fixed function hardware.  

Threads are primarily instantiated by fixed function hardware, although media threads can also spawn child threads.  The data needed to start a thread is typically buffered in the URB and then sent to the thread dispatcher.  Thread readiness is based on input and output requirements and resource availability, including constants, the URB, scratch space and the actual shader cores.  For example, two vertices are usually required to dispatch a vertex thread, and a geometry thread must have all vertices available.  Similarly, pixel threads are dispatched to shade 2, 4 or 8 pixel quads.

All threads are classified as high or low priority, and the dispatcher uses a round-robin algorithm within each priority class.  When a thread is selected, the dispatcher will assign it to a core and also send the input data which is copied into registers.  Upon completion, threads will send a termination message to the dispatcher.
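The two-class, round-robin dispatch policy described above can be sketched as follows. This is a toy model, not Intel's implementation: the thread names are invented, and FIFO rotation within a deque stands in for round-robin selection within a priority class.

```python
from collections import deque

class ThreadDispatcher:
    """Toy model of a two-priority-class thread dispatcher:
    high-priority threads always dispatch before low-priority ones,
    and within a class threads are selected in rotation."""
    def __init__(self):
        self.queues = {"high": deque(), "low": deque()}

    def submit(self, thread_id, priority="low"):
        self.queues[priority].append(thread_id)

    def dispatch(self):
        # Drain the high-priority class first, then fall back to low.
        for prio in ("high", "low"):
            if self.queues[prio]:
                return self.queues[prio].popleft()
        return None

d = ThreadDispatcher()
d.submit("pixel-0")
d.submit("vertex-0", priority="high")
d.submit("pixel-1")
order = [d.dispatch() for _ in range(3)]
print(order)  # the high-priority vertex thread dispatches first
```

A real dispatcher would also check input/output readiness and core availability before selecting a thread, as the preceding paragraph describes.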

Shader Front-end

The thread scheduling within a core is primarily hardware managed. The highest priority thread with a ready instruction is sent down the pipeline and can execute for several cycles. A thread will stall if an instruction is still waiting for operands and will be switched out. There is a certain degree of software management though. An instruction can force a thread switch upon completion, and atomic instructions will always have the highest priority.

One of the major enhancements to Gen 6 is better control flow. The Sandy Bridge graphics cores have new instructions and native support for while loops, calls, returns, indexed jumps and case statements. Additionally, there is an instruction pointer for each channel and infinite nesting capabilities for recursion. In Ironlake, infinite nesting required software assistance, which dramatically reduced performance.



Figure 4 – Shader Front-end Comparison


The 4KB L1 instruction cache is shared by a row of 3 cores, to improve re-use and exploit locality. The L1 caches are backed by a single shared 24KB L2 instruction cache. The instruction caches use 64B lines that contain 4 fixed length instructions and do not support self-modifying code. Presumably there is an instruction fetch buffer in each core that can hold 4 instructions per thread.

Instruction Set

The Gen 5 & 6 instruction set is fairly complex and quite powerful. The overall vector length of a single instruction is the (vector width * channels) and is limited by the register size. There are also compressed instructions, which operate on twice as many data elements and are decoded into two native instructions. Conditional execution is achieved using predication, destination masks and execution masks.

To improve efficiency and flexibility, the operands for instructions do not need to be aligned within the register file. Instructions can use region register addressing for operands, which is essentially a 2-dimensional strided gather within the register file that can span multiple physical registers. This is particularly handy for avoiding packing and unpacking of data structures or working with media data that has irregular alignment. Additionally, indirect register operands are available, where a separate address register and an offset indicate the location of the actual input operand.
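Region register addressing can be modeled as a 2D strided gather over a flat register file. The parameter names below (origin, vertical stride, width, horizontal stride) loosely mirror the concepts in the text, but the code is an illustrative sketch, not Gen assembly semantics:

```python
def region_gather(regfile, origin, vstride, width, hstride, rows):
    """Model region register addressing as a 2D strided gather:
    starting at `origin`, read `rows` rows of `width` elements,
    stepping `hstride` between elements within a row and `vstride`
    between rows.  `regfile` is a flat list standing in for the
    register file, so a region can span 'physical registers'."""
    out = []
    for r in range(rows):
        base = origin + r * vstride
        out.extend(regfile[base + c * hstride] for c in range(width))
    return out

rf = list(range(32))   # flat stand-in register file
# Gather 2 rows of 4 elements: row stride 8, element stride 2.
region = region_gather(rf, origin=1, vstride=8, width=4, hstride=2, rows=2)
print(region)  # picks every other element from two 'rows'
```

This is exactly the kind of access that lets the hardware consume irregularly packed media data without explicit pack/unpack instructions.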

Instructions access registers in either a 16B or 1B aligned mode. The 16B aligned mode is intended for 4-component data (RGBA) packed into 16B blocks. This mode has full source swizzling and destination masking for operands, but limited register regions since any accesses must be 16B aligned within the register file. The 1B aligned mode is targeted for SOA execution. Operands must be aligned to their natural data type within the register file (down to 1B) and can use the full register region addressing capabilities, however, source swizzling and destination masking are disabled. Together these two modes mean that the GPU is mostly indifferent to the data formatting and can easily transpose a data structure between SOA and AOS.
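The AOS/SOA transpose that these two modes make cheap is easy to show in miniature. A minimal sketch (the pixel values are arbitrary examples):

```python
def aos_to_soa(pixels):
    """Transpose array-of-structures RGBA pixels into four
    structure-of-arrays channel lists.  The 16B-aligned mode maps
    naturally onto the AOS form; the 1B-aligned mode with region
    addressing maps onto the SOA form."""
    return [list(chan) for chan in zip(*pixels)]

aos = [(10, 20, 30, 255), (11, 21, 31, 255)]   # two RGBA pixels
r, g, b, a = aos_to_soa(aos)
print(r, g, b, a)   # per-channel lists, ready for SOA execution
```

In software this transpose costs a pass over the data; on Gen hardware the region addressing performs the equivalent gather as part of operand fetch.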

The majority of instructions are 3-operand (two sources and a destination) with source swizzling and destination masking, although some have an implied fourth operand. Both inputs can use region addressing, but only the first source operand can use indirect addressing. A few instructions, such as the multiply-add, have 3 inputs, but they are required to use 16B alignment.

Registers

Registers in the Gen 5 & 6 architecture are all 256-bits (32B) wide, and most operations expect similar sized chunks of data. The register size is related to the different execution modes discussed previously. When using single precision floating point data, each input operand for SIMD1x8 instructions perfectly corresponds to a single register. The longer SIMD1x16 instructions are compressed and will essentially decode into two separate instructions with a register for each input operand. A Gen 6 core contains a general purpose register file (GRF), a message register file (MRF) and an architecture register file (ARF, not shown).

The Gen 6 GRF is 640 entries for a total of 20KB of data and is used for computation by each thread. Each of 5 threads will allocate 128 entries when it is dispatched to a core – unlike AMD and Nvidia, threads do not have a variable number of registers. The thread can freely read and write to the GRF and spill to a 32KB region of memory that is held in Sandy Bridge’s L3 cache. Each register holds multiple values, and the Gen 6 architecture natively supports sub-register accesses as small as 1 byte and up to 4B or 32-bits; there is no double precision currently. The register file is physically split into odd and even banks that can be accessed in parallel for high bandwidth; every cycle a bank can read a register and write a register.
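The GRF sizing figures above are internally consistent, and the arithmetic is worth spelling out:

```python
ENTRY_BYTES = 32          # each register is 256 bits = 32 bytes
GRF_ENTRIES = 640
THREADS_PER_CORE = 5      # GT2; the GT1 cores hold 4 threads
ENTRIES_PER_THREAD = 128  # fixed allocation per dispatched thread

grf_bytes = GRF_ENTRIES * ENTRY_BYTES
print(grf_bytes // 1024)  # 20KB of register file per core

# Five threads at 128 entries each exactly fill the 640-entry GRF:
print(THREADS_PER_CORE * ENTRIES_PER_THREAD)
```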

The other two register files are special purpose. The MRF is used by the messaging framework to communicate with other cores and fixed function blocks in the GPU. It contains 24 registers (0.75KB) per thread and is a write-only structure; each thread writes messages into the MRF that will be sent to other parts of the GPU. When a message is sent or returned to a core, the contents will actually be written into the GRF so that the data can be subsequently read by the receiving thread. This technique is used to pass data between threads on the cores, and also to initialize data values in the GRF when a thread is first dispatched. A thread may have multiple messages queued up or in-flight at any time.

The ARF contains a variety of registers that are used for managing and controlling the threads in a core. This includes registers that hold the instruction pointer (IP), thread priority and dependency information, notifications from the messaging framework and flags for control flow, per-channel IPs and exceptions. The ARF also includes 2 address registers which are used for indirect register addressing and 2 accumulator registers for higher precision operations.

Shader Back-end

With all the complex register access in the Gen architecture, scoreboarding is necessary to avoid destination hazards.  If two instructions could write to the same destination, the scoreboard will stall the second instruction until the first has safely finished.  Many of these stalls are not necessary, e.g. two instructions could write to different portions of a register, or one instruction might get masked off.  The driver software can actually override the scoreboarding, although this may produce incorrect results.
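A destination-hazard scoreboard of this kind can be sketched in a few lines. This is a deliberately minimal model: it tracks whole registers only, which is exactly why, as noted above, it stalls conservatively even when two writes touch different portions of a register.

```python
class Scoreboard:
    """Minimal destination-hazard scoreboard: an instruction stalls
    if its destination register is still pending from an earlier,
    unretired instruction.  Whole-register granularity, so some
    stalls are unnecessary (e.g. writes to disjoint bytes)."""
    def __init__(self):
        self.pending = set()

    def issue(self, dst):
        if dst in self.pending:
            return "stall"        # earlier write to dst not finished
        self.pending.add(dst)
        return "issue"

    def retire(self, dst):
        self.pending.discard(dst)

sb = Scoreboard()
print(sb.issue("r3"))   # issues: r3 was free
print(sb.issue("r3"))   # stalls: r3 still pending
sb.retire("r3")
print(sb.issue("r3"))   # issues again after retirement
```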

Data Types

Sandy Bridge and Ironlake have a variety of data types.  The integer data types are fairly straightforward including byte, word (16-bit), double word (32-bit) and the slightly unusual half-byte and 32-bit packed half-byte.  Floating point data is available in single precision, restricted (8-bit) and a 32-bit packed restricted form.

The GPU has two FP modes – IEEE and ‘alternate’.  The IEEE mode is partially compliant with the relevant standards and has NaNs, infinities and denormals. However, there are some deviations including rounding behavior and denorm handling.  The alternate mode is graphics-specific and does not have NaNs, infinities or denormals.  Extremely large (or small) numbers will saturate at the maximum (or minimum) that can be represented.  To avoid any NaNs, special functions such as log, reciprocal square root and square root will take the absolute value of any input to guarantee good behavior.  The advantage of alternate mode is higher performance and greater freedom in software optimization.  For example, in alternate FP, multiplying by 0 is always 0 – that is not true for IEEE.
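The properties of alternate mode can be emulated to make the contrast with IEEE concrete. This sketch models only the behaviors described above (zero-times-anything, saturation, absolute-value inputs to special functions); the actual hardware semantics are more involved:

```python
import math

FMAX = 3.4028235e38   # approximately the largest FP32 value

def alt_mul(a, b):
    """'Alternate'-mode multiply sketch: no NaN or infinity,
    0 * anything = 0, and results saturate at +/-FMAX."""
    if a == 0.0 or b == 0.0:
        return 0.0
    r = a * b
    if math.isinf(r) or abs(r) > FMAX:
        return math.copysign(FMAX, r)
    return r

def alt_sqrt(x):
    # Special functions take |input| so they never produce NaN.
    return math.sqrt(abs(x))

print(alt_mul(0.0, float("inf")))   # 0.0; IEEE would give NaN
print(alt_sqrt(-4.0))               # 2.0; IEEE would give NaN
```

The zero-times-anything rule is precisely what gives compilers the extra freedom mentioned above, e.g. folding `0 * x` to `0` without proving `x` is finite.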

Execution Units

The execution units in each Gen 6 core have been substantially beefed up compared to the Ironlake generation.  Both Sandy Bridge and Ironlake cores have a 128-bit wide vector execution unit that natively executes eight 16-bit or four 32-bit operations per clock cycle.  While data can be stored in 4-bit or 8-bit formats for compression, it is expanded to 16-bits for actual execution and has similar throughput.  The shader cores also execute media operations such as sum-of-absolute-differences in the vector pipeline.  



Figure 5 – Shader Back-end Comparison


The older Ironlake core does not have any multiply-accumulate or multiply-add instructions, so the peak throughput is 4 SP FLOP/cycle.  The Gen 6 vector execution unit has both multiply-add and multiply-accumulate.  The latter implicitly uses a high precision accumulator in the ARF for each channel of execution.  The peak throughput for the Gen 6 GPU is 129.6 GFLOP/s (using the turbo frequency of 1.35GHz), compared to 43.2 GFLOP/s for Ironlake (turbo frequency of 0.9GHz).  The Gen 6 core added a couple of new instructions, plane equation and linear interpolation, which are fairly common in graphics and were previously synthesized in software rather than directly executed.
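The peak FLOP figures follow directly from core count, per-core throughput and clock. A quick check, using 8 FLOP/cycle/core for Gen 6 (a 4-wide FP32 multiply-add counted as two operations) and 4 FLOP/cycle/core for Ironlake (no multiply-add):

```python
def gflops(cores, flops_per_core_per_cycle, clock_ghz):
    # Peak single-precision throughput in GFLOP/s.
    return cores * flops_per_core_per_cycle * clock_ghz

# Gen 6: 12 cores x 8 FLOP/cycle x 1.35GHz turbo:
print(gflops(12, 8, 1.35))   # matches the 129.6 GFLOP/s figure
# Ironlake: 12 cores x 4 FLOP/cycle x 0.9GHz turbo:
print(gflops(12, 4, 0.9))    # matches the 43.2 GFLOP/s figure
```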

The Gen instructions are variable length vectors and typically longer than the hardware’s execution resources.  The longest uncompressed instruction is 8x32b operations (for the SIMD1x8 mode), which takes two cycles to execute.  A compressed instruction can take 4 cycles to execute on 16 data items.  This multi-cycle execution is similar to the behavior of an AMD wavefront, with the added twist that the instruction latency is non-uniform, which complicates scheduling slightly.

Just as importantly as the improvements in the vector unit, Sandy Bridge has dramatically improved performance for special math functions such as transcendentals.  Previously, Ironlake shared a single 32-bit math unit between an entire row (3 cores).  The math instructions included inverse, log, square root, reciprocal square root, exponentiation, power, sine, cosine and integer divide.  These instructions were sent through the messaging framework to the math unit and most took 22-88 cycles per data element.  The more complicated trigonometric instructions typically took 132 clocks, but could be as high as 264 cycles for each data element.

The Gen 6 core has a 32-bit dedicated math unit with a new floating point divide instruction.  The math unit is also faster for some transcendental instructions than the previous generation, particularly the trigonometric ones.  Threads can issue one instruction to the math unit or the vector unit; however, the latency of most math instructions is fairly long, so the execution is still mostly simultaneous.  

Sandy Bridge has a much more balanced ratio of arithmetic and transcendental execution units.  The most common use of special function is applying some scale factor to an entire vertex or pixel (e.g. normalizing or rotating an object).  AMD GPUs have a roughly 4:1 ratio, while Nvidia’s vary between 4:1 and 8:1.  In contrast, Ironlake’s 12:1 ratio seems too high for good performance, while Sandy Bridge should be just right.

Texture Sampling Pipeline

One of the big differences between Intel’s graphics and the high powered architectures from AMD and Nvidia is the memory pipeline.  To span the entire graphics market, the memory hierarchy must scale with the number of shader cores; roughly a factor of 12X.  However, Intel’s target market is much narrower – Sandy Bridge graphics only has two models with ~2X difference in performance.

Intel’s architects designed a shared memory pipeline that entirely sits outside the cores, in contrast to the approach taken by AMD and Nvidia.  The memory hierarchy for Gen 5 and Gen 6 graphics is accessed by the shader cores entirely through the messaging framework.  The main components are a single texture sampling engine and a data port that controls all other caches and the render output pipeline.  As Figure 6 shows, AMD's texturing pipelines and L1 texture caches are distributed within each shader core, rather than centralized - and the L2 texture caches are replicated for each memory channel.




Figure 6 – Texture Sampling Pipeline Comparison


Both Ironlake and Sandy Bridge have a virtual memory model that relies on 4KB pages, to ensure compatibility with x86, in a two level translation structure.  Both linear and 4KB tiled address spaces are supported; the latter is essential for rectangular buffers that are common in graphics.  Up to 2GB of memory can be mapped and all pages must be locked so that they cannot swap to disk.  The graphics page tables indicate if a given page is snooped/coherent (system memory), or un-snooped (main memory) and a special global page table is used for memory that can also be accessed by the CPU.


Sampling


The texture sampling is a read-only memory pipeline used for both graphics and media applications.  The sampling engine receives commands and co-ordinates from the cores through the messaging framework.  The texture addressing unit takes the base co-ordinates of four pixels and will generate up to 32 texel addresses based on the mode and level of anisotropic filtering.  The sampler handles 4 component packed data that is 8-bit, 16-bit or 32-bit (corresponding to 32-bit, 64-bit and 128-bit wide texels).

The texture accesses probe the set-associative texture caches; the L1 texture cache is 4KB and backed up by a larger 16KB L2 cache.  These caches are read-only and are explicitly managed by the driver.  The texel data is optionally filtered to yield color values.  The pipeline can gamma correct textures and also selectively change the sampled texels to be black or transparent, based on the color values.  This technique is referred to as chroma keying and is used for compositing.  Table 2 shows the performance for Sandy Bridge’s texture sampling pipeline.




Table 2 – Sandy Bridge Texture Sampling Performance


As the table shows, the texturing throughput heavily depends on the data format and filtering complexity.  Although Intel did not disclose the precise bandwidth from the L1 texture cache, the overall sampling performance implies 128B/cycle of bandwidth.  Interestingly, the throughput for 128-bit texture data is 2X lower than expected for bi-linear and tri-linear filtering. If texture cache bandwidth were the only limitation, 128-bit textures should run at half the speed of a 64-bit format – suggesting that the 32-bit filtering (rather than texture lookup) may be the bottleneck.  The anisotropic filtering (AF) performance is relatively low, but only the pixels that need perspective correction will actually use anisotropic filtering.  Most pixels will use bi-linear or tri-linear.  

Beyond graphics, the texturing is also used extensively for encoding and decoding media.  The sampling engine contains fixed function hardware that can apply denoise filtering to clean up a video stream.  There is also a block that detects and corrects video interlacing to avoid lower quality interpolation of interlaced frames.  Lastly, video scaling and image enhancement use the texturing pipeline.  The adaptive video scaler applies an 8x8 sharpening filter and a bi-linear smoothing filter, and then blends the two together to produce a final output.  These techniques can also be applied to a static image as well.


Render Output Pipeline        

The other prominent component of the memory hierarchy is the render output pipeline (ROP), which is capable of both reads and writes and resides in the data port.  Sandy Bridge has a single ROP pipeline, again because the performance only scales modestly between different parts.  In contrast, GPUs from AMD and Nvidia tend to partition the ROPs, with a pipeline for each channel of external memory.  As shown in Figure 7, Llano has two ROP partitions.

The ROP is responsible for writing out render targets and performs a variety of critical graphics operations including alpha testing, stencil testing, depth testing and blending the pixel output.  Like most GPUs, the Gen 6 ROP is generally optimized for writing out data, rather than reading, and must include many related functions such as atomic operations.  A message to the ROP typically contains 4 quads (16 pixels total) that will be written back to one or more render targets.

Both Ironlake and Sandy Bridge include render caches that are used for read/write operations, particularly in the ROP.  The depth and color caches are respectively 32KB and 8KB, both set-associative.  The output from these caches are eventually sent over the ring interconnect and to Sandy Bridge’s L3 cache and/or memory controller.  The render caches are not coherent with the texture caches by default.  To safely read back data, the render caches must be explicitly flushed.



Figure 7 – Render Output Pipeline Comparison


The ROP in Sandy Bridge was also substantially improved over the previous generation with better performance and higher quality graphics options.  As an example of the trend towards fixed function graphics hardware at Intel, alpha coverage generation shifted from threads on the shader cores (in Ironlake) to dedicated hardware in the ROP (for Sandy Bridge).

Ironlake introduced hierarchical Z compression, to save memory bandwidth when handling depth buffers, and a fast Z-clear as well.  The Gen 6 ROP has significantly better hierarchical Z performance, by operating on larger tiles, and further reduces the cost of clearing different buffers.  Early Z-testing (prior to pixel shading) is now mandatory, rather than optional as it was in Ironlake.

The most significant change in the Sandy Bridge ROP is in image quality, rather than performance.  Previous generations did not support anti-aliasing, which helps to smooth out lines and jagged edges in rendered images.  Sandy Bridge has 2X and 4X multi-sample anti-aliasing (MSAA) with 32-bit FP blending, which is required for DirectX 10.1, and has been standard for AMD and Nvidia for a decade or more.  The ROP can write four 32-bit pixels per clock, for a total throughput of 16B/cycle – however, multi-sampling reduces the performance considerably.  The MSAA blending hardware can also be used for atomic operations, which will be handy for future versions that are OpenCL compliant.

Data Port

The data port is responsible for all memory accesses outside the texturing pipeline.  Most prominently this includes the ROP and render caches.  But the data port also encompasses the constant cache and access to the ring interconnect (i.e. the shared LLC and memory controller).  As with most shared resources, the data port is accessed through the messaging framework.  Sandy Bridge is the first GPU with access to the LLC, and it is a tremendous advantage over Ironlake.

One of the most important changes in the data port is the memory ordering.  Previously, the data port had no ordering between messages; software was responsible for ensuring that two messages did not attempt to simultaneously read and write the same location.  Sandy Bridge moves a step in the right direction and guarantees that read and write commands from each thread will be handled in-order.  There is still no hardware ordering between different threads by default, but that is normal for most GPUs.  The stronger memory ordering model is important for future generations that will have OpenCL and Direct Compute – an in-order model is much more natural for developers.  It is also a boon to tighter integration, since it more closely matches the x86 ordering model.

The Sandy Bridge data port has several new capabilities.  The first is an unaligned 16B block read, which can access 16B, 32B, 64B or 128B of contiguous data and writes back to 1-4 GRF entries (depending on the total size).   The second takes advantage of the new blending hardware in the ROP for atomic operations.  The message executes 8 atomic operations on 32-bit data, including arithmetic and logic, compare and exchange and min/max, and writes the results back to 8 different locations in memory.  The writes can be totally non-contiguous in memory, and the current generation will not attempt to coalesce the accesses.  This is another forward looking change that is necessary for programmability – both standards require some form of atomic operation.
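The semantics of such an atomic message can be sketched as follows. The op names and three-lane example are illustrative (the real message carries 8 lanes); the key properties from the text are that each lane targets an independent, possibly non-contiguous address and returns the old value:

```python
def atomic_message(memory, ops):
    """Sketch of a multi-lane atomic message: each lane applies one
    atomic op to its own address and returns the value that was
    there before the operation (read-modify-write semantics)."""
    old = []
    for addr, op, operand in ops:
        old.append(memory[addr])
        if op == "add":
            memory[addr] += operand
        elif op == "max":
            memory[addr] = max(memory[addr], operand)
        elif op == "cmpxchg":
            expected, new = operand
            if memory[addr] == expected:
                memory[addr] = new
    return old

mem = {0: 5, 40: 9, 96: 2}   # sparse, non-contiguous addresses
old = atomic_message(mem, [(0, "add", 3),
                           (40, "max", 4),
                           (96, "cmpxchg", (2, 7))])
print(old)   # the pre-operation values
print(mem)   # the updated, uncoalesced locations
```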


Heterogeneous Integration

The graphics integration in Sandy Bridge is particularly novel as Intel is sharing the LLC with the GPU. The driver allocates regions of the cache at way granularity (128KB) – and can actually request the whole cache. Each thread can spill 32KB of data back to the LLC, for a total of nearly 2MB in the larger 12 shader core variants. Almost any GPU data can be held in the LLC, including vertices, textures and many other types of state.
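The "nearly 2MB" spill figure follows from the per-thread spill space and the thread population of the 12-core part:

```python
CORES = 12                # high-end GT2 shader core count
THREADS_PER_CORE = 5
SPILL_KB_PER_THREAD = 32  # per-thread spill region in the LLC

total_kb = CORES * THREADS_PER_CORE * SPILL_KB_PER_THREAD
print(total_kb, total_kb / 1024)  # 1920KB, i.e. ~1.9MB of spill space
```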

The Sandy Bridge LLC and ring interconnect can rapidly pass data from the GPU back to the CPU – on AMD’s Fusion, despite its far higher performance GPU, that particular style of communication is discouraged. Since the GPU has a weaker ordering model, a flush command is needed to force data to be written back to the LLC prior to the CPU reading it. The driver can also allocate a portion of the LLC as a non-coherent cache for display data and other uses. For example, the results of transcoding might be written out to the non-coherent region.

While this excellent system integration promises many benefits, at present it is restricted mainly to multimedia workloads. For graphics, it is largely an academic advantage to anyone but Intel’s driver team. The GPU is exposed through graphics APIs; yet neither OpenGL nor DirectX programs can interact with coherent memory and bypass I/O copies (let alone use the LLC). AMD has already introduced an OpenCL extension for a zero copy mechanism on Windows systems, and presumably Intel will follow once they have OpenCL and DirectCompute capable hardware. Intel’s graphics driver can take advantage of fast CPU/GPU communication, but that is only because it has raw access to the GPU hardware. These advances pave the way for Ivy Bridge and certainly promise good things in the future, but also serve to point out some of the deficiencies in the current generation.

The power controller (PCU) resides in the system agent (along with the DDR3 memory controller) and manages the frequency and voltage for different regions on the chip. The cores, ring interconnect and LLC are all on a single power and frequency plane, although each can be individually power gated. The GPU resides on a different power and frequency domain, as does the system agent. However, the power budget for Sandy Bridge is managed in a unified manner. So if the CPU cores are idle, the power and thermal headroom can be used by the GPU for turbo mode (and vice versa). As a result, the GPU base frequency is a relatively normal 850MHz for a high-end part. But the peak clock speed is an impressive 1.35GHz for the entire GPU, including the command streamer, setup engine and ROPs.

There is a down side to this arrangement though. The GPU is so tightly integrated into the system that it relies on the ring interconnect, LLC and memory controller for operation. But the frequency for the ring and LLC is determined by the cores, which may be running at a lower P-state than the GPU. For example, if the CPU cores are lightly loaded and the PCU switches them to a reduced voltage and frequency, then the bandwidth across the ring decreases as well – negatively impacting the performance of the GPU. This will almost certainly be fixed in Ivy Bridge by separating out the clock trees for the ring.

Conclusions

The Sandy Bridge Gen 6 graphics is a huge improvement for Intel and the PC ecosystem. It is the first graphics product that has taken advantage of Intel’s core competency in semiconductor manufacturing, using their cutting edge 32nm process technology. The system integration in Sandy Bridge is quite advanced and is a roadmap for the rest of the industry, namely AMD. In particular, it is clear that sharing the last level cache and unified power management are hugely beneficial to performance and power efficiency.

The graphics performance is good and overall seems to be about a 2X improvement over the Ironlake generation, which puts many games above the 30 frames/second mark that is key for playability. As earlier reviews showed, the performance is actually better than some entry level discrete graphics cards, and with lower power consumption. In practice the performance depends on factors such as the number of shader cores, size of the LLC and frequencies – which vary considerably from model to model. While the hardware is impressive, there are still glaring software deficiencies. The texture filtering seems to be lower quality than AMD and Nvidia’s implementations. Gen 6 also does not support OpenCL or DirectX 11 – this is understandably due to scheduling, but still a weakness.

Competitively, Sandy Bridge’s graphics was quite impressive at introduction in January 2011, but lags behind AMD’s Llano, which was launched in the middle of 2011. The Llano GPU is essentially twice as fast, which is not surprising given that it is also twice the die area. Moreover, Llano has full support for OpenCL 1.1 and DX11.

However, the multi-media capabilities of Sandy Bridge are industry leading. The video encoding performance is 7-9X higher than Ironlake or Llano, largely due to the fixed function hardware in the GPU. Additionally, these capabilities are accessible to 3rd party software developers through Intel’s Media SDK and have been adopted in quite a few applications.

Overall Sandy Bridge’s GPU is a welcome step forward for Intel, but a mixed bag. It is the best solution for multimedia, which is arguably the most common GPU workload. However, it does not truly match AMD’s graphics; rather it narrows the gap significantly. It will fall to future generations, such as the 22nm Ivy Bridge and Haswell, to further close the gap both in terms of hardware and software.

The overall state of the industry and some of the deficiencies in Sandy Bridge hint at improvements for Ivy Bridge. Ivy Bridge will be a totally redesigned GPU with OpenCL, DirectCompute and DX 11 support and possibly extensions for task level parallelism. The shader cores will increase and probably be redesigned with some sort of shared memory or cache and greater execution resources. The floating point support will hopefully improve to full IEEE 754 compliance. The clocking of the ring interconnect will likely be independent of the CPU cores, so that the last level cache can run at peak performance for the GPU. Video encoding will probably be improved as well, perhaps with more fixed functions and broader codec support.

However, the most important area for improvement is Intel’s software environment, including both drivers and the overall programming model. A good software ecosystem is critical to efficiently leveraging heterogeneous resources. In particular, approaches that empower developers to easily share data (rather than explicitly copying) are ideal for performance and power efficiency. Ivy Bridge will undoubtedly be disclosed at IDF later this year, giving everyone plenty of time to ponder the changes before the first products arrive in 2012.