
NVIDIA's reliable GPU design

1#
Posted on 2007-7-6 01:32
With the growing adoption of GPGPU, the reliability of GPU computation results has been drawing increasing attention. NVIDIA Research and the University of Virginia have published a paper on improving the reliability of GPU computation.

Excerpts:

ECC in memory systems for these critical applications is a necessity, but it is not sufficient. An error can occur in control or logic that silently and undetectably corrupts computation outside of the auspices of any memory protections. Transient errors in logic are not yet prevalent, but rates are increasing exponentially with each generation and logic errors are expected to become a significant concern in practice within the next three to five years.

To achieve reliability, logical structures must be protected with redundancy. Possible ways to build a redundantly reliable GPU-based system include replicating the entire computation (temporal redundancy) or using two GPUs (spatial redundancy) and comparing the result. Both of these involve a 2× overhead, either in time or space respectively, plus comparison time, in the expected, no-error case. Architectural solutions place the redundancy on-chip, but must answer more complicated questions about what hardware will fall within or without the sphere of replication, and how to ensure that the likelihood of a silent data corruption is minimized.
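
For concreteness, here is a minimal sketch of the temporal-redundancy baseline described above: the same kernel is issued twice on one GPU and the host compares the two result buffers, paying the 2x time overhead plus comparison. The kernel and all names are illustrative placeholders, not anything from the paper.

[code]
// Minimal temporal-redundancy sketch (hypothetical kernel and names):
// issue the same kernel twice, then compare the two outputs on the host.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;     // placeholder workload
}

int main() {
    const int n = 1 << 20;
    float *in, *out1, *out2;
    cudaMallocManaged(&in,   n * sizeof(float));
    cudaMallocManaged(&out1, n * sizeof(float));
    cudaMallocManaged(&out2, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    int threads = 256, blocks = (n + threads - 1) / threads;
    compute<<<blocks, threads>>>(in, out1, n);   // first issue
    compute<<<blocks, threads>>>(in, out2, n);   // redundant reissue: the 2x in time
    cudaDeviceSynchronize();

    int mismatches = 0;                          // comparison step
    for (int i = 0; i < n; ++i)
        if (out1[i] != out2[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);      // nonzero would mean reissue
    return 0;
}
[/code]

Spatial redundancy would instead launch the two copies on two GPUs (e.g. via cudaSetDevice), trading the time overhead for a second board plus the same comparison step.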

The Folding@Home GPU client has now been in distribution for 6 months, running on approximately 500 GPUs. Over this sample set, Folding@Home has shown a failure rate of approximately 1% [Hou07]. This number may seem high; however, note that Folding@Home users are competing to complete work units, thus many overclock their GPUs, which certainly impacts reliability.

Fully hardware solutions focus their efforts on reducing the temporal and spatial overhead of error detection. Chip multiprocessors (CMPs) bring the idea of a Chip-level Redundantly Threaded Processor or CRT, which provides some minimal hardware support for redundant multithreading on a CMP. A slightly more sophisticated approach is found in Simultaneously and Redundantly Threaded Processors or SRTs [RM00, MKR02], which take advantage of hardware multithreading, but not of multiple cores.

There are several reasons why memoization [PGS04], or caching multiply computed results for reuse in a reliable store, is not suitable for application to graphics processors.
The most important of these is that it depends upon the existence of complex decode logic to cover the latency of a cache access. This logic is simply neither present nor desirable in current GPU architectures. Furthermore, the technique is engineered for temporally redundant computation, but as presented in Section 3, a reliable GPU solution is far more likely to use spatial redundancy rather than temporal redundancy.
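
For readers unfamiliar with the technique being rejected, here is a tiny host-side sketch of memoization in this sense: results are kept in a store assumed reliable (e.g. ECC-protected) and reused instead of being recomputed. Everything here is a hypothetical illustration, not the [PGS04] implementation.

[code]
// Illustrative memoization: reuse results from a reliable store instead of
// recomputing (and re-verifying) them. All names here are hypothetical.
#include <cstdio>
#include <unordered_map>

// The "reliable store": assumed protected (e.g. by ECC), so a hit returns
// a result already known to be correct.
static std::unordered_map<unsigned, float> reliable_cache;

float memoized_op(unsigned input) {
    auto hit = reliable_cache.find(input);
    if (hit != reliable_cache.end())
        return hit->second;               // hit: skip redundant recomputation
    float result = 0.5f * input + 3.0f;   // placeholder computation
    reliable_cache[input] = result;       // would be verified before insertion
    return result;
}

int main() {
    printf("%f %f\n", memoized_op(4u), memoized_op(4u));  // second call hits cache
    return 0;
}
[/code]

The catch the excerpt points to is the lookup latency on every call: a CPU's complex decode logic can hide it, while a GPU's simple front end cannot.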

We implement a design for a redundantly reliable graphics processor with the explicit intent that this reliability is intended for GPGPU domains. We have previously shown that the type and level of reliability described in this paper is not necessary for general graphics [SLS06]. Specifically, we provide a redundancy mechanism for the fragment engine, since the other stages of the pipeline have small AVFs and are therefore of little import to GPGPU applications. In addition to providing a redundancy mechanism, we work under the additional constraint that a solution should require a minimal set of changes to existing hardware: we seek a solution that is zero-cost in terms of performance overhead when processing graphics workloads, and nearly zero-cost with respect to die space sacrificed to implement it. This means using existing logic and data paths whenever possible.

Quads (2×2 arrays of fragments) and Warps (NVIDIA's term for a minimum set of threads for SIMD execution in CUDA) both represent minimum SIMD execution blocks in their respective domains. Implementing redundancy over quads or warps is essentially equivalent to shader core redundancy, the primary difference being the size of the replicated computation blocks: pairs of quads or warps work in parallel on replicated data instead of pairs of cores, complicating the dual issue semantics and restricting its flexibility. Again, a lockstepping and cache sharing mechanism might be plausible here, but seems less likely than at the shader core, and is highly dependent on current hardware organization.
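
The paper's mechanism lives in hardware, but as a rough software analogue of warp-level redundancy, the following CUDA sketch pairs each even warp with its odd neighbor on replicated data and compares through shared memory. The kernel name, workload, and control flow are all assumptions for illustration.

[code]
// Illustrative sketch: warps 2k and 2k+1 in each block execute the same
// replicated work; the even warp compares against its twin via shared memory.
// Launch as: warp_redundant<<<(n + 127) / 128, 256, 256 * sizeof(float)>>>(...)
__global__ void warp_redundant(const float* in, float* out, int* mismatch, int n) {
    extern __shared__ float copy[];            // one result slot per thread
    int warp = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;
    int pair = warp / 2;                       // each warp pair covers warpSize elements
    int i = blockIdx.x * (blockDim.x / 2) + pair * warpSize + lane;

    float r = 0.0f;
    if (i < n) r = in[i] * in[i];              // placeholder workload
    copy[threadIdx.x] = r;
    __syncthreads();

    // The even warp holds the primary result; its odd twin sits warpSize slots later.
    if ((warp & 1) == 0 && i < n) {
        if (r != copy[threadIdx.x + warpSize])
            atomicExch(mismatch, 1);           // flag an error; the caller can reissue
        else
            out[i] = r;                        // results agree: commit
    }
}
[/code]

Note how half the block's lanes do replicated work, which is exactly the 2x spatial overhead the paper's shared-cache double issue method tries to soften.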

We add a domain buffer to store data needed to set up the rasterizer for the domain in the event of a reissue, augment the rasterizer to produce two of each fragment and to handle reissue requests from the domain buffer, and repurpose raster operations as our comparator. In the figure, “VS” is the vertex stream, “GP” is the geometry processor, “DB” is the domain buffer, “FC” is a fragment core, and “FB” is the framebuffer.
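
As a purely software model of that control flow (an assumption for illustration, not the paper's hardware), the double issue, compare, and reissue path might look like this, with names taken from the figure caption:

[code]
// Self-contained software model of the double-issue pipeline: the rasterizer
// emits each fragment twice, the ROP acts as comparator, and a mismatch
// triggers a reissue from the domain buffer (DB). Illustrative only.
#include <cstdio>
#include <vector>

struct Fragment { int x, y; float value; };

float shade(const Fragment& f) { return 0.5f * f.value; }  // stands in for one FC

int main() {
    std::vector<Fragment> domain = {{0, 0, 1.0f}, {1, 0, 2.0f}};  // DB: setup state
    std::vector<float> framebuffer(domain.size());                // FB

    bool reissue = true;
    while (reissue) {                        // DB replays the domain on failure
        reissue = false;
        for (size_t i = 0; i < domain.size(); ++i) {
            float a = shade(domain[i]);      // rasterizer issues the fragment...
            float b = shade(domain[i]);      // ...twice, to a second FC
            if (a == b) framebuffer[i] = a;  // ROP repurposed as comparator
            else { reissue = true; break; }  // silent error caught: reissue domain
        }
    }
    printf("fb[0]=%f fb[1]=%f\n", framebuffer[0], framebuffer[1]);
    return 0;
}
[/code]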

Our implementation demonstrates that a reliable GPU built as described in this paper benefits greatly from increased memory locality inherent in the double issue method, allowing it to perform much better than the naïve expected overhead of 2×. In fact, our simulations show a measured overhead of less than 1.5× on most of our problem domains.

http://www.cs.virginia.edu/~skadron/Papers/sheaffer_gh2007.pdf
2#
Posted on 2007-7-6 12:29
What's all this? Is there a translation?