POPPUR爱换


POV-Ray on Nvidia g80

1#
Posted on 2008-5-13 18:24
Authors:

Jonathan Bruneau, Rob Kost, Jackson Marusarz
Dept. of Electrical and Computer Engineering
University of Colorado at Boulder
{jonathan.bruneau, kost, jackson.marusarz}@colorado.edu

Abstract:

"Recent advances in ray tracing technology have mitigated the slow computational nature of ray tracing. However, raytracing is still one of the most time consuming operations on the CPU. This paper presents a novel approach to porting POV-Ray to the NVidia g80 series graphics card. While it would be quite daunting to port the entire POVRay application to the GPU, porting the computational intensive segments is more readily implementable. The method we use strives to achieve speedups by efficiently utilizing the highly parallel nature of the GPU. We exploit parallelism and increase speed by executing the ray intersections in parallel on the CPU and GPU, where the CPU buffers the ray tracing intersection requests to theGPU and handles all other bookkeeping tasks. To date, no one has ported POV-Ray to the GPU. Using an NVidia 8600GT, our approach attempts to improve performance by 10%, with a theoretical upper bounded of 290%."

"Ray tracing is a technique for designing computer graphics by modeling the actions of light beams in a scene of objects. It can simulate very complicated global illumination effects such as reflection, refraction, shadows and radiosity, to name a few. However, it comes at a very high computation cost and is classically one of the most time consuming operations on a CPU [1]. A file, known as the scene, represents all of the information about the image and acts as the input to a ray tracing algorithm. The final picture to be displayed is a composition of every pixel and its color, as seen by the defined camera. This is known as the frame. Ray tracing computes the color for each pixel (x,y) in the frame by casting a ray from the camera through the frame at (x,y) and into the scene. The algorithm then calculates where the ray intersects objects in the scene and how it is affected based on object attributes. The algorithm finishes when a ray has been cast and calculated for every pixel [2]. "

"Ray tracing differs from many rendering techniques because it can render 3d images which contain very complex light interactions. The basic premise of ray tracing is to cast a ray into space for every pixel on the image. The algorithm then calculates whether that ray will intersect an object, bounce off that object and hit another one. In doing so, one can then easily model reflection and refraction of the rays by recursively following the path the light takes as it bounces off objects through space. This is unlike other rendering methods which are more of a simulation, which in contrast ray tracing produces photo realistic images. The recent g80 series of GPUs from NVidia have truly enabled General Purpose Computing on Graphic Processing Units (GPGPUs), allowing parallelizable  applications to substantially increase their performance. In face of the poor performance of POV-Ray, there is a considerable motivation to port POV-Ray to the GPU. "

"Although the POV-Ray team is currently developing a multi-threaded port of their application (currently unavailable), there has been no mention of GPGPU optimizations of this port."

"To exploit the GPU, parallelism must be extracted from the application. In an optimal case, a thread of execution would be assigned for each pixel of the render buffer, and the remaining calculations would be performed for each pixel in parallel on the GPU. However, the loop that iterates across each pixel is very high up the rendering pipeline, and as such placing the entire ray tracing computation on the GPU requires a complete port of the POV-Ray ray tracing code - a monumental task. "

"An alternative is to have the CPU and GPU share computations. The majority of the code flow can be handled by the CPU, and the computationally expensive steps by the GPU. The majority of the expensive computations in POV-Ray can be found in the ray intersections routines. For each pixel, the ray intersection routines are called the very least twice: once to find the object intersection with the ray, a second time for lighting calculations (this case occurs for a 1 object scene lit by 1 light source). In normal circumstances, each object in the vista tree (a POV-Ray implementation of a Binary Space Partitioning (BSP) Tree) is checked for intersection, and once an intersection detected, another ray is cast for every light source. In both cases, the intersection routines are invoked."

"Using a parallel approach, ray tracing an image can speed up immensely. In particular, in a POV-Ray scene, there are  a massive amount of intersection routines for each object. Since these intersections are not dependent on any other data, they can be parallelized"

"There is a notable difference between both of these rendered images. The main difference in quality of the two images in Figures 3 and 4 are due to POV-Ray using double floating point precision (64 bit), while current NVidia graphic cards are only single floating point (32 bit). Precision errors may appear to be trivial and should not diminish the image appearance. However, precision discrepancies cause many intersection requests to come back as not being intersected on the GPU, while on the CPU they resulted in being intersected. This can cause a decent amount of artifacts and images to appear granular."

"Performance Bottleneck

From the collected results above, it is clear that the GPU fails to outperform the CPU when rendering a POV-Ray scene. Although surprising, the reasons for the failure are well understood. In order to add GPU support to POV-Ray, many solutions to peripheral problems were implemented that considerably hampered the expected speedup. The three main issues are:

• Request Deadlock is handled with a reactive solution
• Not all POV-Ray object intersections are executed on the GPU
• The GPU’s performance is bound by the CPU

Request Deadlock, as discussed in section 3.2, occurs on Request Buffer underflows. The current implementation uses a reactive solution, forcing the data in the input buffers to transfer to GPU memory and execute after a certain timeout period, as opposed to proactively detecting deadlock conditions and preventing them. In the results discussed in sections 4.1 and 4.2, this interval was set to 9 ms. This number may not be optimally tuned and could be subject to a more judicious selection, ideally chosen dynamically from statistics accrued during the scene render. However, regardless of how the timer interval is tuned, performance will still suffer, because input to the Request Buffer is non-uniform across an entire render of the scene. As such, in some instances the buffer will be forced to the GPU even though it is mostly empty. Little data is sent to the GPU, the available bandwidth is barely utilized, and performance degrades."
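A host-side sketch of that reactive timeout is below. Names, sizes, and the flush body are placeholders for the paper's implementation, and for brevity the timer is only checked when a new request arrives (the real system would also fire it when no requests arrive at all).

```cpp
// Sketch: flush the request buffer when it fills, or reactively after a 9 ms timeout.
#include <chrono>
#include <cstdio>
#include <vector>

struct IntersectRequest { float origin[3]; float dir[3]; int object_id; };

using Clock = std::chrono::steady_clock;

constexpr std::size_t kBufferCapacity = 4096;
constexpr std::chrono::milliseconds kFlushTimeout{9};   // the 9 ms interval from the paper

std::vector<IntersectRequest> buffer;
Clock::time_point last_flush = Clock::now();

// Placeholder for the real work: copy the buffer to GPU memory, launch the
// intersection kernel, and hand the results back to the waiting render code.
void flush_to_gpu(const char* reason)
{
    std::printf("flushing %zu requests (%s)\n", buffer.size(), reason);
    buffer.clear();
    last_flush = Clock::now();
}

void submit(const IntersectRequest& req)
{
    buffer.push_back(req);
    if (buffer.size() == kBufferCapacity)
        flush_to_gpu("buffer full");          // normal, bandwidth-friendly path
    else if (Clock::now() - last_flush >= kFlushTimeout)
        flush_to_gpu("timeout, mostly empty"); // reactive path that degrades throughput
}

int main()
{
    for (int i = 0; i < 100000; ++i)
        submit(IntersectRequest{});
    return 0;
}
```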

"Another factor causing performance losses is the design decision to only offload sphere intersection calculations to the GPU. Currently, the implementation uses a kernel for all sphere intersections but leaves other intersection calculations to the CPU. As a result, scene or section of scenes not readily using sphere fail to completely fill the Request Buffer and as such are bound to the Reactive Request Deadlock timer interval. To remedy the issue,  new GPU kernels should be implemented for a wider variety of object types. Moving to multi-kernel GPU implementation, however, is partially problematic since the GPU must be managed, especially in the case when one kernel is executing while another one is ready to start. This problem is also seen in inter-process communication, where many processes all communicate with a single Request Buffer. A complex system of shared semaphores is required to eliminate race conditions between processes as well as to ensure uninterrupted GPU activity."

"Finally, the GPU’s performance is limited by the rate at which the CPU issues data to the Request Buffer. In an ideal situation, processes would fill the input buffer while the GPU writes calculated results back to system memory. Unfortunately, this is generally not the case. The CPU is buffering requests relatively slowly, and thus the GPU remains idle for long periods of time. These are wasted cycles since in the meantime the GPU could perform work in parallel. This later issue requires more intersection routines to be ported to the GPU and a drastic rewrite to the POV-Ray program."

http://eces.colorado.edu/~marusarz/POV-Ray_GPGPU.pdf

:loveliness:
2#
Posted on 2008-5-13 18:45
Who knows how many years it will take before ray tracing is commercialized on consumer PCs.