|
http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf
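This is a paper that NVIDIA will present at HPG 2009 next month, titled Understanding the Efficiency of Ray Traversal on GPUs.
According to the paper, the two NVIDIA researchers argue that what limits ray traversal efficiency is not the memory subsystem but hard-to-pinpoint inefficiencies in hardware work distribution. In response, they offer a simple solution that significantly narrows the gap between theoretical simulation and measured performance, resulting in the fastest GPU ray tracer to date.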
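To make the idea concrete, here is a minimal CPU-side sketch of the persistent-threads pattern that the quoted conclusion refers to: a fixed set of long-running workers is launched once, and each worker keeps claiming batches of rays from a shared atomic counter until none are left. This is only an analogy in standard C++ threads, not the paper's CUDA kernels; `trace_ray`, `trace_all`, and the batch size are hypothetical names and values chosen for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for tracing one ray. The cost varies with the ray id
// to mimic a heterogeneous workload; the result is always non-negative.
int trace_ray(int ray_id) {
    int steps = 1 + (ray_id % 7) * 50;
    int acc = 0;
    for (int i = 0; i < steps; ++i) acc += (ray_id + i) % 3;
    return acc;
}

// Persistent-threads pattern (assumed shape, CPU analogy): start num_workers
// threads once, and let each one pull the next batch of rays from a shared
// counter instead of assigning work to threads up front.
std::vector<int> trace_all(int num_rays, int num_workers, int batch_size) {
    std::vector<int> results(num_rays, -1);  // -1 marks "not yet traced"
    std::atomic<int> next_ray{0};            // index of the first unclaimed ray

    auto worker = [&]() {
        for (;;) {
            // Atomically claim the half-open range [begin, begin + batch_size).
            int begin = next_ray.fetch_add(batch_size);
            if (begin >= num_rays) break;    // pool is empty, worker retires
            int end = std::min(begin + batch_size, num_rays);
            for (int r = begin; r < end; ++r) results[r] = trace_ray(r);
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return results;
}
```

Calling `trace_all(1000, 4, 32)` traces 1000 rays with four workers in batches of 32. Because each batch is claimed exactly once, a worker that finishes a cheap batch immediately picks up more work instead of idling while others handle expensive rays.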
The concluding section follows:
We have shown that the performance of fastest GPU trace() kernels
can be improved significantly by relying on persistent threads instead
of the hardware work distribution mechanisms. The resulting
level of performance is encouraging, and most importantly the less
coherent ray loads are not much slower than primary rays. It seems
likely that other tasks that have heterogeneous workloads would
benefit from a similar solution.
In additional tests we noticed that in these relatively simple scenes
the performance of our fastest speculative kernel remains around
20–40Mrays/sec even with randomly shuffled global illumination
rays. These ray loads are much less coherent than one could expect
from path tracing, for example, so it seems certain that we would be
able to sustain at least that level of performance in unbiased global
illumination computations.
We have also shown that, contrary to conventional wisdom, ray
tracing performance of GTX285 is not significantly hampered by
the lack of cache hierarchy. In fact, we can also expect good scaling
to more complex scenes as a result of not relying on caches.
However, we do see the first signs of memory-related effects in the
fastest speculative while-while kernel in diffuse interreflection rays.
In these cases a large cache could help.
Additional simulations suggest that a 16-wide machine with identical
computational power could be about 6–19% faster than 32-wide
in these scenes, assuming infinitely fast memory. The difference
was smallest in primary rays and largest in diffuse, and
it also depended on the algorithm (min/average/max (%)): while-while
(9/14/19), speculative while-while (6/10/15), if-if (8/10/13),
speculative if-if (6/8/12). This suggests that speculative traversal is
increasingly useful on wider machines. Theoretically a scalar machine
with identical computational power could be about 30% (primary)
to 144% (diffuse) faster than 32-wide SIMD with the used
data structures, again assuming infinitely fast memory. |
|