A write-up benchmarking atomic operation performance on GT200:
http://strobe.cc/articles/cuda_atomics/
Three memory access patterns will be tested. The first goes straight for the jugular: all writes across an SM go to the same address, ensuring that all atomic operations cause a conflict. Each SM gets its own address, though, because having all processors write to the same location caused several system crashes during testing. This is expected to be nearly the worst case for atomic operations, and the results do not disappoint:
![Benchmark results: every thread on an SM hits the same address]()
Ick. Let’s not do that again.
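For a rough sense of how lopsided that first pattern is, here is a small host-side Python model (not the article's benchmark code) that maps every thread to its target slot and counts how many threads pile onto each one. The function names, the SM count of 30, and the 256 threads per SM are all assumptions chosen for illustration.

```python
from collections import Counter


def same_slot_per_sm(num_sms, threads_per_sm):
    """Return one target slot per thread: every thread on SM i targets slot i."""
    return [sm for sm in range(num_sms) for _ in range(threads_per_sm)]


def writers_per_slot(targets):
    """Count how many threads update each slot."""
    return Counter(targets)


if __name__ == "__main__":
    NUM_SMS = 30          # assumed SM count, for illustration only
    THREADS_PER_SM = 256  # hypothetical occupancy, not taken from the article
    load = writers_per_slot(same_slot_per_sm(NUM_SMS, THREADS_PER_SM))
    print(len(load), "distinct slots,", max(load.values()), "writers on the busiest slot")
```

Under these assumed numbers every slot has as many writers as there are threads on its SM, so each atomic on that SM has to be serialized behind all the others.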
The next access pattern is less pessimal; each memory location is separated by 128 bytes, and each thread gets its own memory location, ensuring that no conflicts occur but also preventing the chip from coalescing any memory operations.
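A minimal Python sketch of this layout, as a host-side model rather than the article's actual kernel: each thread gets a byte offset 128 bytes past the previous one, and the model counts how many distinct 128-byte segments one warp touches. The helper names are hypothetical, and using 128-byte segments as a stand-in for the hardware's coalescing rules is a simplifying assumption.

```python
WARP_SIZE = 32        # threads per warp
SEGMENT_BYTES = 128   # assumed segment size, used only as a coalescing proxy


def strided_offsets(num_threads, stride_bytes=128):
    """Byte offset targeted by each thread: thread i uses i * stride_bytes."""
    return [tid * stride_bytes for tid in range(num_threads)]


def segments_per_warp(offsets, segment_bytes=SEGMENT_BYTES, warp_size=WARP_SIZE):
    """For each warp, count the distinct aligned segments its threads touch."""
    return [
        len({off // segment_bytes for off in offsets[i:i + warp_size]})
        for i in range(0, len(offsets), warp_size)
    ]


if __name__ == "__main__":
    offsets = strided_offsets(2 * WARP_SIZE)
    print("conflict-free:", len(set(offsets)) == len(offsets))
    print("segments touched per warp:", segments_per_warp(offsets))
```

In this model no two threads share an address, but every thread in a warp lands in its own segment, which is why nothing can be coalesced.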
![Benchmark results: one address per thread, 128 bytes apart, no coalescing]()
Well, that’s… tolerable. It remains to be seen whether atomics can be used for scatters in computation threads, but this looks like it wouldn’t cause too much damage.
One last access pattern: this time, all threads are neatly coalesced, each accessing a 4-byte memory location in order, such that a warp hits a single 256-byte-wide, 256-byte-aligned region of memory.
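The coalesced layout can be modeled the same way. In this hedged Python sketch (a host-side model, not the article's code), thread i targets byte offset 4 * i from a base that is assumed to be 256-byte aligned, and the hypothetical helper `regions_per_warp` counts how many aligned regions of a given size each warp touches.

```python
WARP_SIZE = 32   # threads per warp
WORD_BYTES = 4   # each thread updates one 4-byte word


def contiguous_offsets(num_threads, word_bytes=WORD_BYTES):
    """Byte offset per thread, relative to a base assumed to be 256-byte aligned."""
    return [tid * word_bytes for tid in range(num_threads)]


def regions_per_warp(offsets, region_bytes, warp_size=WARP_SIZE):
    """For each warp, count the distinct aligned regions of region_bytes it touches."""
    return [
        len({off // region_bytes for off in offsets[i:i + warp_size]})
        for i in range(0, len(offsets), warp_size)
    ]


if __name__ == "__main__":
    offsets = contiguous_offsets(4 * WARP_SIZE)
    print("256-byte regions per warp:", regions_per_warp(offsets, 256))
    print("128-byte regions per warp:", regions_per_warp(offsets, 128))
```

A warp's 32 words cover 32 * 4 = 128 contiguous bytes, so with an aligned base each warp stays inside a single aligned region, in contrast to the strided pattern where every thread lands in a different one.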
![Benchmark results: coalesced pattern, consecutive 4-byte words]()
Crap. That’s quite a bit worse. Sure, the absolute latency of an atomic operation is lower here, but the ratio of atomic latency to plain read-modify-write latency is much larger for the coalesced pattern than for the uncoalesced one, so the relative cost of atomic operations in this context is much worse.