At SC08 there was a paper from the University of California at Berkeley, "Benchmarking GPUs to Tune Dense Linear Algebra". Quite a lot of its content is rather interesting; excerpts follow:
On the strip-mining style of vector execution used by the NV50, the paper has this to say:
"Partitioning of long vectors into warps by the GPU environment corresponds to strip mining into independent instruction streams. This is an alternative to the more traditional strip mining into independent instructions in the same instruction stream. For example, an operation on a 512-element vector on a machine with VL = 32 is traditionally performed as 16 independent vector instructions. The GPU allows (but not requires) distributing these 16 independent instructions across 16 instruction streams. This is done to improve performance in branching: associating an individual program counter with a short subset of a long vector allows skipping branches not taken by this subset rather than masking them off.
However, strip mining into independent instruction streams is expensive as it requires replicating register data across all instruction streams in the thread. For example, a program operating on 512-element vectors consumes 2KB of register file per every pointer, temporary value or scalar value defined in the scalar thread as a 32-bit register variable.
Another associated overhead is the partitioning of the register data into private register spaces associated with different instruction streams in the thread. Accessing the data residing in the register space of another warp requires staging it via the local store, which incurs costs.
Note that the number of independent instructions supplied by a program does not depend on the kind of strip mining used.
Whether independent instructions come in the same or different streams, they hide memory and pipeline latencies. "
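The arithmetic in the quoted example can be checked with a short sketch (plain Python, illustrative only; the vector length, hardware vector length, and 32-bit register width are all taken from the quote itself):

```python
# Strip mining a 512-element vector on a machine with VL = 32.
VECTOR_LEN = 512
VL = 32  # hardware vector length (warp size)

# Traditional strip mining: 16 independent vector instructions in ONE stream.
num_vector_instructions = VECTOR_LEN // VL

# GPU-style strip mining: the same 16 chunks become 16 instruction streams.
num_streams = num_vector_instructions

# Cost of replicating one 32-bit scalar register variable across the whole
# 512-element scalar thread: one copy per vector element.
bytes_per_register = 4
register_file_cost = VECTOR_LEN * bytes_per_register  # 2048 bytes = 2 KB

print(num_vector_instructions, num_streams, register_file_cost)
```

This reproduces the paper's figure of 2 KB of register file consumed per scalar variable; the cost scales linearly with the logical vector length, independent of how the chunks are divided between streams.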
On "Kernel Launch Overhead", the authors' measurements are as follows:
"The minimum time to asynchronously invoke a GPU kernel using either the low-level or the high-level CUDA API was 3–7 μs across a variety of systems equipped with different GPUs, operating systems and CUDA versions. This was measured by asynchronously invoking the same kernel a very large number of times and synchronizing once at the end. The program used was the simplest possible, such as copying one word from one location in the GPU memory to another. This ensures that the program runtime does not contribute substantially to the overall time. The time increases to 10–14 μs when synchronizing at each kernel invocation. This shows the expense of synchronization.
To ensure that we do not sacrifice performance by choosing CUDA for programming the GPU we also measured the overheads in DirectX 9.0c, which is a mature graphics API widely used in computer games. The timings were 7–8 μs for invocation alone and 20–23 μs for invocation with synchronization (synchronization is required when computing with DirectX to ensure correctness, but not in CUDA). This indicates that CUDA is as efficient as or better than DirectX in terms of the launch overhead."
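The measurement methodology described above amounts to a simple cost model: launch a trivial kernel many times asynchronously, then synchronize once, versus synchronizing at every invocation. A sketch in Python (the 5 μs and 12 μs per-launch figures are midpoints of the quoted 3–7 μs and 10–14 μs ranges, not measured values):

```python
# Cost model for the launch-overhead measurement described in the paper.
ASYNC_LAUNCH_US = 5.0   # midpoint of the quoted 3-7 us asynchronous launch cost
SYNC_LAUNCH_US = 12.0   # midpoint of the quoted 10-14 us launch+sync cost

def total_time_us(n_launches, sync_each=False):
    """Total time to invoke a trivial kernel n_launches times."""
    per_launch = SYNC_LAUNCH_US if sync_each else ASYNC_LAUNCH_US
    return n_launches * per_launch

# Synchronizing at every invocation more than doubles the total overhead.
n = 10_000
print(total_time_us(n), total_time_us(n, sync_each=True))
```

Using a kernel that merely copies one word of GPU memory keeps the kernel runtime negligible, so the measured per-invocation time is almost entirely launch (and, in the second case, synchronization) overhead.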
On "CPU-GPU Data Transfers", the paper reports these results:
"Our primary system is equipped with a PCIe 1.1 x16 interface that bounds the bandwidth of the CPU-GPU link by 4 GiB/s. We found that transferring contiguous pieces of data with sizes from 1 byte to 100 MB long across this link using pinned memory takes about:
Time = 11 μs + (bytes transferred) / (3.3 GB/s)
This fits the measured data within a few percent. Similar fitting on other systems yielded similar accuracy with different numbers, such as 10–17 μs overheads and 2.2–3.4 GB/s bandwidths.
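The fitted model is easy to play with directly. A small Python sketch (using the 11 μs overhead and 3.3 GB/s bandwidth from the fit quoted above) also computes the break-even transfer size at which the fixed overhead equals the streaming cost, below which transfers are overhead-dominated:

```python
# Transfer-time model fitted by the authors for pinned-memory copies
# over PCIe 1.1 x16: Time = 11 us + bytes / (3.3 GB/s).
OVERHEAD_S = 11e-6
BANDWIDTH_BPS = 3.3e9

def transfer_time_s(num_bytes):
    return OVERHEAD_S + num_bytes / BANDWIDTH_BPS

# Break-even size where the fixed overhead equals the streaming cost:
half_bw_bytes = OVERHEAD_S * BANDWIDTH_BPS  # roughly 36 KB

print(half_bw_bytes, transfer_time_s(100e6))
```

So on this system a transfer needs to be well above ~36 KB before the link bandwidth, rather than the per-transfer overhead, dominates the cost.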
When using two GPUs in the system, transfers to the second GPU run only at up to 1.8 GB/s, i.e. about what one may expect from PCIe 1.1 x8. This result was obtained when using various x16 slots in the nForce 680i SLI motherboard.
Operating with two GPUs concurrently poses new difficulties. CUDA requires attaching each CPU thread to a fixed GPU context, so multiple CPU threads must be created. According to our experience, pinning of memory is effective only with the GPU context that performed the memory allocation. Other GPU contexts perform at non-pinned rates when operating with this memory space. So, if two GPU contexts run transfers across the same main memory locations, at least one of the contexts will run at the non-pinned transfer rate, which is about 2x lower.
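The pinned-versus-non-pinned penalty can be put into rough numbers (Python sketch; the 3.3 GB/s pinned rate comes from the fit quoted earlier, and the 2x slowdown is the paper's own approximation, so these are illustrative figures, not measurements):

```python
# Rough model of the two-context transfer penalty described in the paper.
PINNED_GBPS = 3.3                  # pinned rate from the fit quoted above
NONPINNED_GBPS = PINNED_GBPS / 2   # "about 2x lower" per the paper

def copy_time_s(num_bytes, pinned=True):
    """Streaming time for a large copy, ignoring the fixed per-transfer overhead."""
    rate_bps = (PINNED_GBPS if pinned else NONPINNED_GBPS) * 1e9
    return num_bytes / rate_bps

# Two GPU contexts sharing the same host buffer: only the allocating
# context gets the pinned rate, so the other pays roughly double.
n = 100e6  # 100 MB
print(copy_time_s(n), copy_time_s(n, pinned=False))
```

The practical consequence matches the text: to keep both links at full speed, each CPU thread should allocate (and pin) its own staging buffers rather than sharing one buffer across GPU contexts.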
Benchmarks on a few different machines with PCIe 2.0 x16 have shown 3.9–6.1 GB/s transfer rates."