|
在 G100 的 "dual issue" 实现上, RWT 也有比较清晰的阐述:
Dual 'Issue'
One of the interesting complexities of NVIDIA’s microarchitecture is the relationship between latency and throughput. In general, CPUs execute most operations in a single cycle - but the latency of the fastest operation for an SP core is 4 cycles. Since the SM can issue one warp instruction every 2 'fast' cycles, it should be possible to have multiple instructions in flight at once. In fact, this ability is what NVIDIA refers to as ‘dual issue’ – although in reality it is simply parallel execution across functional units. The SP cores execute one instruction over 4 cycles, and other execution units can be in use, processing a different warp instruction simultaneously.
![]()
Figure 6 – Dual 'Issue' for NVIDIA's Execution Units
As Figure 6 illustrates, the SM can issue a warp instruction every 2 clocks. In the first cycle, a MAD is issued to the FPU. Two cycles later, a MUL instruction is issued to the SFU. Two cycles after that, the FPU is free again and can execute another MAD. Two cycles after that, the SFU is free and can begin to execute a long running transcendental instruction. Using this technique, the computational throughput of the shader core is increased by 50%, while retaining the simplicity of issuing only one warp every 2 cycles, which simplifies the scoreboarding logic. Not all combinations can be executed in parallel. For instance, the double precision unit and the single precision units share logic and cannot be active simultaneously as a result. |
|