Arithmetic Instruction Throughput
To issue one instruction for a warp, a multiprocessor takes:
# 2 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, and type conversion instructions;
# 8 clock cycles for reciprocal, reciprocal square root, and __logf(x) (see Table A-2).
32-bit integer multiplication takes 8 clock cycles, but __mul24 and __umul24 (see Appendix A) provide signed and unsigned 24-bit integer multiplication in 2 clock cycles. Integer division and modulo are particularly costly and should be avoided or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler performs these conversions itself when n is a literal.
Other functions take more clock cycles as they are implemented as combinations of several instructions.
Floating-point square root is implemented as a reciprocal square root followed by a reciprocal, so it takes 16 clock cycles for a warp.
Floating-point division takes 18 clock cycles, but __fdividef(x, y) provides a faster version at 10 clock cycles (see Appendix A).
__sinf(x), __cosf(x), __expf(x) take 16 clock cycles.
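Where reduced accuracy is acceptable, these intrinsics can replace their standard counterparts directly in device code. A minimal illustrative sketch (the kernel name and launch layout are assumptions, not from the text; see Table A-2 for the intrinsics' error bounds):

```cuda
// Computes the same expression two ways for each element.
__global__ void attenuate(const float *x, const float *y,
                          float *slow, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Standard operators/functions: full single-precision
        // accuracy, but 18 cycles for the division alone.
        slow[i] = sinf(x[i]) / y[i];
        // Fast intrinsics: 16 cycles for __sinf(), 10 for
        // __fdividef(), at reduced accuracy.
        fast[i] = __fdividef(__sinf(x[i]), y[i]);
    }
}
```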
Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:
# Functions operating on char or short whose operands generally need to be converted to int,
# Double-precision floating-point constants (defined without any type suffix) used as input to single-precision floating-point computations,
# Single-precision floating-point variables used as input parameters to the double-precision version of the mathematical functions defined in Table A-1.
The last two cases can be avoided by using:
# Single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f,
# The single-precision version of the mathematical functions, defined with an f suffix as well, such as sinf(), logf(), expf().
For single-precision code, we highly recommend using the single-precision math functions. When compiling for devices without native double-precision support, the double-precision math functions are by default mapped to their single-precision equivalents. However, on future devices with native double-precision support, these functions will map to actual double-precision implementations, with correspondingly lower throughput.