GS not designed for large-expansion algorithms like tessellation
Due to required ordering and serial execution
See Andrei Tatarinov’s talk on Instanced Tessellation
Remember you don’t need to use a GS if you are just processing vertices
Be aware of the appropriate ALU-to-TEX hardware instruction ratios:
4 5D-vector ALU per TEX on AMD [AMD acknowledges these are 5D vector units, no longer describing them as scalar as in the press release]
10 scalar ALU per TEX on NVIDIA GeForce 8 series
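As a rough illustration of how these ratios might be applied, the sketch below compares a shader's ALU:TEX instruction ratio against the hardware ratio. The helper `classifyShader` and the `Bound` enum are hypothetical names, and the instruction counts are assumed to come from the compiler's disassembly; this is a simplification, not a vendor tool.

```cpp
// Hardware ALU:TEX ratios quoted on the slide (ALU instructions per TEX instruction).
constexpr double kAmdVectorAluPerTex = 4.0;        // 5D-vector ALU ops per TEX on AMD
constexpr double kGeForce8ScalarAluPerTex = 10.0;  // scalar ALU ops per TEX on GeForce 8

enum class Bound { Alu, Tex, Balanced };

// Hypothetical helper: classify a shader from its instruction counts.
// aluCount must be in the same unit as hwRatio
// (vector ops for the AMD ratio, scalar ops for the NVIDIA ratio).
Bound classifyShader(int aluCount, int texCount, double hwRatio) {
    if (texCount == 0) return Bound::Alu;  // no texture work at all
    const double ratio = static_cast<double>(aluCount) / texCount;
    if (ratio > hwRatio) return Bound::Alu;  // more ALU work than the hardware balances
    if (ratio < hwRatio) return Bound::Tex;  // texture units are likely the limiter
    return Bound::Balanced;
}
```

A shader with 40 ALU and 10 TEX instructions sits exactly at the AMD ratio, but the same counts measured in scalar ops fall below the GeForce 8 ratio and would be flagged as likely TEX-limited.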
Check for excessive register usage
> 10 vector registers is high on GeForce 8 series [on GF8, shader performance degrades once roughly 10 vector registers are in use]
Simplify shader, disable loop unrolling
The DX compiler may unroll loops, so check its output
AMD: Clears
Always clear Z buffer to enable HiZ
Clearing color render targets is not free on Radeon HD 2000 and 3000 series
Cost is proportional to the number of pixels to clear
The fewer pixels to clear, the better!
Here the rule about minimum work applies:
Only clear render targets that need to be cleared!
Exception for MSAA RTs: need clearing every frame
RT clears are not required for optimal multi-GPU usage
AMD: Depth Buffer Formats
Avoid DXGI_FORMAT_D24_UNORM_S8_UINT for depth shadow maps
Reading back a 24-bit format is a slow path
Usually no need for stencil in shadow maps anyway
Recommended depth shadow map formats:
DXGI_FORMAT_D16_UNORM
Fastest shadow map format
Precision is enough in most situations
Just need to set your projection matrix optimally
DXGI_FORMAT_D32_FLOAT
High-precision but slower than the 16-bit format
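To make the "set your projection matrix optimally" advice concrete, here is a minimal sketch that estimates how much distance one 16-bit depth step covers. It assumes a D3D-style depth mapping d = f/(f-n) * (1 - n/z) for the perspective case and a linear mapping for the orthographic case; the function names are hypothetical.

```cpp
// Number of distinct steps in a 16-bit UNORM depth buffer.
constexpr double kD16Steps = 65535.0;

// Orthographic projection (e.g. a directional light): depth is linear in z,
// so the smallest resolvable distance is the same everywhere in the range.
double orthoDepthStep(double zNear, double zFar) {
    return (zFar - zNear) / kD16Steps;
}

// Perspective projection (e.g. a spot light), assuming d = f/(f-n) * (1 - n/z).
// Then dd/dz = f*n / ((f-n) * z^2), so one depth step at distance z covers
// approximately z^2 * (f-n) / (f*n) / 65535 units.
double perspectiveDepthStep(double zNear, double zFar, double z) {
    return (z * z) * (zFar - zNear) / (zFar * zNear) / kD16Steps;
}
```

Under these assumptions, moving the near plane from 0.1 to 1.0 with the far plane at 100 shrinks the step at z = 50 by roughly a factor of ten, which is why fitting the near and far planes tightly to the shadow casters matters most for the 16-bit format.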
NVIDIA: Clears
Always Clear Z buffer to enable ZCULL
Always prefer Clears vs. fullscreen quad draw calls
Avoid partial Clears
Note there are no scissored Clears in DX10; they are only possible via draw calls
Use Clear at the beginning of a frame on any rendertarget or depthstencil buffer
In SLI mode the driver uses Clears as a hint that no inter-frame dependency exists. It can then avoid synchronization and transfers between GPUs
NVIDIA: Attribute Boundedness
Interleave data when possible into fewer VB streams:
at least 8 scalars per stream
Use Load() from Buffer or Texture instead
Dynamic VBs/IBs may reside in system memory and be accessed over PCIe:
consider a CopyResource to a USAGE_DEFAULT resource before using them (especially if they are used multiple times across several passes)
Passing too many attributes from VS to PS may also be a bottleneck
packing and Load() also apply in this case
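The interleaving advice above can be sketched on the CPU side as follows. The `InterleavedVertex` layout (position, normal, UV) and the `interleave` helper are hypothetical; they only illustrate packing three separate streams into a single stream carrying 8 scalars per vertex.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout: 3 + 3 + 2 = 8 scalars in a single stream.
struct InterleavedVertex {
    float position[3];
    float normal[3];
    float uv[2];
};
static_assert(sizeof(InterleavedVertex) == 8 * sizeof(float),
              "expected 8 tightly packed floats");

// Merge three separate streams into one interleaved stream.
// Inputs hold 3, 3 and 2 floats per vertex respectively.
std::vector<InterleavedVertex> interleave(const std::vector<float>& positions,
                                          const std::vector<float>& normals,
                                          const std::vector<float>& uvs) {
    // Use the smallest complete vertex count so no input is read out of range.
    const std::size_t count =
        std::min({positions.size() / 3, normals.size() / 3, uvs.size() / 2});
    std::vector<InterleavedVertex> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t c = 0; c < 3; ++c) {
            out[i].position[c] = positions[3 * i + c];
            out[i].normal[c] = normals[3 * i + c];
        }
        for (std::size_t c = 0; c < 2; ++c) {
            out[i].uv[c] = uvs[2 * i + c];
        }
    }
    return out;
}
```

The resulting array would be uploaded as one vertex buffer with a 32-byte stride, in place of three buffers bound to separate input slots.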
NVIDIA: Depth Buffer Formats
Use DXGI_FORMAT_D24_UNORM_S8_UINT
DXGI_FORMAT_D32_FLOAT should offer very similar performance, but may have lower ZCULL efficiency
Avoid DXGI_FORMAT_D16_UNORM
will not save memory or increase performance
CSAA will increase memory footprint
NVIDIA: ZCULL Considerations
Coarse Z culling is transparent, but it may underperform if:
The depth test changes direction while writing depth (this disables Z culling entirely!)
The depth buffer was written using a different depth test direction than the one used for testing (testing is less efficient)
Stencil writes are enabled while testing (this avoids a stencil clear, but may kill performance)
The DepthStencilView has a Texture2D[MS]Array dimension (on GeForce 8 series)
MSAA is used (less efficient)
Too many large depth buffers are allocated (harder for the driver to manage)