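If you count from the input, a few hundred stages is not surprising. But if you mean only the shader core, it is probably on the order of 20 stages. For example, the CUDA Programming Guide says "The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them." Assuming 8 scalar processors per multiprocessor (as on G80-class hardware), 192 threads works out to 24 threads per processor, so the SIMD pipeline can be viewed as having roughly 24 stages.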
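The arithmetic behind that estimate can be sketched in a few lines. This is only a back-of-the-envelope model, not anything from the Programming Guide: the function name, the figure of 8 scalar processors per multiprocessor, and the round-robin issue assumption are all mine. Only the 192-thread figure comes from the quote, and 32 is the CUDA warp size.

```python
# Rough model: if threads are issued round-robin, one per scalar processor
# per cycle, a thread's next instruction is issued only after every other
# active thread has had its turn. The number of threads needed to hide the
# read-after-write delay therefore approximates the pipeline depth in cycles.

WARP_SIZE = 32                 # threads per warp in CUDA
ACTIVE_THREADS = 192           # figure quoted from the CUDA Programming Guide
SCALAR_PROCESSORS = 8          # ASSUMPTION: G80-class multiprocessor


def estimated_pipeline_depth(active_threads: int, scalar_processors: int) -> int:
    """Cycles between two consecutive instructions of the same thread
    (hypothetical helper, illustrative only)."""
    if scalar_processors <= 0:
        raise ValueError("scalar_processors must be positive")
    return active_threads // scalar_processors


# Same number seen per warp: 6 warps, each taking 4 cycles to issue.
warps = ACTIVE_THREADS // WARP_SIZE
cycles_per_warp = WARP_SIZE // SCALAR_PROCESSORS
depth = estimated_pipeline_depth(ACTIVE_THREADS, SCALAR_PROCESSORS)

print(f"{warps} warps x {cycles_per_warp} cycles = {warps * cycles_per_warp} cycles")
print(f"estimated depth: {depth} stages")
```

Both views give the same figure: 192 / 8 threads per processor, or 6 warps times 4 issue cycles per warp. If the multiprocessor had a different number of scalar processors, the estimate would change accordingly.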