http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&p=1&f=G&l=50&d=PG01&S1=%28%22Parallel+Array+Architecture+Graphics+Processor%22.TTL.%29&OS=TTL/"Parallel+Array+Architecture+for+a+Graphics+Processor"&RS=TTL/"Parallel+Array+Architecture+for+a+Graphics+Processor"
Timothy Farrar dug up some interesting things in the patent above. It may not be directly related to NVIDIA's next-generation architecture, but some of the details it describes are worth thinking about:
1. Parallel Fixed Function Units - This patent refers to optional duplication of all fixed-function hardware (setup, raster, etc.) for an "increase in the processing power or redundancy of the system".
2. Possible Writable On-Chip Memory - CUDA's shared memory is referred to as the shared register file in the patent. In a previous patent, on-chip memory refers to cached read-only uniforms and constants, but this patent adds, "On-chip memory is advantageously used to store data that is expected to be used in multiple threads, such as coefficients of attribute equations, which are usable in pixel shader programs, and/or other program data, such as results produced by executing general purpose computing program instructions". This might not mean writable (the results of GP computing could have come from a past kernel call), but it certainly leaves the door open for a writable cache. (A small shared-memory sketch appears after this list.)
3. General Purpose Instruction Issue - To my knowledge, switching from GL|DX to CUDA on current (or perhaps older) hardware requires an expensive "context switch". That could be a driver issue, or it could be that older hardware wasn't able to interleave execution of programs beyond vert/geo/pixel shaders; I don't know. DX11 has a good number of different shader types in the rendering pipeline, so no doubt the hardware is going to have to be able to efficiently issue instructions from multiple different programs/kernels. One wouldn't want too many running at the same time on a given core (because cache pressure increases), but enough to keep the pipeline going, to avoid bubbles during draw call transitions, and possibly to support interleaving a mix of ALU-heavy and MEM-heavy programs to keep hardware utilization high in all ways. This patent seems to support that. (See the multi-stream sketch after this list.)
4. No MIMD - I do NOT see anything direct in the patent about Multiple Instruction Multiple Data execution. Instruction issue is SIMD to PEs: "for any given processor cycle the same instruction issued to all P processing engines [PEs]", where P is the SIMD width. However, this patent adds that the "instruction unit may issue multiple instructions per processing cycle". Could multiple issue enable PEs to keep higher ALU efficiency when PEs are predicated? The patent says, "a SIMD group may include fewer than P threads, in which case some of the processing engines will be idle during cycles when that SIMD group is being processed", so perhaps not. (A small divergence sketch appears after this list.)
5. Supergroup SIMD -> Possible Lower Overhead or Dynamic Warp Formation? - From the patent: "SIMD groups containing more than P threads (supergroups) can be defined. A supergroup is defined by associating the group index values of two (or more) of the SIMD groups with each other. When issue logic selects a supergroup, it issues the same instruction twice on two successive cycles". This better amortizes the instruction issue cost over more instructions, and can be used in combination with double clocking the PEs. I see three possible cases here. (a) This is in GT200 (and maybe previous generations): 8-wide SIMD double-clocked to a 16-wide half-warp, and a hardware supergroup to a 32-wide warp, maybe with the limitation that groups need to be sequential in order to form a supergroup. (b) Or the half-warp to 32-wide warp was done with paired instructions on current hardware, and this is just a more efficient hardware path. (c) Or with future hardware (possibly GT3xx) this is a way to increase SIMD efficiency through dynamic warp formation. The idea being that, keeping 8-wide SIMD and a double-clocked ALU, instruction issue could pick a hi/lo pair of 16-wide SIMD half-warps to issue based on a common instruction. Clearly other granularities and options are possible. (A toy supergroup-issue model appears after this list.)
6. Work Distribution - The patent leaves open many options: "shaders of a certain type may be restricted to executing in certain processing clusters or in certain cores", or "in some embodiments, each core might have its own texture pipeline or might leverage general-purpose functional units to perform texture computations". Data distribution is covered in multiple forms. Broadcast: "receive information indicating the availability of processing clusters or individual cores to handle additional threads of various types and select a destination". Ring: "input data is forwarded from one processing cluster to the next until a processing cluster with capacity to process the data accepts the data". Function of input: "processing clusters are selected based on properties of the input data". (A toy ring-distribution sketch appears after this list.)
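
On point 2, here is a minimal CUDA sketch (my own illustration, not from the patent) of the "data that is expected to be used in multiple threads" idea: per-block attribute-equation coefficients are staged in shared memory, the on-chip storage the patent calls the shared register file, and then read by every thread in the block. The kernel name and data layout are hypothetical.

```cuda
// Hypothetical example: stage per-block coefficients (e.g. attribute-equation
// coefficients) in shared memory so every thread in the block can reuse them.
__global__ void evalAttributes(const float3 *coeffs,   // one {a,b,c} per block
                               const float2 *pixels,   // per-thread (x,y)
                               float        *out,
                               int           n)
{
    __shared__ float3 c;                  // the patent's "shared register file"

    if (threadIdx.x == 0)
        c = coeffs[blockIdx.x];           // one thread loads, all threads reuse
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 p = pixels[i];
        out[i] = c.x * p.x + c.y * p.y + c.z;   // a*x + b*y + c
    }
}
```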
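On point 3, the desire to interleave multiple programs is already visible at the API level through CUDA streams. The sketch below is a hypothetical example, assuming kernels named aluHeavyKernel and memHeavyKernel and a device/driver that can actually overlap them; it merely shows two independent programs being issued so the hardware is free to interleave compute-bound and bandwidth-bound work.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for two different programs the hardware
// could interleave: one ALU-heavy, one memory-heavy.
__global__ void aluHeavyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 256; ++k)      // lots of math, little memory traffic
            x[i] = x[i] * 1.0001f + 0.5f;
}

__global__ void memHeavyKernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];                   // pure bandwidth, little math
}

// Host side: issue both kernels on separate streams so the GPU is free to
// interleave them (hardware and driver permitting).
void launchBoth(float *dX, const float *dSrc, float *dDst, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int block = 256;
    int grid  = (n + block - 1) / block;
    aluHeavyKernel<<<grid, block, 0, s0>>>(dX, n);
    memHeavyKernel<<<grid, block, 0, s1>>>(dSrc, dDst, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```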
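On point 4, the SIMD-plus-predication behavior is easy to see from CUDA. In the hypothetical kernel below, a data-dependent branch that differs within a warp forces both paths to be issued with inactive lanes masked off, which is exactly the ALU-efficiency loss the quoted passages imply.

```cuda
// Hypothetical kernel: when mask[i] differs within a single warp, the SIMD
// instruction unit issues both branch paths; lanes whose predicate is false
// do no useful work on the path they did not take, lowering ALU efficiency.
__global__ void divergentKernel(const int *mask, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (mask[i])
        data[i] = data[i] * 2.0f + 1.0f;   // path A: only some lanes active
    else
        data[i] = sqrtf(data[i]);          // path B: the remaining lanes
}
```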
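On point 5, this toy host-side model (entirely my own construction, not from the patent) shows the supergroup idea: pairing two SIMD groups so that one fetched instruction is issued on two successive cycles, amortizing issue cost over 2*P threads. The struct and function names are invented for illustration.

```cuda
#include <cstdio>

// Toy model of supergroup issue, illustrative only. P is the SIMD width.
// A supergroup associates two SIMD group indices, so the issue logic sends
// the same instruction on two successive cycles, one group per cycle.
struct SimdGroup {
    int groupIndex;   // index of this SIMD group
    int pairedWith;   // index of the associated group, or -1 if not a supergroup
};

void issueSelected(const SimdGroup &g, int instruction, int &cycle)
{
    // Cycle 1: issue the instruction to the P processing engines for group g.
    printf("cycle %d: issue instr %d to group %d\n", cycle++, instruction, g.groupIndex);

    // Cycle 2: if g belongs to a supergroup, reissue the same instruction for
    // the paired group: one fetch/decode amortized over both groups.
    if (g.pairedWith >= 0)
        printf("cycle %d: issue instr %d to group %d (supergroup)\n",
               cycle++, instruction, g.pairedWith);
}

int main()
{
    int cycle = 0;
    SimdGroup lone  = { 0, -1 };      // ordinary SIMD group
    SimdGroup super = { 2,  3 };      // groups 2 and 3 form a supergroup

    issueSelected(lone,  7, cycle);   // 1 issue cycle
    issueSelected(super, 8, cycle);   // 2 issue cycles, same instruction
    return 0;
}
```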
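On point 6, the ring option can be visualized with a short toy sketch (again my own illustration, not NVIDIA's logic): work is forwarded from one processing cluster to the next around the ring until a cluster with spare capacity accepts it. The cluster bookkeeping here is invented.

```cuda
// Toy model of the "ring" work-distribution option: input data is forwarded
// from one processing cluster to the next until a cluster with capacity
// accepts it. Illustrative only.
struct Cluster {
    int id;          // which processing cluster this is
    int freeSlots;   // how many more threads/groups it can accept
};

// Returns the id of the cluster that accepted the work, or -1 if the ring is full.
int distributeOnRing(Cluster *clusters, int numClusters, int startCluster)
{
    for (int step = 0; step < numClusters; ++step) {
        Cluster &c = clusters[(startCluster + step) % numClusters];
        if (c.freeSlots > 0) {   // capacity available: this cluster accepts the data
            --c.freeSlots;
            return c.id;
        }
        // otherwise forward the data to the next cluster on the ring
    }
    return -1;                   // no cluster could accept the work
}
```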
Incidentally, "shared register" is also the term DX11 compute shaders (CS) use for this kind of memory, and the Work Distribution described here clearly carries DX11 characteristics.