|
"Godson-2 has two fix-point functional units, two floating-point functional units, and one memory accessunit. The floating-point units can also execute 32- or 64-bit fix-point instructions and 8- or 16-bit SIMD fix point instructions through extension of the fmt field ofthe floating-point instructions."
"The basic pipeline stages of Godson-2 include instruction fetch, pre-decode, decode, register rename, dispatch,issue, register read, execution, and commit. Fig.lshows major sections of Godson-2."
"Godson-2 implementsthe merged approach and has a 64-entry physical register file for fix-point and floating-point register rename respectively. Correspondingly, two 64-entry physical register-mapping tables (PRMT) are maintained to build the relationship between physical and architectural registers."
"Godson-2 has two independent group reservation stations. Fix-point and memory instructions are sent tothe fix-point reservation station. Floating-point instructions are sent to the floating-point reservation station.Each reservation station has 16 entries and can acceptas many as four instructions per cycle."
"Godson-2 has one fix-point physical register file and one floating-point physical register file, both with the size of 64 x 64."
"ALU1 executes fix-point addition, subtraction, logical,shift, comparison, trap, and branch instructions. All ALU1 instructions are executed and written back in one cycle."
"ALU2 executes fix-point addition, subtraction, logical,shift, comparison, multiplication, and division instructions.Fix-point multiplication is fully pipelinedand has a latency of four cycles. Fix-point division usesthe SRT algorithm and is not fully pipelined, the latencyof fix-point division ranges from 4 to 37 cycles dependingon the operands. All other ALU2 instructions canbe executed and written back in one cycle."
"The fully pipelined FALU1 executes floating-point addition, subtraction, absolute, negation, conversion,comparison, and branch instructions. The floating-point absolute, negation, comparison and branch are two-cycle instructions, while the latency of floating-point addition,subtraction, and conversion instructions is four-cycle."
"FALU2 executes floating-point multiplication, division,and square root instructions. The fully pipelined floating-point multiplication uses two-bit Booth-encoded Wallace tree algorithm and has a latency of five cycles.The division and square root use the SRT algorithm andare not fully pipelined. The latency of single/double precision floating-point division ranges from 4 to 10/17 cycles,the latency of single/double precision floating-point square root ranges from 4 to 16/31 cycles, depending onthe operands."
"Besides executing all MIPS III floating-point instructions,the floating-point functional units can also execute paired-single floating-point instructions which calculate two single precision operations (addition, subtraction and multiplication) in the 64-bit datapath, 32- or 64-bit fix-point instructions (arithmetic, logic, shift, compare,and branch), and 8- or 16-bit SIMD fix-point instruction through extension of the fmt field of the floating-point in structions."
"The interface of the Godson-2 processor supports split read and R5000 like external level two cache. The size of the external cache ranges from 256KB to 8MB."
"Loads and stores enter the queue out-of-order, but an in-order architectural memory model is maintained. Multiple cache misses and hit under miss are allowed."
"Godson-2 does not retry a memory access in case of cache miss or hazards. Using a physical address CAM, the memory access queue dynamically performs disambiguation and forwarding between accesses. When a load enters the queue, it checks all the older stores for possible bypass for each byte it needs. When a store enters the queue, it checks all the younger loads in front of another tyounger store to the same byte in the queue to decide whether to forward value to them. The queue also snoops cache refill and replace operations."
"The queue has four read ports. The first read port is used to select first result-ready instruction and write back its result. Cache hit loads are written back even when there is pending stores before it. If late coming store should forward its value to the speculatively written back load, the load and its following instructions will be cancelled. The second read port is used to select the first committed write-ready store and write its value to data cache. A store is write-ready when the value to store is valid and it has been committed (that is, cannot be cancelled). The third read port is used to issue miss request to the next level memory. Uncached accesses and exception handling use the last read port."
"Our future work includes implementing a special Java co-processor and exploiting multithreading parallelism through putting multiple processors in the same chip." |
|