MIPS-Like CPU: Godson-2 microarchitecture documents (in English), updated with the latest godson2-SMT document

Edison · posted 2006-6-25 10:33

"Godson-2 has two fix-point functional units, two floating-point functional units, and one memory accessunit. The floating-point units can also execute 32- or 64-bit fix-point instructions and 8- or 16-bit SIMD fix point instructions through extension of the fmt field ofthe floating-point instructions."

"The basic pipeline stages of Godson-2 include instruction fetch, pre-decode, decode, register rename, dispatch,issue, register read, execution, and commit. Fig.lshows major sections of Godson-2."

"Godson-2 implementsthe merged approach and has a 64-entry physical register file for fix-point and floating-point register rename respectively. Correspondingly, two 64-entry physical register-mapping tables (PRMT) are maintained to build the relationship between physical and architectural registers."

"Godson-2 has two independent group reservation stations. Fix-point and memory instructions are sent tothe fix-point reservation station. Floating-point instructions are sent to the floating-point reservation station.Each reservation station has 16 entries and can acceptas many as four instructions per cycle."

"Godson-2 has one fix-point physical register file and one floating-point physical register file, both with the size of 64 x 64."

"ALU1 executes fix-point addition, subtraction, logical,shift, comparison, trap, and branch instructions. All ALU1 instructions are executed and written back in one cycle."

"ALU2 executes fix-point addition, subtraction, logical,shift, comparison, multiplication, and division instructions.Fix-point multiplication is fully pipelinedand has a latency of four cycles. Fix-point division usesthe SRT algorithm and is not fully pipelined, the latencyof fix-point division ranges from 4 to 37 cycles dependingon the operands. All other ALU2 instructions canbe executed and written back in one cycle."

"The fully pipelined FALU1 executes floating-point addition, subtraction, absolute, negation, conversion,comparison, and branch instructions. The floating-point absolute, negation, comparison and branch are two-cycle instructions, while the latency of floating-point addition,subtraction, and conversion instructions is four-cycle."

"FALU2 executes floating-point multiplication, division,and square root instructions. The fully pipelined floating-point multiplication uses two-bit Booth-encoded Wallace tree algorithm and has a latency of five cycles.The division and square root use the SRT algorithm andare not fully pipelined. The latency of single/double precision floating-point division ranges from 4 to 10/17 cycles,the latency of single/double precision floating-point square root ranges from 4 to 16/31 cycles, depending onthe operands."

"Besides executing all MIPS III floating-point instructions,the floating-point functional units can also execute paired-single floating-point instructions which calculate two single precision operations (addition, subtraction and multiplication) in the 64-bit datapath, 32- or 64-bit fix-point instructions (arithmetic, logic, shift, compare,and branch), and 8- or 16-bit SIMD fix-point instruction through extension of the fmt field of the floating-point in structions."

"The interface of the Godson-2 processor supports split read and R5000 like external level two cache. The size of the external cache ranges from 256KB to 8MB."

"Loads and stores enter the queue out-of-order, but an in-order architectural memory model is maintained. Multiple cache misses and hit under miss are allowed."

"Godson-2 does not retry a memory access in case of cache miss or hazards. Using a physical address CAM, the memory access queue dynamically performs disambiguation and forwarding between accesses. When a load enters the queue, it checks all the older stores for possible bypass for each byte it needs. When a store enters the queue, it checks all the younger loads in front of another tyounger store to the same byte in the queue to decide whether to forward value to them. The queue also snoops cache refill and replace operations."

"The queue has four read ports. The first read port is used to select first result-ready instruction and write back its result. Cache hit loads are written back even when there is pending stores before it. If late coming store should forward its value to the speculatively written back load, the load and its following instructions will be cancelled. The second read port is used to select the first committed write-ready store and write its value to data cache. A store is write-ready when the value to store is valid and it has been committed (that is, cannot be cancelled). The third read port is used to issue miss request to the next level memory. Uncached accesses and exception handling use the last read port."

"Our future work includes implementing a special Java co-processor and exploiting multithreading parallelism through putting multiple processors in the same chip."

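hopetoknow2 · posted 2006-6-25 10:53

The interesting part is load speculation: Godson-2's memory disambiguation is very similar to Conroe's in some respects. Godson-2 can speculate, moving a later load ahead of a pending store.

The difference is that Godson-2 has no predictor; a cache hit is the only condition. It relies on a simple assumption: if a load hits in the cache, it does not depend on an earlier store. If the speculation fails, the load and the instructions after it are all cancelled from the pipeline, their ROB entries are invalidated, and they are fetched again from the L1 I-cache and re-executed. (This also differs from the P4's replay: the P4 can keep instructions in a replay queue, so it does not have to fetch them from the trace cache again and passes through far fewer pipeline stages.)

Conroe uses a predictor to decide whether to speculate, which improves accuracy.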
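
The predictor-free policy discussed in this post (write back a cache-hit load early, cancel it if an older store later turns out to overlap) can be sketched roughly like this in Python. All names here are hypothetical, sequence numbers stand in for program order, and the sketch ignores the per-byte shadowing by intervening stores that the real queue handles, so treat it as an illustration of the idea only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpecLoad:
    seq: int   # program-order sequence number (smaller = older), assumed
    addr: int  # physical address of the first byte read
    size: int  # number of bytes read


def overlaps(a: int, a_size: int, b: int, b_size: int) -> bool:
    """True if byte ranges [a, a + a_size) and [b, b + b_size) intersect."""
    return a < b + b_size and b < a + a_size


def first_violated_load(store_seq: int, store_addr: int, store_size: int,
                        written_back: List[SpecLoad]) -> Optional[int]:
    """Return the sequence number of the oldest speculatively written-back
    load that is younger than the late store and overlaps it, or None."""
    victims = [
        ld.seq for ld in written_back
        if ld.seq > store_seq
        and overlaps(ld.addr, ld.size, store_addr, store_size)
    ]
    return min(victims) if victims else None


def cancel_from(rob: List[int], seq: int) -> Tuple[List[int], List[int]]:
    """Split the in-flight sequence numbers into those kept and those
    cancelled; the cancelled ones have to be fetched and executed again."""
    kept = [s for s in rob if s < seq]
    refetch = [s for s in rob if s >= seq]
    return kept, refetch
```

A predictor-based scheme like the one attributed to Conroe above would add a decision before the early write-back; here the only gate is the cache hit, so the cost of a wrong guess is the full cancel and refetch.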

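hopetoknow2 · posted 2006-6-25 11:53

The Chinese Academy of Sciences' Godson design uses Artisan's IC layout library.

PKUnity (北大众志), by contrast, uses SpringSoft's IC layout library. Layouts are drawn with Mentor's IC layout design tools.

For now the emphasis is on schematics, code synthesis, and applying existing IC libraries; the focus is on implementing the architecture rather than on the process technology. There is no lower-level co-design.

Going from RTL to GDSII necessarily relies on those companies' cell libraries.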

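vivalinux · posted 2006-6-26 06:11

Godson-2 processor design: a report from the Institute of Computing Technology, Chinese Academy of Sciences, dated November 2004.

Godson-2 microarchitecture: a paper published in the English-language Journal of Computer Science and Technology.

[ Last edited by vivalinux on 2006-6-26 06:14 ]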

netsnake · posted 2006-6-26 11:56

I can't really follow it, but I'll download it and have a look anyway. Thanks, V~

oxeast · posted 2006-10-25 02:17

Producing competitive products is the right path :lol:
Otherwise it just spends the state's money for too little return :unsure:

Edison · posted 2006-11-14 19:37

This time I found what should be the latest document: a paper on the microarchitecture and analysis of the SMT version of Godson-2.

SM5 · posted 2006-11-15 19:13

Is it on sale yet? :wub: :wub:

oxeast · posted 2006-11-29 15:06

A box has come out, and it doesn't look cheap :lol:

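Edison · posted 2007-2-6 00:16

Download link for the paper "Implementing a 1GHz Four-Issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology", published in the first 2007 issue of the Journal of Computer Science and Technology:

Chinese abstract, translated: "Godson-2E is a high-performance general-purpose RISC processor with a four-issue architecture that implements the 64-bit MIPS instruction set. This paper describes the microarchitecture design and physical design of Godson-2E. Godson-2E uses aggressive out-of-order execution and memory hierarchy techniques to improve performance. It is built in a 90nm CMOS process with seven metal layers, and its physical design is based on standard cells plus some bit-level manual placement and some custom macro cells. Godson-2E reaches a clock frequency of 1GHz and scores over 500 on SPEC CPU2000."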

http://jcste.ict.ac.cn/paper/hww_071.pdf

Edison · posted 2008-6-6 15:18

The latest Godson-2 SMT research results are out:

http://jcst.ict.ac.cn/paper/8214.pdf
