
Revisiting AMD's reverse-threading technology (speculatively executing threads of instructions)

1#
Posted on 2006-6-24 11:32
BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention is related to the field of microprocessors and, more particularly, to multithreading in multiprocessors.

2. Description of the Related Art

Computer systems employing multiple processing units hold a promise of economically accommodating performance capabilities that surpass those of current single-processor based systems. Within a multiprocessing environment, rather than concentrating all the processing for an application in a single processor, tasks are divided into groups or "threads" that can be handled by separate processors. The overall processing load is thereby distributed among several processors, and the distributed tasks may be executed simultaneously in parallel. The operating system software divides various portions of the program code into the separately executable threads, and typically assigns a priority level to each thread.

Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.

An important feature of microprocessors is the degree to which they can take advantage of parallelism. Parallelism is the execution of instructions in parallel, rather than serially. Superscalar processors are able to identify and utilize fine grained instruction level parallelism by executing certain instructions in parallel. However, this type of parallelism is limited by data dependencies between instructions. Further, as mentioned above, computer systems which contain more than one processor may improve performance by dividing the workload presented by the computer processes. By identifying higher levels of parallelism, multi-processor computer systems may execute larger segments of code, or threads, in parallel on separate processors. Because microprocessors and operating systems cannot identify these segments of code which are amenable to parallel multithreaded execution, they are identified by the application code itself. Generally, the operating system is responsible for scheduling the various threads of execution among the available processors in a multi-processor system.

One problem with parallel multithreading is that the overhead involved in scheduling the threads for execution by the operating system is such that shorter segments of code cannot efficiently take advantage of parallel multithreading. Consequently, potential performance gains from parallel multithreading are not attainable.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a microprocessor and method as described herein. Additional circuitry is included in a form of symmetrical multiprocessing system which enables the scheduling and speculative execution of multiple threads on multiple processors without the involvement and inherent overhead of the operating system. Advantageously, parallel multithreaded execution is more efficient and performance is improved.

Broadly speaking, a multiprocessor computer is contemplated comprising a plurality of processors, wherein each of said processors includes a register file, a reorder buffer, and circuitry to support speculative multithreaded execution. In addition, the multiprocessor computer includes one or more reorder buffer tag translation buffers and a thread control device. The thread control device is configured to store and transmit instructions between the processors. The thread control device and instructions support parallel speculative multithreaded execution.

In addition, a method is contemplated which comprises performing thread setup for execution of a second thread on a second processor, wherein the setup comprises a first processor conveying setup instructions to the second processor, and wherein the setup instructions are speculatively executed on the second processor. A startup instruction is conveyed from the first processor to the second processor, which begins speculative execution of the second thread on the second processor. The second processor begins speculative execution of the second thread in parallel with the execution of a thread on the first processor, in response to receiving the startup instruction. Execution of the second thread is terminated, in response to retiring a termination instruction in the second processor. Finally, the results of the execution of the second thread are conveyed to the first processor, in response to the second processor receiving a retrieve result instruction, where the retrieve result instruction is speculatively executed by the second processor.
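For concreteness, a rough software model of the five steps above. This is only an analogy: in the patent, each step is a new instruction handled by the processors and the thread control device with no operating-system involvement, whereas here ordinary POSIX threads stand in for that hardware so the flow can be built (gcc -pthread) and run; all names below are made up for the sketch.

#include <pthread.h>
#include <stdio.h>

static long thread2_result;                      /* held by the "second processor" until retrieved */

static void *thread2_body(void *arg) {           /* the code the setup step points at              */
    long n = *(long *)arg, s = 0;
    for (long i = n / 2; i < n; i++)             /* second half of the work                        */
        s += i;
    thread2_result = s;                          /* termination of the second thread               */
    return NULL;
}

int main(void) {
    long n = 1000, s1 = 0;
    pthread_t t2;

    /* steps 1-2: thread setup and the startup instruction for the second processor */
    pthread_create(&t2, NULL, thread2_body, &n);

    /* step 3: the first thread keeps executing in parallel on the first processor */
    for (long i = 0; i < n / 2; i++)
        s1 += i;

    /* step 4: execution of the second thread terminates */
    pthread_join(t2, NULL);

    /* step 5: the second thread's results are retrieved and consolidated */
    printf("sum of 0..%ld = %ld\n", n - 1, s1 + thread2_result);
    return 0;
}
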
2#
OP | Posted on 2006-6-24 11:35
Flowchart (1): [attached image]


3#
OP | Posted on 2006-6-24 11:36
Functional block diagram: [attached image]


4#
Posted on 2006-6-24 14:04
Let's take a look at Intel.

[Last edited by hopetoknow2 on 2006-6-24 14:15]

5#
Posted on 2006-6-24 14:07
Originally posted by Edison on 2006-6-24 11:35:
Flowchart (1):

The explanation is as follows:
In general, processing core 14A executes single threaded code (block 330) until a multithread setup instruction is encountered. When processing core 14A encounters a multithread setup instruction (block 332), processing core 14A conveys thread setup instructions to ICU 320A which conveys them to FIFO 310A. ICU 320B retrieves instructions from FIFO 310A and transfers them to processing core 14B. Subsequently, master processor 12A conveys a thread 2 startup instruction (block 334) to ICU 320A which places the instruction into FIFO 310A. ICU 320B retrieves the thread startup instruction from FIFO 310A and transfers it to processing core 14B. Processing core 14B then begins fetching and executing the thread 2 code (block 338) while processor 12A continues execution of thread 1 code (block 336). Upon execution and retirement of a JOIN instruction (blocks 340 and 342) by both processors 12, slave processor 12B terminates execution of thread 2 and single threaded execution resumes with master processor 12A. Master processor 12A may then convey another instruction to processor 12B which causes slave processor 12B to convey thread 2 execution results to master processor 12A via FIFO 310B. Master processor 12A may then consolidate execution results from the separate threads (block 344) and continue normal execution (block 346). To summarize, master processor 12A sets up a second thread for execution on slave processor 12B. Both the master 12A and slave 12B processors execute threads in parallel. Master processor 12A then obtains the second thread execution results from the slave processor.
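A toy, single-threaded model of the message flow just described, which may make the block numbers easier to follow. Everything here is illustrative: the FIFO struct, the message names, and the printouts only mimic the ICU/FIFO hand-off in software and are not AMD's actual hardware interface.

#include <stdio.h>

enum msg_kind { MSG_SETUP, MSG_STARTUP, MSG_RESULT_REQ, MSG_RESULT };

struct fifo {                                    /* stands in for FIFO 310A / 310B */
    enum msg_kind buf[8];
    int head, tail;                              /* toy: no overflow handling      */
};

static void fifo_push(struct fifo *f, enum msg_kind m) { f->buf[f->tail++ % 8] = m; }
static enum msg_kind fifo_pop(struct fifo *f) { return f->buf[f->head++ % 8]; }
static int fifo_empty(const struct fifo *f) { return f->head == f->tail; }

int main(void) {
    struct fifo to_slave = {0}, to_master = {0};

    /* master, blocks 332/334: convey the setup and startup messages via its ICU */
    fifo_push(&to_slave, MSG_SETUP);
    fifo_push(&to_slave, MSG_STARTUP);

    /* slave ICU, block 338: drain the FIFO and act on each message */
    while (!fifo_empty(&to_slave)) {
        switch (fifo_pop(&to_slave)) {
        case MSG_SETUP:   puts("slave: speculatively execute the setup instructions");   break;
        case MSG_STARTUP: puts("slave: begin fetching and executing the thread 2 code"); break;
        default: break;
        }
    }

    /* after the JOIN retires (blocks 340/342), the master asks for the results */
    fifo_push(&to_slave, MSG_RESULT_REQ);
    if (fifo_pop(&to_slave) == MSG_RESULT_REQ)
        fifo_push(&to_master, MSG_RESULT);       /* slave returns the thread 2 results */

    if (!fifo_empty(&to_master) && fifo_pop(&to_master) == MSG_RESULT)
        puts("master: consolidate thread 2 results (block 344) and continue (block 346)");
    return 0;
}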

[Last edited by hopetoknow2 on 2006-6-24 14:10]


6#
Posted on 2006-6-24 15:18
cho, you've made a big mistake here. This is not "reverse hyper-threading" at all. It says nothing about speeding up a single-threaded program on physically multithreaded hardware.
Rather, it is about speeding up multithreaded programs on physically multithreaded hardware. The program already contains multiple threads; instead of waiting for all the conditions to be ready, those threads are started early and executed speculatively -- the thread speculation here is simply not aimed at the problem of "speeding up a single-threaded program on physically multithreaded hardware".
Support for Speculative Thread Execution
State of the art superscalar processors have large instruction windows. Consequently, to wait for a Fork instruction to retire before thread startup may result in significant delays. To allow optimal thread startup, the mechanism should allow for speculative startup of threads. This allows the second thread to startup and execute in the slave processor long before the Fork instruction retires in the master processor. Advantageously, performance of the multithreaded multiprocessor is improved.


It is Intel that has briefly talked about speeding up single-threaded programs on physically multithreaded hardware. For example, with a single-threaded program the hardware can key off call instructions or indirect jumps: since a call can reasonably be assumed to eventually return to the instruction after the call site, an extra physical thread (a physical core, an SMT context, and so on) can be assigned to run a speculative thread from that point.
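A sketch of that idea in plain C. The code itself is ordinary single-threaded code; the comments mark where such call-return speculation could, hypothetically, run ahead on a spare core or SMT context. No actual Intel or AMD mechanism is implied.

#include <stdio.h>

static long slow_callee(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)        /* the main thread spends its time here */
        sum += i;
    return sum;
}

int main(void) {
    /* call: hardware could predict that control eventually returns to the next
       statement and start a speculative thread there on a spare context */
    long x = slow_callee(1000000);

    long y = 2 * 21;                    /* independent of x: a speculative thread
                                           could compute this early, and the result
                                           is kept once the call really returns */

    printf("%ld %ld\n", x, y);          /* depends on x: speculation must stop here
                                           or be validated against the real return */
    return 0;
}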

7#
OP | Posted on 2006-6-24 20:17
Ran it through OCR: [attached image]

This patent just doesn't explain how AMD would carry out the single-thread split, but it at least shows that AMD has done research in this area.

The technique introduces four new instructions. Is AMD's approach to use a JIT compiler to do the binary translation?


8#
Posted on 2006-6-24 22:52
True, this patent doesn't go into that part.

It is a multithreading patent -- but as I keep saying, once the single-thread split has been done, couldn't the "anti-thread" approach be applied as well?

9#
Posted on 2006-6-24 22:59
Or could it be that you have simply misread that diagram?

Take an OpenMP program, for example: often one part of it is single-threaded, another part becomes multithreaded after a fork, and then a join leaves the multithreaded region and returns to single-threaded execution (a minimal example is sketched after this post).

The diagram is about the normal entry into and exit from multithreaded execution: the new instructions cut the overhead, the processors handle more of it automatically, and the operating-system overhead can be avoided.

It is not at all saying that an originally single-threaded program is later split into multiple threads and then executed by a CMP.
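For reference, a minimal OpenMP fork/join program of the kind described in this post: single-threaded code, a fork into a parallel region, the implicit join, and a return to a single thread (assuming a compiler with OpenMP support, e.g. gcc -fopenmp).

#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;

    printf("single-threaded part\n");

    /* fork: the runtime starts a team of threads for this region */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;                 /* iterations shared across the threads */
    /* implicit join at the end of the parallel region */

    printf("back to a single thread, sum = %f\n", sum);
    return 0;
}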

potomac (this user has been deleted)
10#
Posted on 2006-6-27 19:08
Notice: the author has been banned or deleted; the content is automatically hidden.

11#
OP | Posted on 2006-7-2 11:09
The hardware-based automatic parallelization technology that NEC announced at the end of last year:

NEC Develops Multicore Processor Technology Enabling Automatic Parallelization of Application Programs
- Dramatically reduces software development time & cost of multicore processors -
For immediate use: December 19, 2005

Tokyo, December 19, 2005 --- NEC Corporation today announced that it has succeeded in the development of multicore processor technology capable of performing automatic parallelization of application programs, without modifying them.

Key features of the multicore processor technology

(1) An automatic parallelizing compiler, capable of effective extraction of parallelism from an application program utilizing its profile information (1*).
(2) An additional instruction-set, designed to minimize parallelization overheads.
(3) Processor architecture, which efficiently handles speculative execution (2*).
(4) Implementation realized by a simple extension to conventional processors.  


The distinctive feature of this new technology is the ability of the automatic parallelizing compiler that utilizes profile information to aggressively exploit parallelization patterns, which are effective for accelerating the speed of application programs. In addition, although the parallelization is speculative, the speculation is almost always completely accurate. The speculation hardware works as a safety net by handling any rare misses, guaranteeing the correctness of the execution. This ensures that the compiler is not conservative in decisions concerned with these cases, resulting in an increase in the amount of parallelism exploited. The parallelism exploitation is supported by the speculative execution hardware that realizes efficient handling of detection of incorrect execution orders caused by the parallel execution of the program parts, cancellation of the incorrectly executed part, and re-execution of it. Moreover, the parallelization process can be performed in a practical period of time.
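As a toy model of the "speculate, detect, cancel, re-execute" safety net described above (purely illustrative, not NEC's mechanism): a worker thread runs ahead on the second half of an array under an assumption about a shared value, and the work is redone only if the assumption turns out to have been violated.

#include <pthread.h>
#include <stdio.h>

#define N 8
static const int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
static int scale = 2;                       /* value the speculation is about */
static long spec_sum;                       /* speculative result             */

static void *speculative_half(void *arg) {
    int assumed = *(const int *)arg;        /* snapshot of the assumed value  */
    long s = 0;
    for (int i = N / 2; i < N; i++)
        s += (long)data[i] * assumed;       /* runs ahead, speculatively      */
    spec_sum = s;
    return NULL;
}

int main(void) {
    int assumed_scale = scale;
    pthread_t t;
    pthread_create(&t, NULL, speculative_half, &assumed_scale);

    long first = 0;
    for (int i = 0; i < N / 2; i++)
        first += (long)data[i] * scale;     /* earlier program part; in an unlucky
                                               run it could have changed `scale` */
    pthread_join(t, NULL);

    if (scale != assumed_scale) {           /* detect misspeculation           */
        spec_sum = 0;                       /* cancel the incorrect result ... */
        for (int i = N / 2; i < N; i++)
            spec_sum += (long)data[i] * scale;   /* ... and re-execute         */
    }
    printf("total = %ld\n", first + spec_sum);
    return 0;
}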

In an increasingly networked society, the need for enhanced functionality and performance of terminals such as mobile phones and information appliances, while maintaining a low level of power consumption, is growing. Recently, many system-on-chips (SoCs) employing multicore and multiprocessor technology have been introduced practically to meet this expanding demand. This technology deploys multiple processor cores on a chip and effectively utilizes these multiple resources by parallelizing application programs. However, parallelization with conventional multiprocessor technology requires the manual modification of application source programs. Manual labor increases the development and verification cost for software development, which is in turn made more complex by the growing size and complexity of the software itself. Therefore, multiprocessor technology, which can automatically parallelize application programs without manual modification, has been long sought after in this field. However, nobody has succeeded in bringing automatic parallelization technology to a practical stage to date.

NEC believes that its automatic parallelization technology is the first to be brought to a stage of practical use. This is supported by the fact that NEC has succeeded in operating this technology on a field-programmable gate array (FPGA). Moreover, its implementation has confirmed that only a marginal hardware extension is required and that application program speed is actually accelerated.

The newly developed technology realizes automatic parallelization of application programs and a dramatic reduction in time and cost of parallelization. In addition, we have observed cases where automatic parallelization accelerates the speed of programs at a greater rate than that of manual parallelization. For example, one test showed that manual parallelization of an application program took four months of time with one person carrying out the task, however, automatic parallelization cut this time to just three minutes with no manual labor involved at all. In addition, the application program that has been parallelized manually runs 1.95 times faster with four processors than the original application program running with one processor. However, the application program that has been parallelized automatically runs 2.83 times faster with four processors, which indicates that automatic parallelization achieves greater acceleration than manual parallelization. This shows that automatic parallelization facilitates development of software with high functionality and performance through multicore and multiprocessor technology, at lower cost over a shorter time frame. This will lead to the provision of terminals such as cellular phones and information appliances with enhanced functionality and performance.

NEC will continue to advance the research and development of its multicore processor technology toward early release of products incorporating it.

12#
Posted on 2006-7-6 14:22
I think the ultimate recipe for speedup is still to design the software with parallel execution in mind from the start, rather than spending the already scarce transistor budget on this kind of single-thread splitting. With that many transistors you could add a couple more SSEx units and multimedia performance would shoot up immediately.

13#
Posted on 2006-7-8 00:28
The line of thinking above is exactly what the P4 adopted, and the facts have shown the time is not ripe yet.

14#
Posted on 2006-7-8 18:24
Right, the problem lies in how the Multi-Thread Setup at block 332 is carried out. The code was originally single-threaded, so how do you pull part of it out into a separate thread? And once it has been pulled out, if the two threads need to access each other's data, can the relevant code be added automatically? If that latter problem cannot be solved, the range of applications for "anti"-threading becomes very narrow...

15#
Posted on 2006-7-8 18:38
Originally posted by harleylg on 2006-7-8 18:24:
Right, the problem lies in how the Multi-Thread Setup at block 332 is carried out. The code was originally single-threaded, so how do you pull part of it out into a separate thread? And once it has been pulled out, if the two threads need to access each other's data, can the relevant code be added automatically? If that latter problem cannot be solved, the range of applications for "anti" ...

Haha, that is exactly the crux of the matter, and the root of the misunderstanding.

Block 332 does not mean turning a single thread into multiple threads; rather, by the time the program reaches 332, the code is already multithreaded code.

That is exactly where cho went wrong.

What the patent studies is how to use the new instructions to carry out the single-thread -> multi-thread and multi-thread -> single-thread transitions while avoiding operating-system overhead.

The program code itself already contains both single-threaded and multithreaded sections.

16#
Posted on 2008-3-4 22:01
AMD hasn't made enough money yet; we will have to wait patiently for this technology.

17#
Posted on 2008-3-4 22:23
...even a thread as old as this one gets dug up...

18#
Posted on 2008-3-5 09:08
If code could be added automatically and universally without affecting the original logic (in terms of its results), the hardware would already have to be quite intelligent. AMD's accumulated expertise has not reached that point yet.

[Last edited by AMD11 on 2008-3-5 09:12]

19#
Posted on 2008-3-5 09:58
The B3 stepping of AMD's Phenom keeps being delayed -- could it be getting ready to provide optimizations and direct support for this kind of technique?! Preparing a dam-bursting counterattack :w00t:

20#
Posted on 2008-3-5 10:34
Originally posted by HeavenPR on 2006-7-6 14:22:
I think the ultimate recipe for speedup is still to design the software with parallel execution in mind from the start, rather than spending the already scarce transistor budget on this kind of single-thread splitting. With that many transistors you could add a couple more SSEx units and multimedia performance would shoot up immediately.

I rather agree with this idea; after all, software developers understand their own requirements best.
