A.12.1 Floating-Point Scheduler
The floating-point logic of the AMD Athlon 64 and AMD Opteron processors is a high-performance, fully pipelined, superscalar, out-of-order execution unit. It is capable of accepting three macro-ops per cycle from any mixture of the following types of instructions:
• x87 floating-point
• 3DNow! technology
• MMX
• SSE
• SSE2
• SSE3
• SSE4a
The floating-point scheduler handles register renaming and has a dedicated 36-entry scheduler buffer organized as 12 lines of three macro-ops each. It also performs data superforwarding, micro-op issue, and out-of-order execution. The floating-point scheduler communicates with the ICU to retire a macro-op, to manage results of *COMI* and FP-to-INT movement and conversion instructions using a 64-bit-wide FP-to-INT bus, and to back out results from a branch misprediction.
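For example, a *COMI*-class compare and an FP-to-INT conversion both produce results that the integer core consumes over that bus. The following sketch is illustrative only (the register choices and branch target are assumptions, not taken from this guide):
comisd    xmm0, xmm1  ;Compare; the flag result crosses the
                      ; 64-bit FP-to-INT bus to the integer core
ja        is_above    ;Integer branch consumes those flags
                      ; (is_above is a placeholder label)
cvttsd2si eax, xmm2   ;FP-to-INT conversion; the result also
                      ; returns over the FP-to-INT bus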
Superforwarding is a performance optimization. It allows faster scheduling of a floating-point operation that depends on a register which is waiting to be filled by a pure load from memory. Instead of waiting for the first instruction to write its load data to the register and for the second instruction to then read it, the load data can be provided directly to the dependent instruction, much like regular forwarding between FPU-only operations. The result of the load is said to be "superforwarded" to the floating-point operation. In the following example, the FADD can be scheduled to execute as soon as the load operation fetches its data, rather than having to wait and read it out of the register file.
fld  [somefloat]   ;Load a floating-point value
                   ; from memory into ST(0)
fadd st(0), st(1)  ;The data from the load will be
                   ; forwarded directly to this instruction;
                   ; no need to read the register file
A.12.2 Floating-Point Execution Unit
The floating-point execution unit (FPU) has its own out-of-order execution control and datapath. The FPU handles all register operations for x87 instructions, all 3DNow! technology operations, all MMX operations, and all SSE, SSE2, SSE3, and SSE4a operations. The FPU consists of a stack renaming unit, a register renaming unit, a scheduler, a register file, and execution units that are each capable of computing and delivering results of up to 128 bits per cycle. Figure 10 shows a block diagram of the dataflow through the FPU.
[Figure 10: Floating-Point Unit block diagram]
As shown in Figure 10, the floating-point logic uses three separate execution positions or pipes (FADD, FMUL, and FSTORE). Details on which instructions can use which pipes are specified in Appendix C.
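For example, independent multiply, add, and store macro-ops can occupy all three pipes concurrently. The following sketch is illustrative only (the operands are assumptions; Appendix C gives the authoritative instruction-to-pipe mappings):
mulpd  xmm0, xmm1     ;Issues to the FMUL pipe
addpd  xmm2, xmm3     ;Independent of the multiply; can
                      ; issue to the FADD pipe in parallel
movapd [result], xmm4 ;Store micro-op uses the FSTORE pipe
                      ; (result is a placeholder operand)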
A.13 Load-Store Unit
The L1 data cache and load-store unit (LSU) are shown in Figure 11. The L1 data cache supports two 128-bit loads, two 64-bit store writes, or a mix of the two per cycle. The LSU consists of two queues, LS1 and LS2. LS1 can issue two L1 cache operations (loads or store tag checks) per cycle, and it can issue loads out of order, subject to certain dependency restrictions. LS2 holds requests that missed in the L1 cache after they probe out of LS1; store writes are performed exclusively from LS2. 128-bit stores are handled specially: they occupy two LS2 entries, and the store data is written as two 64-bit writes, as shown in the sketch below. Finally, the LSU helps ensure that the architectural load and store ordering rules are preserved, a requirement for AMD64 architecture compatibility.
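The following sketch exercises both behaviors (assuming 16-byte-aligned buffers addressed through rsi and rdi, which are illustrative choices, not requirements from this guide):
movapd xmm0, [rsi]    ;Two 128-bit loads; LS1 can issue
movapd xmm1, [rsi+16] ; both in the same cycle
addpd  xmm0, xmm1
movapd [rdi], xmm0    ;128-bit store: occupies two LS2 entries
                      ; and completes as two 64-bit writes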
A.14 Write Combining
AMD Family 10h processors provide four write-combining data buffers that allow four simultaneous streams. For details, see Appendix B “Implementation of Write-Combining” on page 227.
A.15 Integrated Memory Controller
AMD Family 10h processors provide an integrated low-latency, high-bandwidth DDR2 memory controller.
The memory controller supports:
• DRAM chips that are 4, 8, and 16 bits wide within a DIMM.
• Interleaving memory within DIMMs.
• ECC checking with double-bit detection and single-bit correction.
• Both dual-independent 64-bit channel and single 128-bit channel operation.
• Optimized scheduling algorithms and access pattern predictors to improve latency and achieved bandwidth, particularly for interleaved streams of read and write DRAM accesses.
• A data prefetcher.
Prefetched data is held in the memory controller itself and is not speculatively filled into the L1, L2, or L3 caches. The prefetcher can capture both positive and negative strides, unit and non-unit, measured in units of cache lines, as well as some more complicated access patterns.
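For example, a loop that walks a buffer backward one cache line per iteration presents a negative unit stride that the prefetcher can capture. This sketch is illustrative only (the registers and label are assumptions):
scan_down:
movapd xmm0, [rsi]    ;Touch the current 64-byte cache line
sub    rsi, 64        ;Step back one line
                      ; (negative unit stride)
cmp    rsi, rdi
jae    scan_down      ;Continue until the buffer start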
For the specifications of a particular processor's memory controller, see that processor's data sheet. For information on how to program the memory controller, see the BIOS and Kernel Developer's Guide for AMD Family 10h Processors, order# 31116.
A.16 HyperTransport™ Technology Interface
HyperTransport technology is a scalable, high-speed, low-latency, point-to-point, packetized link that:
• Enables high data transfer rates.
• Simplifies connectivity by replacing legacy buses and bridges.
• Reduces latencies and bottlenecks within systems.
When compared with traditional technologies, HyperTransport technology allows much faster data-transfer rates. For more information on HyperTransport technology, see the HyperTransport I/O Link Specification, available at www.hypertransport.org.
On AMD Athlon 64 and AMD Opteron processors, HyperTransport technology provides the link to I/O devices. Some processor models (for example, those designed for use in multiprocessing systems) also use HyperTransport technology to connect to other processors. See the BIOS and Kernel Developer's Guide for your particular processor for details of its HyperTransport technology implementation.
In addition to supporting previous HyperTransport interfaces, AMD Family 10h processors support a newer version of the HyperTransport standard, HyperTransport3. HyperTransport3 increases the aggregate link bandwidth to a maximum of 20.8 Gbyte/s on a 16-bit link. HyperTransport3 also adds HyperTransport Retry, which improves RAS (reliability, availability, and serviceability) by allowing detection and retransmission of packets corrupted in transit.
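That peak figure can be reconstructed from the link parameters (a hedged derivation, assuming the 2.6-GHz maximum HyperTransport3 link clock): data transfers on both clock edges, so a 16-bit (2-byte) link moves 2.6 GHz × 2 transfers per clock × 2 bytes = 10.4 Gbyte/s in each direction, or 20.8 Gbyte/s aggregate across the link's two directions.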
Additional features in the AMD Family 10h HyperTransport implementation include:
• HyperTransport Link Bandwidth Balancing, which allows multiple HyperTransport links between AMD Family 10h processors to be "teamed" to carry coherent traffic, subject to platform design.
• HyperTransport Link Splitting, which allows a single 16-bit link to be split into two 8-bit links.
These features allow further-optimized platform designs that can increase system bandwidth and reduce latency.