A.12.1 Floating-Point Scheduler
The floating-point logic of the AMD Athlon 64 and AMD Opteron processors is a high-performance, fully pipelined, superscalar, out-of-order execution unit. It is capable of accepting three macro-ops per cycle from any mixture of the following types of instructions:
• x87 floating-point
• 3DNow! technology
• MMX
• SSE
• SSE2
• SSE3
• SSE4a
The floating-point scheduler handles register renaming and has a dedicated 36-entry scheduler buffer organized as 12 lines of three macro-ops each. It also performs data superforwarding, micro-op issue, and out-of-order execution. The floating-point scheduler communicates with the ICU to retire a macro-op, to manage results of *COMI* and FP-to-INT movement and conversion instructions using a 64-bit-wide FP-to-INT bus, and to back out results from a branch misprediction.
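For example, a *COMI*-class compare and an FP-to-INT conversion both produce results that the integer core consumes over that bus. The following sketch is illustrative only (the register choices and branch target are assumptions, not taken from this guide):
comisd    xmm0, xmm1  ;Compare; the flag result crosses the
                      ; 64-bit FP-to-INT bus to the integer core
ja        is_above    ;Integer branch consumes those flags
                      ; (is_above is a placeholder label)
cvttsd2si eax, xmm2   ;FP-to-INT conversion; the result also
                      ; returns over the FP-to-INT bus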
Superforwarding is a performance optimization. It allows faster scheduling of a floating-point operation that depends on a register which is waiting to be filled by a pure load from memory. Instead of waiting for the first instruction to write its load data to the register and for the second instruction to then read it, the load data can be provided directly to the dependent instruction, much like regular forwarding between FPU-only operations. The result of the load is said to be "superforwarded" to the floating-point operation. In the following example, the FADD can be scheduled to execute as soon as the load operation fetches its data, rather than having to wait and read it out of the register file.
fld  [somefloat]   ;Load a floating-point value
                   ; from memory into ST(0)
fadd st(0), st(1)  ;The data from the load will be
                   ; forwarded directly to this instruction;
                   ; no need to read the register file
A.12.2 Floating-Point Execution Unit
The floating-point execution unit (FPU) has its own out-of-order execution control and datapath. The FPU handles all register operations for x87 instructions, all 3DNow! technology operations, all MMX operations, and all SSE, SSE2, SSE3, and SSE4a operations. The FPU consists of a stack renaming unit, a register renaming unit, a scheduler, a register file, and execution units that are each capable of computing and delivering results of up to 128 bits per cycle. Figure 10 shows a block diagram of the dataflow through the FPU.
[Figure 10: Floating-Point Unit block diagram]
As shown in Figure 10, the floating-point logic uses three separate execution positions or pipes (FADD, FMUL, and FSTORE). Details on which instructions can use which pipes are specified in Appendix C.
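For example, independent multiply, add, and store macro-ops can occupy all three pipes concurrently. The following sketch is illustrative only (the operands are assumptions; Appendix C gives the authoritative instruction-to-pipe mappings):
mulpd  xmm0, xmm1     ;Issues to the FMUL pipe
addpd  xmm2, xmm3     ;Independent of the multiply; can
                      ; issue to the FADD pipe in parallel
movapd [result], xmm4 ;Store micro-op uses the FSTORE pipe
                      ; (result is a placeholder operand)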
A.13 Load-Store Unit
The L1 data cache and load-store unit (LSU) are shown in Figure 11. The L1 data cache supports two 128-bit loads, two 64-bit store writes, or a mix of the two per cycle. The LSU consists of two queues, LS1 and LS2. LS1 can issue two L1 cache operations (loads or store tag checks) per cycle, and it can issue loads out of order, subject to certain dependency restrictions. LS2 holds requests that missed in the L1 cache after they probe out of LS1; store writes are performed exclusively from LS2. 128-bit stores are handled specially: they occupy two LS2 entries, and the store data is written as two 64-bit writes, as shown in the sketch below. Finally, the LSU helps ensure that the architectural load and store ordering rules are preserved, a requirement for AMD64 architecture compatibility.
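The following sketch exercises both behaviors (assuming 16-byte-aligned buffers addressed through rsi and rdi, which are illustrative choices, not requirements from this guide):
movapd xmm0, [rsi]    ;Two 128-bit loads; LS1 can issue
movapd xmm1, [rsi+16] ; both in the same cycle
addpd  xmm0, xmm1
movapd [rdi], xmm0    ;128-bit store: occupies two LS2 entries
                      ; and completes as two 64-bit writes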
A.14 Write Combining
AMD Family 10h processors provide four write-combining data buffers that allow four simultaneous streams. For details, see Appendix B “Implementation of Write-Combining” on page 227.
A.15 Integrated Memory Controller
AMD Family 10h processors provide an integrated low-latency, high-bandwidth DDR2 memory controller.
The memory controller supports:
• DRAM chips that are 4, 8, and 16 bits wide within a DIMM.
• Interleaving memory within DIMMs.
• ECC checking with double-bit detection and single-bit correction.
• Both dual-independent 64-bit channel and single 128-bit channel operation.
• Optimized scheduling algorithms and access pattern predictors to improve latency and achieved bandwidth, particularly for interleaved streams of read and write DRAM accesses.
• A data prefetcher.
Prefetched data is held in the memory controller itself and is not speculatively filled into the L1, L2, or L3 caches. The prefetcher can capture both positive and negative strides, unit and non-unit, measured in units of cache lines, as well as some more complicated access patterns.
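For example, a loop that walks a buffer backward one cache line per iteration presents a negative unit stride that the prefetcher can capture. This sketch is illustrative only (the registers and label are assumptions):
scan_down:
movapd xmm0, [rsi]    ;Touch the current 64-byte cache line
sub    rsi, 64        ;Step back one line
                      ; (negative unit stride)
cmp    rsi, rdi
jae    scan_down      ;Continue until the buffer start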
For the specifications of a particular processor's memory controller, see that processor's data sheet. For information on how to program the memory controller, see the BIOS and Kernel Developer's Guide for AMD Family 10h Processors, order# 31116.
A.16 HyperTransport™ Technology Interface
HyperTransport technology is a scalable, high-speed, low-latency, point-to-point, packetized link that:
• Enables high data transfer rates.
• Simplifies connectivity by replacing legacy buses and bridges.
• Reduces latencies and bottlenecks within systems.
When compared with traditional technologies, HyperTransport technology allows much faster data-transfer rates. For more information on HyperTransport technology, see the HyperTransport I/O Link Specification, available at www.hypertransport.org.
On AMD Athlon 64 and AMD Opteron processors, HyperTransport technology provides the link to I/O devices. Some processor models (for example, those designed for use in multiprocessing systems) also use HyperTransport technology to connect to other processors. See the BIOS and Kernel Developer's Guide for your particular processor for details of its HyperTransport technology implementation.
In addition to supporting previous HyperTransport interfaces, AMD Family 10h processors support a newer version of the HyperTransport standard, HyperTransport3. HyperTransport3 increases the aggregate link bandwidth to a maximum of 20.8 Gbyte/s on a 16-bit link. HyperTransport3 also adds HyperTransport Retry, which improves RAS (reliability, availability, and serviceability) by allowing detection and retransmission of packets corrupted in transit.
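That peak figure can be reconstructed from the link parameters (a hedged derivation, assuming the 2.6-GHz maximum HyperTransport3 link clock): data transfers on both clock edges, so a 16-bit (2-byte) link moves 2.6 GHz × 2 transfers per clock × 2 bytes = 10.4 Gbyte/s in each direction, or 20.8 Gbyte/s aggregate across the link's two directions.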
Additional features in the AMD Family 10h HyperTransport implementation include:
• HyperTransport Link Bandwidth Balancing, which allows multiple HyperTransport links between AMD Family 10h processors to be "teamed" to carry coherent traffic, subject to platform design.
• HyperTransport Link Splitting, which allows a single 16-bit link to be split into two 8-bit links.
These features allow further-optimized platform designs that can increase system bandwidth and reduce latency.