POPPUR爱换

Title: Latest Bulldozer news, experts please take a look

Author: 095707 Time: 2010-8-24 13:18
Notice: The author has been banned or deleted; content automatically hidden

Author: xeon-pan Time: 2010-8-24 13:32
Nothing new here, it's all the same old stuff.

Author: CC9K Time: 2010-8-24 13:45
Slimmed Down but Double Wide!

Bulldozer can be summed up as “slimmed down, but double wide”.  For each traditional core, AMD has instituted a dual ALU design with robust floating point and SSE units.  Each core can handle two threads, like SMT, but actually has separate execution units which each process individual threads without sharing execution resources.

Each unit features a single fetch and decode stage.  The decode stage is comprised of four units, but we do not yet know their inner workings.  In the previous K7/K10.5 generations of parts, there are three complex decode units.  On the Intel side with Core 2 and Nehalem, there are three simple decode units and a single complex.  AMD also did not cover subjects such as macro-ops and macro-op fusion.  AMD has beefed up their decode stage significantly though.  It simply had to, because it is now feeding dual integer schedulers and a floating point scheduler feeding 2 x 128 bit FMACs and MMX units.

Fetch, decode, floating point/SSE, and the L2 cache are the shared components.  Since most workloads are integer based, AMD doubled the integer units.  These 128 bit packed integer pipes are a step above what was offered in the Phenom II.  In theory, there should be a sizeable per clock increase in integer and floating point apps on Bulldozer over the Phenom II.  When something is more heavily threaded, then we will see dramatic improvements in performance.  Each integer core features its own L1 D-cache.  AMD has again not clarified how much L1 or L2 cache there is for each discrete unit, or L3 cache sizes for the entire processor.

Branch prediction is one area that has not seen big jumps in the past decade, but due to the shared components and their greater data requirements, it is getting a major makeover.  AMD did not cover details of this unit, other than it is new and much more robust than the older unit in previous generations of chips.  I would be curious if it held more in common with the overbuilt K6 unit than the smaller and simplified unit developed for the Athlon family of products.

In most workloads, a four unit chip can natively handle eight threads, and the chip will show up with eight logical processors.  But the workflow will be significantly different due to the shared components and how they burst out data for the execution units.

The floating point and SSE/SIMD capabilities of Bulldozer have also been given a boost.  The Phenom and Phenom II had a single, 128 bit unit.  This was an upgrade from the Athlon 64, which featured 2 x 64 bit units.  Bulldozer now has 2 x 128 bit units which can be utilized as a single 256 bit unit when running AVX (Advanced Vector eXtensions) code.  They can also be utilized as 2 x 128 bit units, and can do 4 x 64 bit operations when needed.
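
The widths quoted above can be checked with simple lane arithmetic. The following is only a sketch of that arithmetic, not a model of the hardware; the function name `lanes` and the constants are made up for illustration, and the element widths (32 and 64 bit) are the usual single and double precision sizes.

```python
# Toy arithmetic only: how many packed elements fit in the FP/SIMD
# datapath widths quoted in the article. Not a model of real hardware.

def lanes(datapath_bits: int, element_bits: int) -> int:
    """Number of packed elements that fit in one datapath of the given width."""
    if datapath_bits % element_bits != 0:
        raise ValueError("element width must divide the datapath width")
    return datapath_bits // element_bits

FMAC_BITS = 128        # width of one FMAC unit, per the article
FMACS_PER_MODULE = 2   # two such units per module, per the article

# Both units ganged together for one 256-bit AVX operation.
avx_bits = FMAC_BITS * FMACS_PER_MODULE

for element_bits in (32, 64):
    per_unit = lanes(FMAC_BITS, element_bits)
    ganged = lanes(avx_bits, element_bits)
    print(f"{element_bits}-bit elements: {per_unit} per 128-bit unit, "
          f"{ganged} across the 256-bit pair")
```

With 64-bit elements the ganged pair holds four lanes, which is where the "4 x 64 bit operations" figure comes from.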

Author: 095707 Time: 2010-8-24 13:47
Notice: The author has been banned or deleted; content automatically hidden

Author: cinlo Time: 2010-8-24 14:48
I'm no expert, I can't follow any of this.......

Author: Edison Time: 2010-8-24 14:53
http://www.anandtech.com/show/38 ... at-hot-chips-2010/4

This should count as a rather unusual take on SMT; it reminds me of the dual-scheduler design used in Fermi's SM.

Author: Edison Time: 2010-8-24 14:54
Branch Prediction and a Deeper Pipeline
Bulldozer will use a deeper pipeline with less logic per stage compared to current Phenom II/Opteron processors. AMD argues that this will ensure clock speed won’t be a problem with the design and we should expect to see Bulldozer based products at similar if not higher clock speeds than what we have today with Phenom II.

With a deeper pipe, branch prediction becomes more important and Bulldozer has a significant change in the way branch prediction works.

In Phenom II, the branch prediction and instruction fetch logic are run in lockstep - when one stalls, the other also stalls. Branches are predicted as they are encountered. If the fetch logic grabs an x86 branch instruction, the prediction logic works in parallel to predict the likely target of that branch. However if the branch is incorrectly predicted, subsequent branches aren’t predicted until the current mispredict is correctly resolved. As a result, the fetch logic and prefetchers can’t work and potential performance is lost.

In Bulldozer the branch prediction and fetch logic are decoupled. The predictor now produces a queue of future fetch addresses. Even if there’s a mispredict the branch predictor can continue to fill its prediction queue with targets. The fetch logic can then check this queue of addresses against what’s in the instruction cache to avoid future misses in L1.
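
The decoupling described above can be sketched as a small producer/consumer model. AMD did not disclose the real mechanism, so everything concrete here is an assumption for illustration: the 64-byte line size, the queue depth of 4, and the names `line_of` and `run_ahead` are all hypothetical.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache-line size, for illustration only


def line_of(addr: int) -> int:
    """Cache-line index that contains a byte address."""
    return addr // LINE_BYTES


def run_ahead(predicted_addrs, icache_lines, queue_depth=4):
    """Toy model of a predictor running ahead of fetch.

    predicted_addrs: fetch addresses a (hypothetical) predictor emits, in order
    icache_lines:    set of line indices currently in the instruction cache
    Returns the line indices fetch would request early, in request order.
    """
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")

    cached = set(icache_lines)        # copy so the caller's set is untouched
    pending = deque(predicted_addrs)  # what the predictor has yet to emit
    queue = deque()                   # the prediction queue itself
    prefetched = []

    while pending or queue:
        # Predictor side: keep filling the queue while there is room.
        while pending and len(queue) < queue_depth:
            queue.append(pending.popleft())

        # Fetch side: check queued addresses against the cache and request
        # missing lines before they are actually needed.
        for addr in queue:
            line = line_of(addr)
            if line not in cached:
                cached.add(line)
                prefetched.append(line)

        # Fetch consumes the oldest prediction and moves on.
        queue.popleft()

    return prefetched


print(run_ahead([0x1000, 0x1040, 0x2000, 0x1008], {line_of(0x1000)}))
```

The point of the sketch is only that the queue lets the fetch side see upcoming addresses early; it does not model mispredicts, timing, or replacement.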

Prefetchers
With Phenom AMD implemented comparable prefetching logic to what Intel did with Core. In Bulldozer, AMD is ramping up the aggressiveness of those prefetchers. There are independent prefetchers at both the L1 and L2 levels that support larger numbers of strides and large stride sizes (both compared to what exists in current AMD architectures). There’s also a non-strided data prefetcher that looks at correlated cache misses and uses that data to prefetch into the caches.
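
For readers unfamiliar with strided prefetching, here is a minimal sketch of the general idea. It is a textbook-style detector, not AMD's design: the rule of trusting a stride after seeing it twice in a row, the `degree` knob, and the function name `stride_prefetch` are all assumptions for illustration.

```python
def stride_prefetch(addresses, degree=2):
    """Return, per access, the addresses a simple stride detector would prefetch.

    A stride is trusted once the same non-zero delta is seen twice in a row.
    `degree` is how many strides ahead to prefetch (an illustrative knob).
    """
    issued = []
    last_addr = None
    last_delta = None
    for addr in addresses:
        prefetches = []
        if last_addr is not None:
            delta = addr - last_addr
            if delta != 0 and delta == last_delta:
                # Confirmed stride: fetch the next `degree` addresses along it.
                prefetches = [addr + delta * k for k in range(1, degree + 1)]
            last_delta = delta
        last_addr = addr
        issued.append(prefetches)
    return issued


# A 64-byte stride followed by an unrelated access that breaks the pattern.
print(stride_prefetch([0, 64, 128, 192, 1000]))
```

A more aggressive prefetcher in this toy model would track several strides at once or use a larger `degree`; the non-strided, correlation-based prefetcher mentioned in the article is a different mechanism and is not modeled here.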

AMD unfortunately didn’t go into more detail on its prefetchers other than to promise that they are much more aggressive than what we have today. Aggressive prefetching usually means there’s a good amount of memory bandwidth available so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller in high end configurations similar to what we have today with Gulftown.

Power Gating & Real Turbo Mode
Each Bulldozer module in a processor can be clocked and power gated independently. This has two implications. You can now power off cores (in sets of two) that aren’t in use and save tons of idle power. You can also use the power savings to drive up the frequency of other cores in a Bulldozer CPU. With Bulldozer, AMD should have something functionally equivalent to Intel’s Turbo Boost modes. Since clock speed and power gating are controlled at the module level and not the core level, there will still be some differences between the two, but this should be much better than AMD’s current Core Turbo technology.
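
The consequence of gating at module granularity can be sketched in a few lines. This is only an illustration of the "sets of two" rule from the article; the core numbering (cores 2m and 2m+1 share module m) and the function name `gateable_modules` are assumptions, and no real power or frequency figures are modeled.

```python
CORES_PER_MODULE = 2  # per the article: cores are gated in sets of two


def gateable_modules(num_modules, busy_cores):
    """Modules that could be powered off: those with no busy core.

    Cores are numbered 0..2*num_modules-1, and cores 2m and 2m+1 share
    module m (a numbering assumed here for illustration).
    """
    busy_modules = {core // CORES_PER_MODULE for core in busy_cores}
    return [m for m in range(num_modules) if m not in busy_modules]


# Two busy threads packed into modules 0 and 1: modules 2 and 3 can be gated.
print(gateable_modules(4, [0, 3]))

# Four busy threads spread one per module: nothing can be gated.
print(gateable_modules(4, [0, 2, 4, 6]))
```

The second call shows the difference from per-core gating: with one thread per module, half the cores are idle but no module can be switched off.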

There’s of course extensive clock gating around the chip, but obviously the big change is power gating which AMD hasn’t had up to this point (Bobcat is also power gated).

Performance and Availability
While Bobcat is going to be in production in Q4 of this year, with system availability in Q1 of 2011 - Bulldozer is still a 2011 project and AMD isn’t giving any guidance as to when in 2011.

Parts are already back and in AMD’s labs but we have no indication of performance or rollout schedule. Given Bobcat’s schedule, I’d say that the first Bulldozer CPUs will be out no earlier than Q2 2011, and AMD’s unwillingness to specify which half of the year would imply that it’ll be a late Q2/early Q3 launch.

Author: sfeng0 Time: 2010-8-24 17:20

This time AMD's architecture can keep leading Intel by 10 years

Author: tansailuffy Time: 2010-8-24 17:39
10 years? You're underestimating Bulldozer! The next-generation "Bicycle" will be 80 years ahead of Intel!

Author: the_god_of_pig Time: 2010-8-24 17:50
Ahem, floating point...

Author: AMD11 Time: 2010-8-24 18:08

Nothing new here, it's all the same old stuff.
xeon-pan posted on 2010-8-24 13:32

There's something a bit new here, a cut-down core: http://we.pcinlife.com/thread-1496484-1-1.html

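Author: zaknafein Time: 2010-8-24 18:21
Single-thread execution units halved, a deeper pipeline bringing back the P4, branch prediction... Intel only went all-in on branch prediction because of the P4; AMD, can you pull it off...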

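Author: itany Time: 2010-8-24 18:37

Single-thread execution units halved, a deeper pipeline bringing back the P4, branch prediction... Intel only went all-in on branch prediction because of the P4; AMD, can you pull it off...
zaknafein posted on 2010-8-24 18:21

They weren't halved; it just went from 3 ALUs to 2 ALUs, and the same goes for the AGUs.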

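Author: zaknafein Time: 2010-8-24 19:40

They weren't halved; it just went from 3 ALUs to 2 ALUs, and the same goes for the AGUs.
itany posted on 2010-8-24 18:37

Compared with the widely circulated rumor of 4 execution units per core, it's been cut in half... so it turns out it's still 4 execution units per module...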

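Author: PRAM Time: 2010-8-24 19:49

10 years? You're underestimating Bulldozer! The next-generation "Bicycle" will be 80 years ahead of Intel!
tansailuffy posted on 2010-8-24 17:39

The next generation will be renamed "Perpetual Motion Machine" and be 10,000 years ahead of Intel!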

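Author: CC9K Time: 2010-8-24 19:59

Compared with the widely circulated rumor of 4 execution units per core, it's been cut in half... so it turns out it's still 4 execution units per module...
zaknafein posted on 2010-8-24 19:40

What was called a core back then is what is now called a module; expecting each small core inside one to beat K10 is unrealistic.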

Author: 饭米米 Time: 2010-8-24 21:23
Let's wait for the actual product; as long as it's cheap, that's fine.

Author: potomac Time: 2010-8-24 22:24
Notice: The author has been banned or deleted; content automatically hidden

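Author: itany Time: 2010-8-24 23:46

Compared with the widely circulated rumor of 4 execution units per core, it's been cut in half... so it turns out it's still 4 execution units per module...
zaknafein posted on 2010-8-24 19:40

How could each core possibly fit 4 execution units? If it were worthwhile, Intel would have done it long ago.
Pairing a 4-wide front end with two 4-ALU execution units is something only a design team that had lost its mind would do.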

Author: 千人 Time: 2010-8-24 23:54
Came in to see the might of Bulldozer.

Welcome to POPPUR爱换 (https://we.poppur.com/)