Shanghai has narrowed the performance gap with Nehalem, at least in the Linpack test.

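elisha · posted on 2008-12-3 15:01

Originally posted by itany on 2008-12-3 14:37

The key question is: how do you explain the i7's performance drop with Hyper-Threading enabled?

Linpack is a pure theoretical-throughput test; turning on SMT there can only backfire, I'd think.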

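tomsmith123 · posted on 2008-12-3 15:19

Originally posted by itany on 2008-12-3 14:37

The key question is: how do you explain the i7's performance drop with Hyper-Threading enabled?

With HT on there are more logical cores, and the OS schedules by load balancing. Since there is in fact only one set of execution resources, when the ALU is not the bottleneck the two logical cores keep switching on the same execution units, which adds extra overhead.
On the other hand, two logical cores also affect the cache.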

demonpumpkin · posted on 2008-12-3 15:36

Notice: the author has been banned or deleted; content automatically hidden.

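itany · posted on 2008-12-3 17:17

Originally posted by tomsmith123 on 2008-12-3 15:19

With HT on there are more logical cores, and the OS schedules by load balancing. Since there is in fact only one set of execution resources, when the ALU is not the bottleneck the two logical cores keep switching on the same execution units, which adds extra overhead.
On the other hand, two logical cores also affect the cache ...

The i7's Hyper-Threading is not two logical cores switching on one set of execution units; the two logical cores run concurrently.
I think the main bottleneck is the width of the execution path, for example the width from L1 to the instruction-fetch buffer queue, and the size of L1. After all, with HT on, the L1I is split evenly between the two threads.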

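tomsmith123 · posted on 2008-12-3 18:03

Originally posted by itany on 2008-12-3 17:17

The i7's Hyper-Threading is not two logical cores switching on one set of execution units; the two logical cores run concurrently.
I think the main bottleneck is the width of the execution path, for example the width from L1 to the instruction-fetch buffer queue, and the size of L1. After all, with HT on, the L1I is split evenly between the two threads.

HT shares the L1. L1 width is very expensive; it amounts to the width of the entire internal bus, and that price is too high.
The point of SMT or HT is that switching between the two threads is logically very fast, which can greatly improve responsiveness. For compute-intensive work that is bottlenecked on memory, it only adds processing overhead.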

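itany · posted on 2008-12-3 19:18

Originally posted by tomsmith123 on 2008-12-3 18:03

HT shares the L1. L1 width is very expensive; it amounts to the width of the entire internal bus, and that price is too high.
The point of SMT or HT is that switching between the two threads is logically very fast, which can greatly improve responsiveness. For compute-intensive work that is bottlenecked on memory, it only adds processing over ...

1. With HT enabled, the L1I is split straight into two halves, one per thread; the L1D is contended for dynamically.
2. L1 bandwidth is not that expensive, and the L1's width is not the internal bus width either. Increasing the width and size of the L1I in particular is still worthwhile.
3. The i7's SMT does not switch dynamically between two threads either; they run concurrently. The P4 has only one decoder, so of course it switches dynamically.
Of course I may be wrong in places; I'm just tossing out ideas to get the discussion going.

[ Last edited by itany on 2008-12-3 19:19 ]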

卧槽泥马 · posted on 2008-12-3 19:45

They already have a part 2 out; let's change the topic.

The last post generated some very interesting comments and questions, which I wanted to address. Unfortunately, some people misinterpreted the post as a "the best scores Nehalem and Shanghai can get in Linpack" review.

So let me make this very clear: this and the previous blogpost are not meant to be a "buyer's guide". The Nehalem desktop system and the AMD "Shanghai" server are completely different machines, targeted at totally different markets. Normally, we should wait for the Xeon 5500 to run this kind of benchmark, but consider this a preview out of curiosity.

Secondly, we were not trying to get the highest possible LINPACK scores on both architectures. We wanted to use one binary which has good optimizations for both AMD and Intel CPUs. Fully optimized binaries won't even run on the other CPU. Our only goal is to get an idea of how the Nehalem and Shanghai architectures compare when running a LINPACK-like binary which is optimized to run on all machines.

Thirdly, this is not our review of course. This is a blogpost which talks about some of the tests we are doing for the review.

MKL on AMD?
Using the Intel Math Kernel Libraries on an AMD CPU is of course a good way to start some heavy debates. As I pointed out in the last blogpost however, in some cases, the slightly older MKL versions still do a very good job on AMD CPUs when you benchmark with low matrix sizes. You don't have to take my word for it of course.

Compare the Intel Linpack 9.0 (available mid 2007) with the binary that AMD produced at the end of 2007. AMD made a K10-only version using ACML version 4.0.0, compiling Linpack with the PGI 7.0.7 compiler (with the following flags: pgcc -O3 -fast -tp=barcelona-64).

All the benchmarks below are done on one CPU with 4 GB (AMD, Intel Xeon) or 3 GB (Intel Core i7). Speedstep, Powernow! and Turbo mode were disabled.

As predicted, the ACML binary, which was compiled with a 2007 compiler, is slower than the MKL "2007" version, also compiled in 2007. The MKL version runs on any CPU that supports (S)SSE3, so it continues to be a very interesting one for us to test. As the Xeon 5472 (3 GHz) score shows, it is not fully optimized for the latest 45 nm Intel CPUs with SSE4: the 3 GHz Xeon 5472 is behind the AMD Opteron 8384. It is a good "not too optimized" version which can be used on both Intel and AMD CPUs. If this Intel binary were giving the AMD CPUs a badly optimized code path, this would not be possible.

As we move forward to 2008, we have to create a new binary, as both AMD's and Intel's fully optimized Linpack versions will not run on the competitor's CPU. Intel released the Linpack benchmark version 10.1, which is not fully optimized for the "Nehalem" architecture, but for the 45 nm "Harpertown" family.

AMD has created a new Linpack binary using ACML 4.2 and the PGI 7.2-4 compiler. Below you see how the two CPUs compare.

The bottom line is that these LINPACK benchmarks are moving targets, like the SPEC CPU benchmarks, as the compilers and libraries used are just as important as the CPUs. When the Xeon 5500 materializes, LINPACK performance will probably be higher, as the current binary is built for the "Penryn/Harpertown" family.

While it is useful for the HPC people to see which CPU + compiler can offer the best performance, it is also interesting to understand what kind of performance you get when you compile binaries that have to run on all current CPUs. It is pretty hard to compare CPU architectures if you are using totally different binaries.

In the next post we'll delve a bit deeper into what is happening with Hyperthreading, Linpack and the new architectures.

Edison · posted on 2008-12-3 19:49

A High Performance Linpack result I ran not long ago on a 2.5GHz Phenom with DDR2-800:

[root@localhost k10]# /opt/mpich-ether-gnu/bin/mpirun -np 1 ./xhpl
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  -- January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N    : The order of the coefficient matrix A.
NB    : The partitioning blocking factor.
P    : The number of process rows.
Q    : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N    : 20000
NB    :    192
PMAP : Row-major process mapping
P    :    1
Q    :    1
PFACT  : Crout
NBMIN  :    4
NDIV :    2
RFACT  : Crout
BCAST  :  2ringM
DEPTH  :    0
SWAP : Mix (threshold = 64)
L1    : transposed form
U    : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1  * N       )
2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be       1.110223e-16
- Computational tests pass if scaled residuals are less than          16.0

============================================================================
T/V             N NB    P    Q             Time          Gflops
----------------------------------------------------------------------------
WR03C2C4    20000 192    1    1          153.69       3.471e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N       ) =       0.0225654 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =       0.0218401 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =       0.0041623 ...... PASSED
============================================================================

Finished    1 tests with the following results:
            1 tests completed and passed residual checks,
            0 tests completed and failed residual checks,
            0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.

With HPL 1.1 it would be about 1 GFLOPS faster still. The BLAS used is GotoBLAS; this morning I even got an email from Kazushige Goto asking to access my Core i7 ^^ (he had earlier written a GotoBLAS for the i7 especially for me).
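The Gflops column in the log above can be cross-checked from N and the wall time. The sketch below assumes HPL's operation count is 2/3·N³ plus a lower-order 3/2·N² term; the lower-order coefficient is recalled from the HPL sources rather than taken from this thread, and at N = 20000 it has no visible effect on the result. The function names are made up for illustration.

```python
def hpl_flop_count(n: int) -> float:
    """Floating-point operations credited for solving an n x n system.

    Assumption: 2/3*n^3 + 3/2*n^2. The n^2 coefficient is a recollection
    of the HPL source and is negligible at the matrix size used here.
    """
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def hpl_gflops(n: int, seconds: float) -> float:
    """Rate of execution in Gflops for an n x n solve taking `seconds`."""
    return hpl_flop_count(n) / seconds / 1e9


# N and Time as reported in the WR03C2C4 line of the log.
n, seconds = 20000, 153.69
rate = hpl_gflops(n, seconds)
print(f"{rate:.2f} Gflops")  # prints "34.71 Gflops"
```

This agrees with the 3.471e+01 reported by xhpl.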

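bessel · posted on 2008-12-3 19:53

Linpack can basically be regarded as a measure of the CPU's theoretical peak; don't read too much into it.

Originally posted by lepton on 2008-11-30 16:18
Shanghai reaches 87% of the i7's performance clock for clock (this figure already accounts for the 2.7G vs. 2.66G frequency difference).
It feels like Shanghai has pushed the potential of the K8 core to its limit.
http://images.anandtech.com/graphs/amdshanghai_112208120910/17839.png

Original link: http:// ...

[ Last edited by bessel on 2008-12-3 19:56 ]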
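For readers wondering how a clock-for-clock figure like lepton's 87% is derived: divide each score by its clock, then take the ratio. The sketch below uses the 2.7 and 2.66 GHz clocks from the quote, but the two scores are hypothetical placeholders, since the actual chart values are not reproduced in this thread.

```python
def per_clock_ratio(score_a: float, ghz_a: float,
                    score_b: float, ghz_b: float) -> float:
    """Ratio of A's score per GHz to B's score per GHz."""
    return (score_a / ghz_a) / (score_b / ghz_b)


# Hypothetical scores (NOT the chart values); only the clocks are from the post.
shanghai_score, shanghai_ghz = 88.3, 2.7
i7_score, i7_ghz = 100.0, 2.66

ratio = per_clock_ratio(shanghai_score, shanghai_ghz, i7_score, i7_ghz)
print(f"Shanghai per-clock vs i7: {ratio:.1%}")
```

With these placeholder scores, a raw ratio of about 88% shrinks to about 87% once the 2.7 vs. 2.66 GHz difference is taken out.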

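bessel · posted on 2008-12-3 20:08

HT off scores higher because highly optimized Linpack code can reach a very high single-process instructions per cycle; if you force two logical cores onto it, there are not enough hardware resources. The explanations in all your posts are backwards. Some computations are memory-bound, but here that is not entirely the case. Raising memory bandwidth can improve the Linpack score, but raising the core count, 4 cores versus 2, improves Linpack more.

Originally posted by tomsmith123 on 2008-12-3 13:16
Linpack is still a very emblematic metric. HT OFF performing better also showed up repeatedly in my experiments; I used ICC at maximum optimization and it still did not change the result. It is actually easy to explain: for compute-intensive applications the ALU itself is not the bottleneck; the path from RAM to the ALU is.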

lepton · posted on 2008-12-8 01:01

Notice: the author has been banned or deleted; content automatically hidden.

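Edison · posted on 2008-12-8 01:10

He emailed me the day before yesterday saying he wanted remote access to my Nehalem system, but I'm on dial-up ADSL and my IP can change at any time... so my guess is he still hasn't gotten hold of a Nehalem ;)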

lepton · posted on 2008-12-8 01:47

Notice: the author has been banned or deleted; content automatically hidden.

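Edison · posted on 2008-12-8 11:37

I compared them under Cygwin once and the difference was large. I've forgotten the exact numbers, and that was back in the 1.02 era anyway.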

lepton · posted on 2008-12-8 16:53

Notice: the author has been banned or deleted; content automatically hidden.

bessel · posted on 2008-12-9 20:37

Just write a small piece of code that calls BLAS yourself.
The i7's BLAS 1 and 2 numbers are bound to look great; BLAS 3 won't improve much.
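One way to read bessel's prediction is through arithmetic intensity, meaning flops per word of memory traffic. The sketch below is an interpretation, not something bessel spelled out: it assumes ideal caching, so every operand is moved between memory and the CPU exactly once, and it uses daxpy, dgemv and dgemm as representatives of BLAS levels 1, 2 and 3.

```python
def blas_intensity(n: int) -> dict[str, float]:
    """Flops per 8-byte word moved, for square problems of order n.

    Assumption: minimum possible memory traffic (perfect cache reuse).
    """
    return {
        # y = a*x + y: 2n flops; read x, read y, write y -> 3n words
        "BLAS 1 daxpy": (2 * n) / (3 * n),
        # y = A*x + y: 2n^2 flops; read A, x, y and write y -> n^2 + 3n words
        "BLAS 2 dgemv": (2 * n * n) / (n * n + 3 * n),
        # C = A*B + C: 2n^3 flops; read A, B, C and write C -> 4n^2 words
        "BLAS 3 dgemm": (2 * n**3) / (4 * n * n),
    }


result = blas_intensity(1000)
for name, flops_per_word in result.items():
    print(f"{name}: {flops_per_word:.2f} flops/word")
```

Levels 1 and 2 stay at or below roughly 2 flops per word no matter how large n gets, so they track memory bandwidth, which is where the i7's memory subsystem would show. Level 3 grows as n/2, so it is limited by the execution units, and Linpack spends most of its time there.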

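Originally posted by lepton on 2008-12-8 16:53
I suggest you run a test comparing the BLAS family both vertically and horizontally.

Vertical: compare Core 2, Penryn and i7 at the same frequency; for fairness and practicality, evaluate 4 physical cores in every case (including 2-socket dual-core, "glued" quad-core, and native quad-core).
Horizontal: compare blas, MKL, atlas and gotoblas on the same hardware platform.

But ...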

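bessel · posted on 2008-12-9 20:39

Which program did you compare? Was the difference as much as 15%?

Originally posted by Edison on 2008-12-8 11:37
I compared them under Cygwin once and the difference was large. I've forgotten the exact numbers, and that was back in the 1.02 era anyway.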

zidane1980 · posted on 2008-12-9 22:17

I feel K8's potential hasn't hit its limit yet.
The key is what AMD is capable of.

PlumeOfWind · posted on 2008-12-11 19:38

Quite so!!!!!!!

Edison · posted on 2010-11-4 11:13

Core i7 2.67GHz, 12GB DDR3-1600 CL8, HPL 2.0, GotoBlas 2, Ubuntu 10.04 x64: 39.99 GFLOPS achieved, 94+% efficiency. Astonishing.
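An efficiency figure like this is the measured rate divided by the theoretical peak. The sketch below assumes 4 cores and 4 double-precision flops per core per cycle (one 2-wide SSE add plus one 2-wide SSE multiply) for both the Core i7 and the Phenom; neither number is stated in the thread, and the Phenom's core count in particular is a guess.

```python
def peak_gflops(ghz: float, cores: int, flops_per_cycle: int = 4) -> float:
    """Theoretical double-precision peak in Gflops.

    Assumption: flops_per_cycle=4 per core (SSE add + SSE multiply).
    """
    return ghz * cores * flops_per_cycle


def efficiency(measured_gflops: float, ghz: float, cores: int) -> float:
    """Measured HPL rate as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops(ghz, cores)


# Measured rates are Edison's; core counts are assumed, not from the thread.
i7_eff = efficiency(39.99, 2.67, cores=4)
phenom_eff = efficiency(34.71, 2.5, cores=4)
print(f"Core i7: {i7_eff:.1%}, Phenom: {phenom_eff:.1%}")
```

Under these assumptions the i7 result lands in the neighborhood of the quoted efficiency, slightly under it at a nominal 2.67 GHz; the exact percentage depends on the clock value used for the peak.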
