Futuremark has explained why the A7's physics score is low. My own reading of it; discussion welcome

ifu · posted 2013-10-18 11:51

Discussion welcome; no mindless flaming, please.

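Futuremark has explained why the A7's physics score is low, and I think the explanation holds up.
The iPhone 5s scores low in the 3DMark physics test because random memory accesses cause cache misses. Most people who studied CS will know how painful cache misses are.
The A7 has plenty of execution resources; Futuremark also mentions that the physics computation run on its own gets a speedup of more than 2x.
The physics part of the 3DMark test accesses its data randomly. Truly random access is unsolvable for any CPU, because it cannot be predicted.
Generally, a cache hit and a cache miss differ in latency by one to two orders of magnitude. If every memory access is a miss, even the strongest execution resources are wasted; for the A7, the 3DMark physics test effectively becomes a random memory access test.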
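To put rough numbers on that gap, here is a minimal sketch of the standard average memory access time formula (hit time + miss rate * miss penalty). The 4-cycle hit and 200-cycle miss penalty are made-up illustrative values, not measured A7 figures, and `average_access_cycles` is a hypothetical helper name.

```python
def average_access_cycles(hit_cycles, miss_penalty_cycles, miss_rate):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed, illustrative latencies (not A7 measurements).
HIT_CYCLES = 4
MISS_PENALTY_CYCLES = 200

for miss_rate in (0.0, 0.05, 0.5, 1.0):
    cycles = average_access_cycles(HIT_CYCLES, MISS_PENALTY_CYCLES, miss_rate)
    print(f"miss rate {miss_rate:.2f}: {cycles:.1f} cycles per access")
```

With these assumed values, an access stream that always misses costs about 50 times as much per access as one that always hits, regardless of how many execution units are waiting on the data.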
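Possible ways to raise the A7's score in this kind of random memory access test:
1) Raise the clock. A higher clock means the load/store (L/S) units issue more often, though L/S remains the bottleneck.
2) Increase the cache size. The A7 currently has 1MB of L2. Going to 2MB or more might cover the data footprint of this test, or at least help reduce cache misses.
3) Add another L/S unit...
4) Add cores, which effectively adds more L/S units.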

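So why are Bay Trail, the S800 and the A15 less affected? Possible reasons:
1) Their clocks are higher than the previous generation's, which indirectly raises the L/S rate.
2) They have more cores, which indirectly means more L/S units. A physics test like 3DMark's parallelizes well, so there should be few memory access conflicts.
3) Their caches are large enough. The Z3770, Tegra 4 and S800 all have 2MB of L2. Perhaps 2MB happens to cover 3DMark's data footprint, though who knows. Either way, a larger cache always has the advantage in a random memory access test.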
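The cache-size argument can be illustrated with a toy model: a fully associative LRU cache hit by uniformly random accesses. This is only a sketch under stated assumptions. Sizes are in cache lines, the 1024/2048/1536 figures are hypothetical stand-ins (the real 3DMark footprint is not known), and `random_access_miss_rate` is a made-up name; real L2 caches are set-associative and share space with code and other data.

```python
import random
from collections import OrderedDict


def random_access_miss_rate(cache_lines, footprint_lines, accesses=100_000, seed=1):
    """Miss rate of a fully associative LRU cache under uniformly random accesses."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    cache = OrderedDict()          # keys are line numbers, ordered oldest to newest
    misses = 0
    for _ in range(accesses):
        line = rng.randrange(footprint_lines)
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True             # miss: bring the line in
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses / accesses


# Hypothetical footprint that fits the larger cache but not the smaller one.
FOOTPRINT = 1536
for size in (1024, 2048):
    rate = random_access_miss_rate(size, FOOTPRINT)
    print(f"cache of {size} lines: miss rate {rate:.3f}")
```

In this model the larger cache only pays the initial cold misses, while the smaller one keeps missing on roughly the fraction of the footprint it cannot hold. If the footprint were far larger than both caches, the two would miss at nearly the same rate and cache size would stop mattering.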

http://community.futuremark.com/forum/showthread.php?177840-Why-iphone5-and-iphone5S-share-same-physics-score/page3
We've looked at this at great length, and the result is quite interesting. We are working with our relevant partners on this to verify, but at current it seems that:

The compiler does do SIMD and Neon. These do not offer any effect. We even tried doing Neon optimizing by hand to see if the compiler is not doing what it should, but no real effect

The Physics test uses Bullet, and practically the whole CPU time is spent in the soft body solver, PSolve_links. If you pull this function out of Bullet and bench it separately, you do see a 2x speed increase. However, once it's inside the physics engine, you see nothing.

As this seemed to make no sense, we spent a few days trying to understand what is happening. The result seems to be that if the soft bodies are arranged in memory so that the CPU can access them in a sequential fashion, you get a 2x to 3x increase in speed. This is higher if it can run up the memory, a bit lower if it runs down. The way Bullet places the bodies in memory is a lot more random and they are accessed in a jump-back-and-forth manner. When memory is accessed in this way, all speed gains are lost.

iPhone 5 shows none of this behaviour. It is realistic to assume that in the new 5s we see the new prefetch in action, but it cannot gain traction with a random memory access pattern.

It's good to understand that this is not a flaw in Bullet. Arranging complex memory structures in memory to be in a sequential fashion is non-trivial to say the least. Our hacked solution only worked for us as we knew exactly the data that we would be using. Worrying about where your memory segments lie is not something the programmer should have to worry about anyway.

In terms of our Physics test at large, it does not appear that we have an inherent flaw - our use-case is simply such that no gain is seen. Any game/app using Bullet would see the same, and there are naturally other apps that will see the same (in GeekBench, you can see a few tests where the same thing is happening).
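The sequential versus jump-back-and-forth effect described in the quote can be sketched with a toy model: a small LRU cache plus a next-line prefetcher, walked once in ascending order and once in shuffled order. This is an illustration only, not the A7's actual prefetcher (which, per the quote, also copes with descending access); `demand_misses`, the 64-line cache and the 4096-line data set are all hypothetical.

```python
import random
from collections import OrderedDict


def demand_misses(order, cache_lines=64):
    """Count demand misses for an LRU cache with a simple next-line prefetcher."""
    cache = OrderedDict()

    def touch(line):
        """Bring a line into the cache; return True if it was already present."""
        if line in cache:
            cache.move_to_end(line)
            return True
        cache[line] = True
        if len(cache) > cache_lines:
            cache.popitem(last=False)  # evict the least recently used line
        return False

    misses = 0
    for line in order:
        if not touch(line):   # demand access by the CPU
            misses += 1
        touch(line + 1)       # prefetcher guesses the next line will be needed
    return misses


LINES = 4096  # hypothetical data set, much larger than the toy cache
sequential = list(range(LINES))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # fixed seed for repeatability

print("sequential misses:", demand_misses(sequential))
print("shuffled misses:  ", demand_misses(shuffled))
```

In this model the ascending walk misses only on the very first line, because every later line was prefetched one step earlier. The shuffled walk misses on nearly every access, since the prefetcher's guess is almost never the line that comes next.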

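ifu · posted 2013-10-18 12:01

This chart suggests the A6 had very likely already hit the memory wall. The A7's performance actually drops because of its more aggressive data prefetching; a prefetch that misses is pure loss.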

acqwer · posted 2013-10-18 12:12

So the Z2580, with 512K*2, should show the bottleneck even more clearly.

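the_god_of_pig · posted 2013-10-18 12:21

It's not that complicated. Apple just wasn't good enough and built a lopsided chip: execution units piled on heavily, but the memory prefetching is a mess.

The result is that simple math benchmarks look great, and it falls flat as soon as it meets a large, complex workload.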

the_god_of_pig · posted 2013-10-18 12:23

By that logic, it will probably fall flat on SPEC too.

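ifu · posted 2013-10-18 12:24

acqwer posted 2013-10-18 12:12
So the Z2580, with 512K*2, should show the bottleneck even more clearly.

The Z2580 runs at a higher clock, 2GHz; the A7 is only 1.3GHz.
If every access is a cache miss and you are purely measuring random memory access, the 2GHz part definitely has the edge.
Bottlenecks need concrete quantitative analysis, and the situation differs from processor to processor. For some the bottleneck is execution resources; for others it is memory access.
For the A7, Futuremark themselves measured a speedup of more than 2x once the random memory access restriction was removed.

By the way, where are the Z2580 test results? I'd like to take a look.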

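acqwer · posted 2013-10-18 12:26

ifu posted 2013-10-18 12:24
The Z2580 runs at a higher clock, 2GHz; the A7 is only 1.3GHz.
If every access is a cache miss and you are purely measuring random memory access, the 2GHz part definitely has the edge.
Bottlenecks need ...

On the 3DMark website; a little over 10000 points, I think.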

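ifu · posted 2013-10-18 12:28

the_god_of_pig posted 2013-10-18 12:21
It's not that complicated. Apple just wasn't good enough and built a lopsided chip: execution units piled on heavily, but the memory prefetching is a mess.

The result is that simple math bench ...

Apple's data prefetching is already very good; take a look at the memory results in Geekbench.
If the accesses are truly random, any chip will struggle. If you could predict them, they wouldn't be random.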

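ifu · posted 2013-10-18 12:30

the_god_of_pig posted 2013-10-18 12:23
By that logic, it will probably fall flat on SPEC too.

Data access in SPEC is still quite regular, and after training on a data set it is hard to miss even if you wanted to.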

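eternal0 · posted 2013-10-18 12:33

Put simply, it's the old Celeron versus Pentium gap: cache has a huge impact in some applications, and the 30MB of L3 that server CPUs routinely carry isn't there for show.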

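acqwer · posted 2013-10-18 12:36

eternal0 posted 2013-10-18 12:33
Put simply, it's the old Celeron versus Pentium gap: cache has a huge impact in some applications, and the 30MB of L3 that server CPUs routinely carry isn't there for show.

The problem is that the 3DMark physics test is not a cache-sensitive test, so his analysis is clearly wrong.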

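the_god_of_pig · posted 2013-10-18 12:38

ifu posted 2013-10-18 12:28
Apple's data prefetching is already very good; take a look at the memory results in Geekbench.
If the accesses are truly random, any chip will struggle. If you could predict ...

Can we please not bring up Geekbench? You start a microarchitecture discussion thread and then argue from Geekbench?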

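the_god_of_pig · posted 2013-10-18 12:39

ifu posted 2013-10-18 12:30
Data access in SPEC is still quite regular, and after training on a data set it is hard to miss even if you wanted to.

Nonsense. SPEC doesn't miss? What do you think let the K8 crush the P4 back then with little more than an IMC?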

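ifu · posted 2013-10-18 12:43

acqwer posted 2013-10-18 12:36
The problem is that the 3DMark physics test is not a cache-sensitive test, so his analysis is clearly wrong.

The current 3DMark physics test is a numerical benchmark built on random memory access, and Futuremark's staff have pointed this out too.
I'm also not tying it only to cache. If 3DMark's data footprint is large enough and random enough, the result won't depend on cache size; it will depend on the L/S units instead.
But only Futuremark's own staff know 3DMark's data footprint, and they won't disclose it.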

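ifu · posted 2013-10-18 12:46

the_god_of_pig posted 2013-10-18 12:39
Nonsense. SPEC doesn't miss? What do you think let the K8 crush the P4 back then with little more than an IMC?

Nothing is completely miss-free unless it fits entirely in cache, but training reduces misses as far as possible. What do you think profiling is for?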

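ifu · posted 2013-10-18 12:47

eternal0 posted 2013-10-18 12:33
Put simply, it's the old Celeron versus Pentium gap: cache has a huge impact in some applications, and the 30MB of L3 that server CPUs routinely carry isn't there for show.

It's not only cache; cache is just one of the possibilities.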

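ifu · posted 2013-10-18 12:51

the_god_of_pig posted 2013-10-18 12:38
Can we please not bring up Geekbench? You start a microarchitecture discussion thread and then argue from Geekbench?

Of course Geekbench is meaningful. If a chip can't even prefetch basic, regular memory access patterns well, it will fail the memory section of Geekbench.
A processor that runs Geekbench well isn't necessarily great, but one that runs it badly is definitely weak.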

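acqwer · posted 2013-10-18 12:58

ifu posted 2013-10-18 12:51
Of course Geekbench is meaningful. If a chip can't even prefetch basic, regular memory access patterns well, it will fail the memory section of Geekbench.
A processor that runs Geek ...

That sentence says nothing. Running well or badly is relative to begin with: if processor A scores higher than processor B, then A is the one that runs it well. Without a comparison, how can you tell which one runs well?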

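ifu · posted 2013-10-18 12:59

acqwer posted 2013-10-18 12:58
That sentence says nothing. Running well or badly is relative to begin with: if processor A scores higher than processor B, then A is the one that runs it well. Without a comparison, how ...

So Geekbench is meaningful.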

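acqwer · posted 2013-10-18 13:06

ifu posted 2013-10-18 12:59
So Geekbench is meaningful.

Geekbench is the benchmark that can report a PE with 1M of L2 and an X2 as having equal memory performance. What is the point of that "memory" test?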
