The first ARM64 chip is finally out

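frankincense · posted 2013-10-18 09:49

ifu posted 2013-10-18 09:24
Silvermont does have a phone product line planned. The phone SoC market is bigger than tablets, so how could Intel ignore it? The Silvermont tablets currently on the market ...

Some tablets use the 3740 because of product positioning; it's not as if none use the 3770.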

qqisqq · posted 2013-10-18 10:01

Last edited by qqisqq on 2013-10-18 10:04

Futuremark's explanation of the Physics test score:

We've looked at this at great length, and the result is quite interesting. We are working with our relevant partners on this to verify, but at current it seems that:

The compiler does do SIMD and Neon. These do not offer any effect. We even tried doing Neon optimizing by hand to see if the compiler is not doing what it should, but no real effect

The Physics test uses Bullet, and practically the whole CPU time is spent in the soft body solver, PSolve_links. If you pull this function out of Bullet and bench it separately, you do see a 2x speed increase. However, once it's inside the physics engine, you see nothing.

As this seemed to make no sense, we spent a few days trying to understand what is happening. The result seems to be that if the soft bodies are arranged in memory so that the CPU can access them in a sequential fashion, you get a 2x to 3x increase in speed. This is higher if it can run up the memory, a bit lower if it runs down. The way Bullet places the bodies in memory is a lot more random, and they are accessed in a jump-back-and-forth manner. When memory is accessed in this way, all speed gains are lost.

iPhone 5 shows none of this behaviour. It is realistic to assume that in the new 5s we see the new prefetch in action, but it cannot gain traction with a random memory access pattern.

It's good to understand that this is not a flaw in Bullet. Arranging complex memory structures in memory to be in a sequential fashion is non-trivial to say the least. Our hacked solution only worked for us as we knew exactly the data that we would be using. Worrying about where your memory segments lie is not something the programmer should have to worry about anyway.

In terms of our Physics test at large, it does not appear that we have an inherent flaw - our use-case is simply such that no gain is seen. Any game/app using Bullet would see the same, and there are naturally other apps that will see the same (in GeekBench, you can see a few tests where the same thing is happening).
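
A rough way to picture what the statement describes is a toy cache model. The sketch below is only an illustration, not Futuremark's code and not a model of the A7: the 64-byte line size, the 256-line direct-mapped cache, the 16-byte element size, and the always-on next-line prefetcher are all assumptions, and `count_misses` and `demo` are hypothetical names. It counts cache misses for the same elements visited in order and in shuffled order.

```python
import random

LINE_BYTES = 64  # assumed cache line size, not a documented A7 figure


def count_misses(addresses, num_lines=256, prefetch_next=False):
    """Count misses in a toy direct-mapped cache.

    With prefetch_next=True, every access also loads the following line,
    a crude stand-in for a sequential hardware prefetcher.
    """
    slots = [None] * num_lines  # slots[i] holds the line number cached in slot i
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if slots[line % num_lines] != line:
            misses += 1
            slots[line % num_lines] = line
        if prefetch_next:
            nxt = line + 1
            slots[nxt % num_lines] = nxt
    return misses


def demo(num_elems=20000, elem_bytes=16, seed=0):
    """Compare in-order and shuffled walks over the same elements."""
    sequential = [i * elem_bytes for i in range(num_elems)]
    scattered = sequential[:]
    random.Random(seed).shuffle(scattered)  # same addresses, jumbled order
    return {
        "seq": count_misses(sequential),
        "seq_prefetch": count_misses(sequential, prefetch_next=True),
        "rand": count_misses(scattered),
        "rand_prefetch": count_misses(scattered, prefetch_next=True),
    }


if __name__ == "__main__":
    for name, misses in demo().items():
        print(f"{name}: {misses} misses")
```

Under these assumptions the in-order walk with the prefetcher on misses only on its very first access, while the shuffled walk misses on most accesses whether the prefetcher is on or off. That is the same shape as the result in the statement: a prefetcher helps a sequential layout and gets no traction on a jump-back-and-forth one.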

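ifu · posted 2013-10-18 10:09

frankincense posted 2013-10-18 09:48
Octane, a single-threaded test, Chrome 30, Win 8.1 x64
: 5300 points
: 27000 points

It's a fully genuine boxed retail unit.

I took a look and the score is indeed a bit low. I'll run WebXPRT when I have time.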

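ifu · posted 2013-10-18 10:16

qqisqq posted 2013-10-18 10:01
Futuremark's explanation of the Physics test score:

We've looked at this at great length, and the result is quite ...

This explanation is good and professional, far better than JoseyWales, who just spouts nonsense.
btw: they didn't ignore Geekbench either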

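acqwer · posted 2013-10-18 10:26

ifu posted 2013-10-18 10:16
This explanation is good and professional, far better than JoseyWales, who just spouts nonsense.
btw: they didn't ignore Geekbench either [lol]

There are only these few benchmarks on Apple devices anyway. They are real industry people, so naturally they won't undercut anyone else.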

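largewc · posted 2013-10-18 10:31

qqisqq posted 2013-10-18 10:01
Futuremark's explanation of the Physics test score:

We've looked at this at great length, and the result is quite ...

The gist seems to be that the A7 optimizes sequential memory access; with random memory access it is the same as the A6, with no improvement.

Sequential memory access still has some value. Streams are where sequential memory is used most often, so this feature may give some boost to stream data operations such as video and compression encoding/decoding.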

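largewc · posted 2013-10-18 10:36

ifu posted 2013-10-18 10:16
This explanation is good and professional, far better than JoseyWales, who just spouts nonsense.
btw: they didn't ignore Geekbench either [lol]

So basically the A7 is heavily optimized for stream processing, with limited gains for the logic part.

I think this is putting the cart before the horse. The future of stream processing is on the GPU, and unified addressing from Intel will arrive in 2015 too. This should be the GPU's job in the future, not the CPU's; the GPU is fundamentally better at it than the CPU.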

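qqisqq · posted 2013-10-18 10:37

largewc posted 2013-10-18 10:31
The gist seems to be that the A7 optimizes sequential memory access; with random memory access it is the same as the A6, with no improvement.

That's roughly the gist.
But they also said that rearranging memory accesses into sequential order is not a routine thing to do, and it's not something programmers should have to worry about.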

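qqisqq · posted 2013-10-18 10:39

ifu posted 2013-10-18 10:16
This explanation is good and professional, far better than JoseyWales, who just spouts nonsense.
btw: they didn't ignore Geekbench either [lol]

They're citing the GeekBench tests that show no performance gain to back up their point, aren't they?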

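largewc · posted 2013-10-18 10:42

Last edited by largewc on 2013-10-18 10:45

qqisqq posted 2013-10-18 10:37
That's roughly the gist.
But they also said that rearranging memory accesses into sequential order is not a routine thing to do, and it's not something programmers should have to ...

In some places you can't change the memory order. Physics is a typical case: things are always moving, lots of objects are added and removed dynamically, and the in-memory tree structures keep changing too.

Sequential memory only matters for stream data, and I don't think that will be the CPU's main use case.

Still, it matters a lot for the A7: it could greatly improve the A7's weak spot, software decoding on a dual-core.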

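acqwer · posted 2013-10-18 10:46

largewc posted 2013-10-18 10:42
In some places you can't change the memory order. Physics is a typical case: things are always moving, lots of objects are added and removed dynamically, and the in-memory tree structures ...

But improving sequential memory performance is very effective for certain benchmarks. In Geekbench, for example, sequential and random memory performance carry the same scoring weight.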

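ifu · posted 2013-10-18 10:47

largewc posted 2013-10-18 10:36
So basically the A7 is heavily optimized for stream processing, with limited gains for the logic part.

Wrong, the logic part gets a big boost: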
If you pull this function out of Bullet and bench it separately, you do see a 2x speed increase.
The result seems to be that if the soft bodies are arranged in memory so that the CPU can access them in a sequential fashion, you get a 2x to 3x increase in speed

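frankincense · posted 2013-10-18 10:47

largewc posted 2013-10-18 10:36
So basically the A7 is heavily optimized for stream processing, with limited gains for the logic part.

More likely, the A7 hands this kind of stream processing off to the GPU internally.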

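largewc · posted 2013-10-18 10:50

Last edited by largewc on 2013-10-18 10:54

ifu posted 2013-10-18 10:47
Wrong, the logic part gets a big boost:
If you pull this function out of Bullet and bench it separately, you ...

Same point again: sequential memory gets a boost, but most program logic can't be sequential.

It makes sense that the A7 gets a boost in stream processing, JPEG loading for example. Sequential tasks like these benefit the most, and browser performance gets a boost as a result.

Still, I don't think this has a future. For things like JPEG loading, IE11 already does it on the GPU, and the GPU is fundamentally better at this than the CPU. Sequential stream processing will belong to APU-style architectures.

frankincense's take is also plausible: maybe code compiled by Apple's toolchain is already GPU-accelerated. That's Apple's advantage, the advantage of a unified platform.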

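largewc · posted 2013-10-18 10:52

Last edited by largewc on 2013-10-18 11:02

frankincense posted 2013-10-18 10:47
More likely, the A7 hands this kind of stream processing off to the GPU internally.

Possibly. Maybe OpenCL, which Apple has been promoting, was improved in this release and became a default option for C++, so loops are GPU-accelerated by default.

On the PC, DX11 still isn't a mandatory standard, so compilers can't target it by default, and Android is even further away.

That really is Apple's advantage.

If so, it's ironic: the APU advantage that AMD has pushed so hard ends up being carried forward by Apple.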

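ifu · posted 2013-10-18 11:11

The reason the iPhone 5s scores low in the 3DMark Physics test still comes down to cache misses caused by random access.
The A7 has plenty of execution resources; Futuremark also mentioned that a 2x-3x speedup is achievable when access is sequential.
Truly random access is unsolvable for anyone, because it is unpredictable.
Generally, execution speed differs by one to two orders of magnitude between a cache hit and a cache miss. If every memory access is a miss, then however strong the A7's execution resources are, they go to waste. For the A7, the 3DMark Physics test turns into a random memory access test.
Possible ways to improve the A7's score in this kind of random memory access test:
1) Raise the clock frequency. With a higher clock, L/S (load/store) operations execute more often, but L/S remains the bottleneck.
2) Increase the cache size. The A7 currently has 1MB of L2; with 2MB or more, the L2 might cover this test's data set.
3) Add another L/S unit...
4) Add cores, which is equivalent to adding another L/S unit.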
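
The hit-versus-miss argument above can be put into numbers with the usual average-access-time formula (hit cost plus miss rate times miss penalty). This is only a back-of-the-envelope sketch: the 4-cycle hit cost and 200-cycle miss penalty are made-up placeholders, not measured A7 figures, and `average_access_cycles` is a hypothetical helper name.

```python
def average_access_cycles(miss_rate, hit_cycles=4.0, miss_penalty_cycles=200.0):
    """Average cycles per memory access.

    Every access pays hit_cycles; a fraction miss_rate of accesses
    additionally pays miss_penalty_cycles. Both latencies are assumed
    placeholder values, not A7 measurements.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.5, 1.0):
        cycles = average_access_cycles(rate)
        print(f"miss rate {rate:.2f}: {cycles:.1f} cycles per access")
```

With these placeholder latencies, going from no misses to all misses raises the average cost from 4 to 204 cycles per access, which is the kind of gap that extra execution resources cannot hide.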

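largewc · posted 2013-10-18 11:14

Last edited by largewc on 2013-10-18 11:15

ifu posted 2013-10-18 11:11
The reason the iPhone 5s scores low in the 3DMark Physics test still comes down to cache misses caused by random access.
The A7 has plenty of execution resources; fu ...

I think frankincense's take is more plausible: the improvement comes from the GPU, not the CPU.
The APU that AMD has pushed for so long is struggling on the PC, but Apple got it into practical use first.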

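ifu · posted 2013-10-18 11:15

Last edited by ifu on 2013-10-18 11:17

largewc posted 2013-10-18 10:50
Same point again: sequential memory gets a boost, but most program logic can't be sequential.

Most programs have good data locality. The issue with 3DMark is that its data set exceeds the A7's 1MB L2 cache.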
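
To see why the ratio of data-set size to cache size matters here, consider a simplified case: accesses spread uniformly at random over the working set, with a warm cache. Under that assumption the chance that a requested line is already cached is just the fraction of the working set that fits. The sketch below uses a hypothetical helper, `random_access_hit_rate`, and hypothetical working-set sizes; the actual size of the 3DMark Physics data is not given in this thread.

```python
def random_access_hit_rate(working_set_bytes, cache_bytes):
    """Steady-state hit rate for uniformly random accesses.

    Assumes every line in the working set is equally likely to be
    requested and the cache is already warm, so the hit rate is the
    fraction of the working set that the cache can hold.
    """
    if working_set_bytes <= 0 or cache_bytes < 0:
        raise ValueError("sizes must be positive")
    return min(1.0, cache_bytes / working_set_bytes)


if __name__ == "__main__":
    MB = 1024 * 1024
    for working_set in (512 * 1024, 2 * MB, 4 * MB):  # hypothetical sizes
        for cache in (1 * MB, 2 * MB):
            rate = random_access_hit_rate(working_set, cache)
            print(f"working set {working_set // 1024} KB, "
                  f"L2 {cache // MB} MB: hit rate {rate:.2f}")
```

Real workloads are not uniformly random, so this is a worst-case picture rather than a prediction; programs with good locality do much better than the ratio suggests.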

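largewc · posted 2013-10-18 11:17

Last edited by largewc on 2013-10-18 11:18

ifu posted 2013-10-18 11:15
Most programs have good data locality. The issue with 3DMark is that its data set exceeds the A7's 1MB

1MB? Any decompression job exceeds that, and so can any image processing task or script. The 3DMark result definitely has nothing to do with data size.

This really can boost large amounts of sequential work, that much is certain. I think the APU model is the trend, and future C++ compilers should be able to optimize for the GPU automatically.

There was a PS4 test a while ago, I forget where I saw it, in which decompression also used the APU's GPU acceleration, and the speed was far beyond what a CPU can do.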

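frankincense · posted 2013-10-18 11:18

largewc posted 2013-10-18 11:14
I think frankincense's take is more plausible: the improvement comes from the GPU, not the CPU.
The APU that AMD has pushed for so long is struggling on the PC, but Apple got ...

APUs have only just achieved unified memory addressing.
Intel hasn't officially supported acceleration on its integrated GPU yet, so software won't catch up that quickly.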
