POPPUR爱换

Title: [Technical help] How many AVX instructions can Sandy Bridge issue per cycle? Has it been cut down?

Author: itany    Time: 2009-9-28 14:24
Last edited by itany on 2009-9-28 15:45

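As the title says.
Yesterday I went through all the IDF slides I was interested in, in one sitting.

I'm very satisfied with Nehalem-EX: one reason is the dual-socket system, the other is the formidable memory spec.
There are parts of the RAS material I didn't understand...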

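I'm also fairly happy with Jasper Forest; personally I think this is closer to an ideal high-end personal desktop system.
Especially PCIe x16 + QPI, which lets the platform be split into tiers:
PCIe x16 straight from the CPU, same as Lynnfield; add a northbridge for PCIe x16 plus two more PCIe x16, giving three full-speed x16 slots for a three-card setup; or use QPI to go dual-socket, which likewise gives two PCIe x16, for top-end users and workstations.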

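I'm very dissatisfied with Sandy Bridge.
Going by the information given, Sandy can execute only one AVX instruction per cycle, and it does so by splitting each AVX operation into a high half and a low half and issuing them simultaneously on the two existing floating-point SSE issue ports, rather than simply widening the units as originally expected, which would have let two arithmetic instructions issue per cycle.
Also, the only real performance gain is being able to execute one load per cycle, and even that comes from splitting the AVX access and executing the halves simultaneously on the existing Load issue port and the Write Address port.
If that's the case, Sandy won't be able to bring out the full power of AVX, just as SSE really came into its own on Core 2 rather than on the P3. To get there we'd probably have to wait for Gesher... Couldn't they have done it in one step? Really disappointing!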
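
To put numbers on the two scenarios described above, here is a minimal Python sketch of peak single-precision throughput. It assumes two FP issue ports (one multiply, one add), one flop per 32-bit lane per instruction, and no FMA. The function `sp_flops_per_cycle` and its parameters are made up for illustration and say nothing about what the hardware actually does.

```python
def sp_flops_per_cycle(vector_bits, fp_ports, ports_per_instruction):
    """Peak single-precision flops per cycle under a simple port-count model.

    vector_bits: width of one vector instruction (128 for SSE, 256 for AVX)
    fp_ports: number of FP issue ports (assumed to be 2: one mul, one add)
    ports_per_instruction: ports one instruction occupies
        (2 if a 256-bit op is split into halves across both ports)
    """
    lanes = vector_bits // 32
    instructions_per_cycle = fp_ports // ports_per_instruction
    return lanes * instructions_per_cycle


# 128-bit SSE: one instruction on each of the two ports
sse_128 = sp_flops_per_cycle(128, fp_ports=2, ports_per_instruction=1)

# Scenario feared in the post: a 256-bit op is split across both ports
avx_split = sp_flops_per_cycle(256, fp_ports=2, ports_per_instruction=2)

# Scenario hoped for: both ports widened to 256 bits
avx_wide = sp_flops_per_cycle(256, fp_ports=2, ports_per_instruction=1)

print(sse_128, avx_split, avx_wide)  # 8 8 16
```

Under these assumptions the split design gives no peak gain over 128-bit SSE, while widened units would double it, which is exactly the difference the post is worried about.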

Calling on the experts to confirm this. Thanks!
Author: acqwer    Time: 2009-9-28 14:42
As I recall, Conroe also executes one SSE instruction per cycle.
Author: itany    Time: 2009-9-28 14:58
As I recall, Conroe also executes one SSE instruction per cycle.
acqwer posted on 2009-9-28 14:42


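Conroe can execute 3 integer SSE instructions and 2 floating-point SSE instructions per cycle.
Author: acqwer    Time: 2009-9-28 18:18
Conroe executes one SSE128 instruction per cycle; I don't suppose AVX has any 32-bit arithmetic instructions.
Author: potomac    Time: 2009-9-28 18:46
Notice: the author has been banned or deleted; content automatically hidden
Author: bessel    Time: 2009-9-28 23:38
Someone asked the same question: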
http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/68554/

The conclusion is:

Great questions … some more details to the response Max gave.

1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0) and a 256-bit add (on port 1) and a 256-bit shuffle (port 5) every cycle. 256-bit FP add and multiply bandwidth is therefore 2X higher flops than 128. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load only uses one AGU resource and we have 2 AGU’s) but only a 48-byte/cycle total is supported to the L1 each cycle. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 agu’s…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures 1.42X speedup (would have predicted 1.5X with the current architecture in the limit; vs. 1.0X if we had double pumped).
3) Alignment for 128-bit loads/stores is similar to Nehalem. The alignment penalty for 256-bit loads/stores is somewhat worse – that’s due to line splits and page splits. You are much more likely to split with wider loads, so alignment is much more important. That’s why, especially if you can guarantee 16 byte alignment but not 32-byte alignment, it often pays off to do load128/insertf128 instead of load256. Previous guidance to favor aligning stores (when you get a choice to align either a load or a store stream) still holds – store page splits are worse than load page splits.
4) Masked moves are not harmful, they are proving extremely useful. But they are designed for a specific problem – when the exception safety of nonmasked loads/stores can’t be guaranteed. They burn a blend resource, and they aren’t going to disambiguate as well as normal loads and stores, so I don’t use them when I don’t need them. If you are a vectorizing compiler, they’re great for peeling and remainder operations, vectorizing code with “if” protecting a possible exception, etc. If you are a human coder, I doubt you’ll need them: A bit of data overrun padding (often coupled with alignment) pays dividends in speed. You mention doing the blend yourself. Note that a variable blend requires 2 port-5 shuffles so in shuffle-limited code this doesn’t always win.
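
The 1.5X and 1.0X predictions in point 2 can be reproduced with a back-of-the-envelope bottleneck model. This is only a sketch. It assumes each (load, load, add, store) iteration is limited by whichever is slower, AGU slots (2 per cycle) or L1 bytes (48 per cycle), and that a double-pumped 256-bit access would cost two AGU slots. The function names are hypothetical, and the model ignores everything else in the pipeline.

```python
from fractions import Fraction

AGUS = 2                 # address-generation units per cycle, per the reply
L1_BYTES_PER_CYCLE = 48  # total L1 bytes per cycle, per the reply
ACCESSES_PER_ITER = 3    # matrix add: two loads + one store


def cycles_per_iter(width_bytes, agu_slots_per_access=1):
    """Cycles for one iteration: the slower of the AGU and L1 limits."""
    agu_cycles = Fraction(ACCESSES_PER_ITER * agu_slots_per_access, AGUS)
    l1_cycles = Fraction(ACCESSES_PER_ITER * width_bytes, L1_BYTES_PER_CYCLE)
    return max(agu_cycles, l1_cycles)


def output_bytes_per_cycle(width_bytes, agu_slots_per_access=1):
    """Result bytes produced per cycle for a given vector width."""
    return Fraction(width_bytes) / cycles_per_iter(width_bytes, agu_slots_per_access)


sse = output_bytes_per_cycle(16)                 # AGU-bound: 1.5 cycles per 16 bytes
avx_true = output_bytes_per_cycle(32)            # L1-bound: 2 cycles per 32 bytes
avx_double_pumped = output_bytes_per_cycle(32, agu_slots_per_access=2)

speedup_true = avx_true / sse
speedup_double_pumped = avx_double_pumped / sse

print(speedup_true, speedup_double_pumped)  # 3/2 1
```

In this model the 128-bit version is held back by the two AGUs and never reaches 48 bytes per cycle, while the 256-bit version is held back by L1 bandwidth instead, which is where the 1.5X limit comes from.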


Mark Buxton
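
Point 3's remark that wider loads are more likely to split can be illustrated by counting start offsets within one cache line. A rough sketch, assuming a 64-byte cache line and start addresses spread evenly over the offsets allowed by a given alignment; `split_fraction` is a hypothetical helper, and page splits are ignored.

```python
from fractions import Fraction

CACHE_LINE = 64  # bytes; assumed line size


def split_fraction(access_bytes, alignment):
    """Fraction of allowed start offsets at which an access crosses a line boundary."""
    offsets = range(0, CACHE_LINE, alignment)
    splits = sum(1 for off in offsets if off + access_bytes > CACHE_LINE)
    return Fraction(splits, len(offsets))


print(split_fraction(16, 16))  # 0     128-bit load, 16-byte aligned
print(split_fraction(32, 16))  # 1/4   256-bit load, only 16-byte aligned
print(split_fraction(32, 32))  # 0     256-bit load, 32-byte aligned
print(split_fraction(16, 1))   # 15/64 128-bit load, arbitrary alignment
print(split_fraction(32, 1))   # 31/64 256-bit load, arbitrary alignment
```

With only 16-byte alignment guaranteed, one in four 256-bit loads would cross a line in this model, while two 128-bit loads never do, which fits the load128/insertf128 advice.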
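As the title says.
Yesterday I went through all the IDF slides I was interested in, in one sitting.

I'm very satisfied with Nehalem-EX: one reason is the dual-socket system, the other is the formidable memory spec.
There are parts of the RAS material I didn't understand...

I'm also fairly happy with Jasper Forest; personally I think this is closer to an ideal personal desk ...
itany posted on 2009-9-28 14:24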

Author: itany    Time: 2009-9-29 00:44
Conroe executes one SSE128 instruction per cycle; I don't suppose AVX has any 32-bit arithmetic instructions.
acqwer posted on 2009-9-28 18:18


For this, you can look at the tests on this site.
Author: itany    Time: 2009-9-29 02:48
Someone asked the same question:
http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/68554/

The conclusion is:

Great questions … some more details to the response Max gave.

1)The chart is wro ...
bessel posted on 2009-9-28 23:38


Thanks for the pointer!
Much appreciated!

Looks like Intel played it straight after all; nothing was cut down.



