Someone asked the same question:
http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/68554/
The conclusion is:
Great questions … some more details to the response Max gave.
1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0) and a 256-bit add (on port 1) and a 256-bit shuffle (port 5) every cycle. 256-bit FP add and multiply bandwidth is therefore 2X higher flops than 128. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load only uses one AGU resource and we have 2 AGU’s) but only a 48-byte/cycle total is supported to the L1 each cycle. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 agu’s…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures 1.42X speedup (would have predicted 1.5X with the current architecture in the limit; vs. 1.0X if we had double pumped).
3) Alignment for 128-bit loads/stores is similar to Nehalem. The alignment penalty for 256-bit loads/stores is somewhat worse – that’s due to line splits and page splits. You are much more likely to split with wider loads, so alignment is much more important. That’s why, especially if you can guarantee 16 byte alignment but not 32-byte alignment, it often pays off to do load128/insertf128 instead of load256. Previous guidance to favor aligning stores (when you get a choice to align either a load or a store stream) still holds – store page splits are worse than load page splits.
4) Masked moves are not harmful, they are proving extremely useful. But they are designed for a specific problem – when the exception safety of nonmasked loads/stores can’t be guaranteed. They burn a blend resource, and they aren’t going to disambiguate as well as normal loads and stores, so I don’t use them when I don’t need them. If you are a vectorizing compiler, they’re great for peeling and remainder operations, vectorizing code with “if” protecting a possible exception, etc. If you are a human coder, I doubt you’ll need them: A bit of data overrun padding (often coupled with alignment) pays dividends in speed. You mention doing the blend yourself. Note that a variable blend requires 2 port-5 shuffles so in shuffle-limited code this doesn’t always win.
Mark Buxton
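The arithmetic behind points 1) to 3) of the reply can be sketched in a few lines of Python. This is only a rough model, not Intel's own analysis. The port and AGU counts and the 48-byte/cycle L1 limit come from the reply. The 64-byte cache line size and all function names are my assumptions for illustration.

```python
# Back-of-envelope model of the figures quoted in the reply above.
# Function names are hypothetical; the 64-byte cache line is an assumption
# not stated in the reply.

CACHE_LINE_BYTES = 64


def peak_sp_flops_per_cycle(vector_bits):
    """Point 1: one multiply (port 0) and one add (port 1) issue per cycle."""
    lanes = vector_bits // 32  # single-precision lanes per vector
    return 2 * lanes           # one mul plus one add per lane


def l1_bytes_per_cycle(vector_bits, agus=2, l1_limit=48):
    """Point 2: each access uses one AGU; the L1 moves at most l1_limit bytes/cycle."""
    return min(agus * vector_bits // 8, l1_limit)


def split_fraction(load_bytes, alignment):
    """Point 3: share of `alignment`-aligned offsets where a load crosses a cache line."""
    offsets = range(0, CACHE_LINE_BYTES, alignment)
    splits = sum(1 for o in offsets if o + load_bytes > CACHE_LINE_BYTES)
    return splits / len(offsets)


if __name__ == "__main__":
    # 256-bit vs 128-bit peak flops ratio
    print(peak_sp_flops_per_cycle(256) / peak_sp_flops_per_cycle(128))  # 2.0
    # 256-bit vs 128-bit L1 bandwidth ratio for a memory-limited kernel
    print(l1_bytes_per_cycle(256) / l1_bytes_per_cycle(128))            # 1.5
    # With only 16-byte alignment: 16-byte loads never split, 32-byte loads do
    print(split_fraction(16, 16), split_fraction(32, 16))               # 0.0 0.25
```

Under this model, 128-bit operations top out at 2 × 16 = 32 bytes/cycle while 256-bit operations hit the 48-byte limit, which gives the predicted 1.5X (the reply reports 1.42X measured). With data that is only 16-byte aligned, one in four 32-byte loads crosses a line boundary, whereas 16-byte loads never do. That is consistent with the load128/insertf128 advice.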
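As the title says.
Yesterday I went through all the IDF slides I was interested in, in one sitting.
I'm very pleased with Nehalem-EX: first for the two-socket systems, and second for the powerful memory specifications.
There are parts of the RAS material I didn't understand...
I'm also fairly pleased with Jasper Forest; personally I think this is the more ideal personal desk ...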
itany, posted 2009-9-28 14:24