Intel Larrabee architecture discussion thread

eDRAM · posted 2008-8-5 00:38

Originally posted by Edison on 2008-8-5 00:34
The press release is out :)

http://www.intel.com/pressroom/archive/releases/20080804fact.htm?cid=rss-90004-c1-210459

The Larrabee architecture uses a 1024 bits-wide, bi-directional ring network (i.e., 512 bit ...

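A question for Edison: is Larrabee (LBB) stronger than the GTX280?

Edison · posted 2008-8-5 00:40

Originally posted by eDRAM on 2008-8-5 00:38
A question for Edison: is Larrabee (LBB) stronger than the GTX280?

It should be stronger for general-purpose workloads; gaming performance is unknown. But it will be up against products built on AMD/NVIDIA's next-generation architectures, so a simple comparison with the GTX280 isn't very meaningful.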

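cxj3000 · posted 2008-8-5 00:41

Seeing how hard the poster above is working at asking, I'll let you in on a secret: yes, it's strong!

Edison · posted 2008-8-5 00:41

The Larrabee paper is out. It mentions dedicated instructions and instruction modes for explicit cache control: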

Larrabee also adds new instructions and instruction modes for
explicit cache control. Examples include instructions to prefetch
data into the L1 or L2 caches and instruction modes to reduce the
priority of a cache line. For example, streaming data typically
sweeps existing data out of a cache. Larrabee is able to mark each
streaming cache line for early eviction after it is accessed. These
cache control instructions also allow the L2 cache to be used
similarly to a scratchpad memory, while remaining fully coherent.
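
(Editorial aside, not from the paper.) The early-eviction idea in the paragraph above can be illustrated with a toy cache model. This is only a sketch under loud assumptions: `TinyCache`, `run`, and the `evict_early` flag are hypothetical names, and the cache is a small fully associative LRU list, which is not a description of Larrabee's real replacement policy. It just shows why marking streaming lines as the next victims keeps a small reused working set resident.

```python
from collections import OrderedDict


class TinyCache:
    """Toy fully associative LRU cache (hypothetical; not real Larrabee hardware)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # front = next victim, back = most recently used
        self.hits = 0
        self.misses = 0

    def access(self, addr, evict_early=False):
        if addr in self.lines:
            self.hits += 1
            del self.lines[addr]
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the line at the front
        self.lines[addr] = True  # (re)insert as most recently used
        if evict_early:
            # ASSUMPTION: the hint simply makes this line the next victim.
            self.lines.move_to_end(addr, last=False)


def run(use_hint, rounds=4, hot_lines=6, stream_per_round=8, capacity=8):
    """Reuse a small hot set each round while streaming unique addresses past it."""
    cache = TinyCache(capacity)
    next_stream = 1000  # streaming addresses are never reused
    for _ in range(rounds):
        for addr in range(hot_lines):
            cache.access(addr)
        for _ in range(stream_per_round):
            cache.access(next_stream, evict_early=use_hint)
            next_stream += 1
    return cache.hits


if __name__ == "__main__":
    print("hits without hint:", run(use_hint=False))
    print("hits with hint:   ", run(use_hint=True))
```

In this toy setup the unhinted stream flushes the hot set every round, while the hinted stream only ever displaces other streaming lines.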

Within a single core, synchronizing access to shared memory by
multiple threads is inexpensive. The threads on a single core share
the same local L1 cache, so a single atomic semaphore read
within the L1 cache is sufficient. Synchronizing access between
multiple cores is more expensive, since it requires inter-processor
locks. This is a well known problem in multi-processor design.

Multi-issue CPU cores often lose performance due to the
difficulty of finding instructions that can execute together.
Larrabee’s dual-issue decoder has a high multi-issue rate in code
that we’ve tested. The pairing rules for the primary and secondary
instruction pipes are deterministic, which allows compilers to
perform offline analysis with a wider scope than a runtime out-of-order
instruction picker can. All instructions can issue on the
primary pipeline, which minimizes the combinatorial problems
for a compiler. The secondary pipeline can execute a large subset
of the scalar x86 instruction set, including loads, stores, simple
ALU operations, branches, cache manipulation instructions, and
vector stores. Because the secondary pipeline is relatively small
and cheap, the area and power wasted by failing to dual-issue on
every cycle is small. In our analysis, it is relatively easy for
compilers to schedule dual-issue instructions.
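
(Editorial aside, not from the paper.) To make the dual-issue paragraph above concrete, here is a rough sketch of how a compiler-side tool might count issue cycles for an in-order stream. Everything in it is an assumption for illustration: the paper excerpt does not give the actual pairing rules, so `SECONDARY_OK`, `Instr`, and `count_cycles` are hypothetical, the instruction kinds are made-up labels, and the only dependency check is "the second instruction must not read what the first one writes".

```python
from dataclasses import dataclass

# Kinds the quoted text says the secondary pipe can execute (labels are hypothetical).
SECONDARY_OK = {"load", "store", "alu", "branch", "cache", "vstore"}


@dataclass(frozen=True)
class Instr:
    kind: str           # e.g. "vmul", "load", "alu"
    dst: str = ""       # register written, "" if none
    srcs: tuple = ()    # registers read


def count_cycles(program):
    """Greedy in-order pairing: issue program[i] on the primary pipe and, when
    allowed, program[i + 1] on the secondary pipe in the same cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        first = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        can_pair = (
            nxt is not None
            and nxt.kind in SECONDARY_OK                      # secondary pipe can run it
            and not (first.dst and first.dst in nxt.srcs)     # ASSUMED dependency rule
        )
        i += 2 if can_pair else 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    program = [
        Instr("vmul", "v0", ("v1", "v2")),
        Instr("load", "r1", ("r2",)),        # pairs with the vmul
        Instr("vadd", "v3", ("v0", "v4")),
        Instr("vstore", "", ("v3",)),        # reads v3, so it cannot pair with the vadd
        Instr("alu", "r5", ("r1",)),         # pairs with the vstore
    ]
    print(len(program), "instructions in", count_cycles(program), "cycles")
```

Because the rules are fixed, a static pass like this gives the same answer every time it sees the same code, which is the property the paragraph says compilers can exploit.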

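Prescott · posted 2008-8-5 00:42

Originally posted by eDRAM on 2008-8-5 00:38

A question for Edison: is Larrabee (LBB) stronger than the GTX280?

You'll know as soon as that paper is published. Oops, not the one above.

[ Last edited by Prescott on 2008-8-5 00:49 ]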

Edison · posted 2008-8-5 00:52

Originally posted by Prescott on 2008-8-5 00:42
You'll know as soon as that paper is published. Oops, not the one above.

Mine is this one, isn't it?

http://portal.acm.org/citation.cfm?doid=1360612.1360617

Prescott · posted 2008-8-5 00:57

Originally posted by Edison on 2008-8-5 00:52

Mine is this one, isn't it?

http://portal.acm.org/citation.cfm?doid=1360612.1360617

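Oh, it's not quite the same.
But Figure 10 is already enough to work out LRB's DX performance.
Larrabee units needed to run at 60 frames per second (one Larrabee core at 1 GHz is one Larrabee unit, so a 32-core, 2 GHz Larrabee is roughly 64 Larrabee units):
HL2: 10
FEAR and Gears of War: 25.

[ Last edited by Prescott on 2008-8-5 01:02 ]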
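
(Editorial aside.) The Larrabee-unit arithmetic in Prescott's post above can be written out as a short sketch. The unit definition and the 10 / 25 figures come from the post; the function names and, above all, the linear-scaling estimate in `naive_fps` are assumptions added here for illustration. Real frame rates would also depend on memory bandwidth, the scene, and how well the workload scales across cores.

```python
def larrabee_units(cores, clock_ghz):
    """One core at 1 GHz counts as one Larrabee unit, per the post above."""
    return cores * clock_ghz


# Units needed for 60 fps, as quoted from Figure 10 in the post above.
UNITS_FOR_60FPS = {"HL2": 10, "FEAR": 25, "Gears of War": 25}


def naive_fps(units, game):
    # ASSUMPTION: frame rate scales linearly with Larrabee units.
    # This is a back-of-the-envelope guess, not a measured result.
    return 60.0 * units / UNITS_FOR_60FPS[game]


if __name__ == "__main__":
    units = larrabee_units(cores=32, clock_ghz=2.0)
    print("units:", units)
    for game in UNITS_FOR_60FPS:
        print(game, "naive fps estimate:", round(naive_fps(units, game), 1))
```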

Edison · posted 2008-8-5 01:04

It looks like the segments they tested vary quite a lot. For GoW, I have a screenshot here, taken at 1920x1200, 4AA, 16AF on a 9800GX2:

http://www.pcinlife.com/article/ ... 07107328d527_1.html

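天下18 · posted 2008-8-5 01:05

Notice: the author has been banned or deleted; content automatically hidden

Edison · posted 2008-8-5 01:06

Originally posted by 天下18 on 2008-8-5 01:05
Oh, that looks pretty good. What were the settings? Resolution, AA and AF?

The LRB chart is at 1600x1200, 4 samples.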

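Prescott · posted 2008-8-5 01:14

Originally posted by Edison on 2008-8-5 01:06

The LRB chart is at 1600x1200, 4 samples.

Looks like Edison will now be interested in running detailed tests of FEAR, HL2 ep2 and GoW at 1600x1200, 4 samples on top-end cards, haha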

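Edison · posted 2008-8-5 01:19

Originally posted by 水银 on 2008-8-5 01:12
Why is there no description of the TMUs and ROPs?

LRB has no ROPs; that part is done in "software". It does have dedicated TMUs, similar to the TMUs on other GPUs, and each LRB core has a 32 KB tex-cache (similar to G80's TPC / RV770's SC?). Of course, if certain texture operations run fast enough on the cores, they can be executed there instead.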

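Edison · posted 2008-8-5 01:21

Originally posted by Prescott on 2008-8-5 01:14
Looks like Edison will now be interested in running detailed tests of FEAR, HL2 ep2 and GoW at 1600x1200, 4 samples on top-end cards, haha

The first two aren't interesting, they're DX9. GoW needs its frame rate cap hacked, which is a hassle; I'd rather just run a UE3-based game like Assassin's Creed...

Intel's test data consists of frames from different scenes, so even a rough comparison would still show large discrepancies.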

RacingPHT · posted 2008-8-5 10:03

Notice: the author has been banned or deleted; content automatically hidden

991060 · posted 2008-8-5 10:17

http://softwarecommunity.intel.c ... rrabee_manycore.pdf

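ITANIUM2 · posted 2008-8-5 11:15

Originally posted by Edison on 2008-8-4 20:27
Larrabee's driver team isn't the GMA crowd; it's a large team made up of its own dedicated software engineers plus people from 3DLabs. Intel is going all in this time.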

http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3367&p=15

I asked Intel who was wor ...

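Impressive, seriously impressive.
3DLabs' drivers were very strong back in the day; a pity the hardware wasn't up to it. Now those people will be put to good use.

Edison · posted 2008-8-5 11:19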

http://softwarecommunity.intel.com/articles/eng/3803.htm

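Edison · posted 2008-8-5 11:25

Originally posted by ITANIUM2 on 2008-8-5 11:15
Impressive, seriously impressive.
3DLabs' drivers were very strong back in the day; a pity the hardware wasn't up to it. Now those people will be put to good use.

My impression is the opposite: 3D Labs' hardware was strong and the software was weak. For example, their P10 back then was SIMD scalar, had a Geometry Shader, and was the first to offer a 256-bit memory bus on paper, yet its actual gaming performance can only be described as dismal.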

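ITANIUM2 · posted 2008-8-5 11:32

Originally posted by Edison on 2008-8-5 11:25

My impression is the opposite: 3D Labs' hardware was strong and the software was weak. For example, their P10 back then was SIMD scalar, had a Geometry Shader, and was the first to offer a 256-bit memory bus on paper, yet its actual gaming performance can only be described as dismal.

The Wildcat series cards were aimed mainly at OpenGL applications and had very poor game support. I used a Wildcat 7110 back then; it didn't support DirectX, so I could only play Age of Empires inside a virtual machine, sigh.

But for compute work, when computing and displaying at the same time, CPU usage was very low, which other cards couldn't manage.

I'm not familiar with what you mentioned, so I may have gotten it wrong, sorry.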

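the_god_of_pig · posted 2008-8-5 11:46

Originally posted by Prescott on 2008-8-5 00:57

Oh, it's not quite the same.
But Figure 10 is already enough to work out LRB's DX performance.
Larrabee units needed to run at 60 frames per second (one Larrabee core at 1 GHz is one Larrabee unit, so a 32-core, 2 GHz Larrabee is roughly 64 Larrabee units):
HL2: 10
FEAR and ...

Hopefully that description isn't the whole picture; otherwise even an X2 version of the GTX280 couldn't beat Larrabee (lol)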
