Nvidia has hit out at claims it's deliberately hobbling CPU PhysX, describing the reports as "factually inaccurate."
Speaking to THINQ, Nvidia's senior PR manager Bryan Del Rizzo said "any assertion that we are somehow hamstringing the CPU is patently false." Del Rizzo states that "Nvidia has and will continue to invest heavily on PhysX performance for all platforms - including CPU-only on the PC."
The response follows a recent report on the Web claiming CPU-PhysX was unnecessarily reliant on x87 instructions, rather than SSE. The report also suggested PhysX wasn't properly multi-threaded, with the test benchmarks showing a dependence on just one CPU core.
Let's start with multi-threading, which Del Rizzo says is readily available in CPU-PhysX, and "it's up to the developer to allocate threads as they see fit based on their needs." He points out that you only need to look at the scaling shown in the CPU tests in 3DMark Vantage and FluidMark to see that CPU-PhysX is perfectly capable of scaling performance as more cores are added.
However, he notes that the current PhysX 2.x code "dates back to a time when multi-core CPUs were somewhat of a rarity," explaining why CPU-PhysX isn't automatically multi-threaded by default. Yet despite this, Del Rizzo says it's easy enough for a developer to implement multi-threading in PhysX 2.x.
"There are some flags that control the use of 'worker threads' which perform functions in the rigid body pipeline," he says as an example, "and each NxScene runs in a separate thread."
The point appears to be moot in the long-term anyway, as Nvidia is apparently planning to introduce new automatic multi-threading features in the forthcoming PhysX 3.0 SDK.
This "uses a task-based approach that was developed in conjunction with our Apex product to add in more automatic support for multi-threading," explains Del Rizzo.
The new SDK will automatically take advantage of however many cores are available, or the number of cores set by the developer, and will also provide the option of a "thread pool" from which "the physics simulation can draw resources that run across all cores."
In addition to the new multi-threading features, Del Rizzo also says "SSE will be turned on by default" in the new SDK. However, he notes that "not all developers want SSE enabled by default, because they still want support for older CPUs for their software versions."
Why do games developers still want to provide support for CPUs that are over ten years old? Del Rizzo says it's up to the game devs and what they demand, but he reiterates that it's definitely not a deliberate attempt to hobble CPU PhysX. The original report by Real World Technologies showed the Dark Basic PhysX Soft Body demo (below) making heavy use of x87 instructions, rather than SSE.
"We have hundreds of developers who are using PhysX in their applications," he says, "and we have a responsibility to ensure we do not break compatibility with any platforms once they have shipped. Historically, we couldn't become dependent on any hardware feature like SSE after the first revision has shipped."
He also points out that the PhysX 2.x SDK does feature at least some SSE code, and that SSE isn't necessarily faster anyway. "We have found sometimes non-SSE code can result in higher performance than SSE vector code in many situations," he says. However, SSE will apparently be the way forward for CPU-PhysX in the long term. "We will continue to use SSE and we plan to enable it by default in future releases," says Del Rizzo.
In short, it looks as though there's a fair bit of legacy detritus in the current PhysX SDK, partly due to the demands from games devs. Nevertheless, there are already ways in which developers can use multi-threading in CPU-PhysX, and full SSE support and improved multi-threading will be coming shortly.
This doesn't look like a company trying to deliberately cripple CPU-PhysX to make its GPUs look good.
One of the latest challenges in computer gaming is modeling the game environment with a high degree of realism. The most obvious gaming improvements in the last 25 years have been graphical: from the early days of 2D sprites in games like Metroid or King's Quest, to 3D rendering with Glide and later DirectX and OpenGL, powering the latest games like Crysis. Features such as multi-sample anti-aliasing and anisotropic filtering produce more attractive images, and increasing amounts of effort and computational capacity are spent on accurately portraying difficult phenomena such as smoke, water reflections, hair and shadows. However, an accurate visualization of an object is only as convincing and realistic as the modeling of the object itself; a glass hurled against a wall that bounces away harmlessly is unlikely to be convincing, no matter how beautifully rasterized (or ray traced). Consequently, as graphics have improved, modeling the underlying behavior has become increasingly important. This article delves into the recent history of real-time game physics libraries (specifically PhysX), and analyzes the performance characteristics of PhysX. In particular, through our experiments we found that PhysX uses an exceptionally high degree of x87 code and no SSE, which is a known recipe for poor performance on any modern CPU.
History of PhysX
Advances in semiconductor manufacturing have pushed Moore's Law inexorably forward over the last 25 years. Graphics, by its nature, is a trivially parallel application and has taken full advantage of the additional transistors and integration offered by Moore's Law. Each new generation of hardware is more powerful and enables new levels of graphical quality. In this context, real-time physics engines were developed to help games accurately model the behavior of objects, according to the relevant laws of physics (e.g. Newtonian mechanics). In 2006, Ageia, a little-known hardware start-up out of Washington University in St. Louis, launched a dedicated coprocessor for physics. Ageia's physics engine was cleverly known as PhysX and ran on a specialized Physics Processing Unit (PPU). Ageia was hoping and betting (unwisely) that the PPU would transform the video game industry in the same way that 3dfx's Voodoo graphics cards did in the 1990s. Unfortunately, there is not room for more than two processors in a modern PC, and the CPU and GPU have already made their mark. Even two separate processors is somewhat dubious, given that the vast majority of the market is comfortable with integrated graphics. More problematic, software developers were reluctant to fully embrace the PhysX API, given that few gamers were buying the hardware: the perennial chicken-and-egg problem. Unsurprisingly, the company was bought at fire-sale prices by Nvidia in 2008…so in that sense, Ageia did live up to the legacy of 3dfx.
The Ageia PPU (below in Figure 1) itself was not particularly revolutionary. In many respects, it resembled Sony's Emotion Engine in the PS2 or the Cell in the PS3; it was a primitive throughput-optimized processor, albeit with a familiar instruction set. The PPU had a 32-bit MIPS control processor with many vector execution units. It was 183mm2, manufactured on a 130nm TSMC process, which was hardly modern; contemporary CPUs were using a 65nm process.
Figure 1 - Ageia’s Physics Processing Unit (PPU), courtesy of the Tech Report
At Nvidia, the PPU was wisely discontinued and PhysX was ported to the proprietary CUDA programming environment. The goal was to execute on Nvidia's GPUs (instead of the PPU) and demonstrate the benefits of GPU computing (sometimes known as GPGPU) for consumers. One of the advantages of executing on an Nvidia GPU rather than the PPU is that many gamers actually own Nvidia GPUs, thus solving part of the chicken-and-egg problem for software developers. Of course, PhysX can also execute on the CPU, albeit with reduced performance. This guarantees that games written with PhysX will function correctly on any PC platform; however, there are no performance guarantees. Additionally, PhysX continues to be used as a software library on the three major consoles, yet another incentive for developer adoption.
Profiling PhysX
Nvidia unquestionably uses PhysX as an exclusive marketing tool for their GPUs, and it clearly benefits from executing on a GPU. Nvidia claims that a modern GPU can improve physics performance by 2-4X over a CPU. That's a pretty impressive claim, and some benchmarks (e.g. Cryostasis) seem to bear that out. However, detractors of Nvidia (largely those working at one of Nvidia's competitors) have repeatedly claimed that PhysX purposefully handicaps execution on a CPU to make GPUs look better. Of course, comments from a competitor should be taken with a large grain of salt. But if Nvidia does cripple CPU PhysX, it would throw into question the extent to which GPU PhysX is really beneficial. Certainly a 4X advantage is worthwhile. However, if the CPU is really hobbled and runs 2X slower by design, that would mean that the GPU only has a 2X advantage in reality, which is far less impressive.
A couple months ago, we decided we would profile a couple of applications which use PhysX to test how PhysX behaves on the CPU and GPU. Initially, we were going to use VTune to compare, contrast and analyze both GPU accelerated and CPU PhysX by collecting performance counter data. However, after we first ran the experiment with VTune to analyze PhysX execution on the CPU, our results were so strange that we changed our plan to focus solely on profiling CPU PhysX and examine how it is tuned for the CPU.
Experimental Setup
Our test system is a relatively modern 3.2GHz Nehalem (Bloomfield), with an Nvidia GTX 280 GPU and 3GB of memory (3 DIMMs). It runs Windows 7 (64-bit), with nvcuda.dll version 8.17.11.9621 and PhysX version 09.09.1112. To test PhysX, we used the Cryostasis tech demo and the Dark Basic PhysX Soft Body Demo, and analyzed the execution using Intel's VTune. In each case, hardware PhysX acceleration was disabled in the Nvidia control panel, and the test was then run under VTune. For comparison, the two tests were also run with GPU-accelerated PhysX. As expected, the GPU-accelerated versions ran at a reasonable speed with very nice effects, whereas the CPU chugged along rather sluggishly. There was a very clear difference in performance that shows the benefits of accelerating PhysX on a GPU.
VTune analyzes the execution of an application at several levels of granularity. The coarsest is the processes running in the system. From there, VTune can drill down into interesting processes and examine the threads within the process. The finest granularity is inspecting the individual modules executed within each thread. For each of the tests, we analyzed at every level and highlighted the key processes, threads and modules being used. We also tracked several performance counters, which are reported in our results:
Cycles – The number of unhalted clock cycles
Instructions – The total number of instructions retired
x87 instructions – The total number of x87 instructions retired, which will be a portion of the overall instructions retired
x87 uops – The number of x87 uops executed (note that a uop can be executed, but then squashed e.g. due to branch misprediction).
FP SSE uops – The number of floating point SSE uops executed (this includes SSE1 and SSE2 uops, both scalar and packed)
VTune also tracks Instructions Per Cycle (or IPC), which is the average number of instructions retired each cycle. Nehalem can retire up to 4 instructions per cycle, and realistically it can probably sustain an IPC of 0.5-1.5 on most workloads.
One essential reminder: VTune uses statistical sampling, and thus the accuracy depends on the number of samples. If there are relatively few samples, then the numbers may vary substantially. In general, the longer running processes/threads/modules will be sampled more often and hence generate more accurate data, while those processes/threads/modules which run only briefly may yield less than ideal results. One advantage of working with modern CPUs is that they execute billions of cycles per second, so the law of large numbers ensures that the results are accurate and stable.
While it would be nice to track many more performance counters, we were ultimately limited by the amount of time available, and frankly many of the counters were relatively uninteresting in the context of PhysX. The results of our profiling are on the next page.

Posted by Edison, 2010-7-7 17:03
Profiling Results
With VTune, we first profiled at the coarsest granularity - focusing on processes running in the system. Based on the number of instructions retired and cycles spent, we selected the top processes. To drill down further, we profiled the threads within each top process. Last, we selected the top threads and then profiled the modules within each top thread.
Chart 1 below shows the results from profiling the active processes for both workloads (Cryostasis and Soft Body Physics). In each case, we kept the top 10 processes, as measured by the percentage of instructions retired. Generally, the percentage of cycles is closely correlated with the instructions retired, but there is some slight variation. In each of the charts, we bolded the entries that were important and selected for further analysis. The right-hand side of the chart contains the number of events observed during the experiment, while the left-hand side contains percentages for each type of event observed during the experiment. For example, 90.9% of the floating point SSE uops observed during our experiment were executed from the Cryostasis process.
Chart 1 – Process level view of PhysX applications
In Cryostasis, there is only one process of significance, cryostasis.exe itself; all others constitute roughly 2% of instructions retired and 10% of the cycles. Strangely enough, Cryostasis uses a tremendous amount of x87 instructions; roughly 31% of the instructions retired are x87. There are plenty of x87 uops, but hardly any SSE floating point uops, roughly a 100:1 ratio. Perhaps at finer granularity, it will be clear exactly where these x87 instructions are coming from. Despite the x87 instructions, the IPC is a respectable 1.15.
Similarly, the Soft Bodies demo is dominated by a single process which accounts for almost all instructions (97%) and cycles (87%). The SoftBodies.exe process is heavily weighted towards x87 instructions, which are 31% of all retired instructions, with few SSE floating point operations. Like Cryostasis, the IPC is pretty good, achieving 1.23, largely due to the structured nature of the underlying physics code. The slight difference between the two probably reflects the additional code required for a game, rather than a simple screen demo.
Chart 2 – Thread level view of PhysX applications
Drilling down to the thread level in Chart 2, there are two significant threads within the cryostasis.exe process, although the labels defy easy comprehension. Thread99 is the more important of the two, accounting for 80% of the cycles and instructions retired, although thread24 is significant enough to note. Looking at thread99 in Chart 3, the vast majority of time is spent inside the PhysXCore.dll module, which uses no SSE and all x87 for floating point calculations (roughly 35% of instructions retired in PhysXCore.dll are x87). In fact, PhysXCore.dll is the culprit responsible for 91% of all x87 instructions retired in the entire process. Despite the use of x87, the IPC is fairly high, 1.4 instructions retired per cycle.
Chart 3 – Module level view of Cryostasis
Thread24 corresponds primarily to cryostasis.exe itself and is a smaller portion of the overall process (roughly 10%). Thread24 uses some SSE floating point operations, although this is still dwarfed by the overall use of x87 operations. There are roughly 3X as many x87 uops as SSE floating point uops, and the x87 instructions are 15% of the instructions retired in the module and 3% of the instructions retired in the process.
SoftBodies.exe has two principal component threads; thread71 is roughly 73% of the overall instructions retired and cycles, while thread1 is the remaining 26%. Thread71 is almost entirely composed of the PhysXCore.dll module. Again, this module does not use any SSE and instead relies on x87; an incredible 40% of the retired instructions are x87. Since the module dominates the process overall, it is not surprising that 95% of the x87 instructions retired in the process are found within this one module. The IPC for this module is similar to the IPC observed when executing cryostasis, a healthy 1.4, which helps to explain the overall IPC of the process.
Oddly enough, neither workload is multithreaded in a meaningful way. In each case, one thread is doing 80-90% of the work, rather than the work being split evenly across two or four threads, or, as on an Nvidia GPU, hundreds of threads.
Chart 4 – Module level view of SoftBodies.exe
The second and smaller thread1 is primarily ole32.dll, which is a library used by Windows for OLE (Object Linking and Embedding). The ole32.dll module has a little x87 code, about 6% of instructions retired, but far less than the massive 40% found in PhysXCore.dll. It's not quite clear what the library is actually doing, but it only contributes a little to the overall use of x87.
Overall, the results are somewhat surprising. In each case, the PhysX libraries are executing with an IPC>1, which is pretty good performance. But at the same time, there is a disturbingly large amount of x87 code used in the PhysX libraries, and no SSE floating point code. Moreover, PhysX code is automatically multi-threaded on Nvidia GPUs by the PhysX and device drivers, whereas there is no automatic multi-threading for CPUs.
Why x87?
The x87 floating point instructions are positively ancient, and have long since been deprecated in favor of the much more efficient SSE2 instructions (and soon AVX). Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD has deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA's C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005, and there is no reason to use x87, unless software has to run on an embedded Pentium or 486.
x87 uses a stack of 8 registers with an extended precision 80-bit floating point format. However, x87 data is primarily stored in memory with a 64-bit format that truncates the extra 16 bits. Because of this truncation, x87 code can return noticeably different results if the data is spilled to cache and then reloaded. x87 instructions are scalar by nature, and even the highest performance CPUs can only execute two x87 operations per cycle.
In contrast, SSE has a flat set of 128-bit registers (8 in 32-bit mode, 16 in 64-bit mode). Floating point numbers can be stored in a single precision (32-bit) or double precision (64-bit) format. A packed (i.e. vectorized) SSE2 instruction can perform two double precision operations, or four single precision operations. Thus a CPU like Nehalem or Shanghai can execute 4 double precision operations, or 8 single precision operations per cycle. With AVX, that will climb to 8 or 16 operations respectively. SSE also comes in a scalar variety, where only one operation is executed per instruction. However, scalar SSE code is still somewhat faster than x87 code, because there are more registers, SSE instructions are slightly lower latency than the x87 equivalents, and stack manipulation instructions are not needed. Additionally, some SSE non-temporal memory accesses are substantially faster (e.g. 2X for AMD processors) as they use a relaxed consistency model. So why is PhysX using x87?
PhysX is certainly not using x87 because of the advantages of extended precision. The original PPU hardware only had 32-bit single precision floating point, not even 64-bit double precision, let alone the extended 80-bit precision of x87. In fact, PhysX probably only uses single precision on the GPU, since it is accelerated on the G80, which has no double precision. The evidence all suggests that PhysX only needs single precision.
PhysX is certainly not using x87 because it contains legacy x87 code. Nvidia has the source code for PhysX and can recompile at will.
PhysX is certainly not using x87 because of a legacy installed base of older CPUs. Any gaming system purchased since 2005 will have SSE2 support, and the PPU was not released till 2006. Ageia was bought by Nvidia in 2008, and almost every CPU sold since then (except for some odd embedded ones) has SSE2 support. PhysX is not targeting any of the embedded x86 market either; it’s designed for games.
The truth is that there is no technical reason for PhysX to be using x87 code. PhysX uses x87 because Ageia and now Nvidia want it that way. Nvidia already has PhysX running on consoles using the AltiVec extensions for PPC, which are very similar to SSE. It would probably take about a day or two to get PhysX to emit modern packed SSE2 code, and several weeks for compatibility testing. In fact for backwards compatibility, PhysX could select at install time whether to use an SSE2 version or an x87 version – just in case the elusive gamer with a Pentium Overdrive decides to try it.
But both Ageia and Nvidia use PhysX to highlight the advantages of their hardware over the CPU for physics calculations. In Nvidia's case, they are also using PhysX to differentiate from AMD's GPUs. The sole purpose of PhysX is to act as a competitive differentiator that makes Nvidia's hardware look good and sells more GPUs. Part of that is making sure that Nvidia GPUs look a lot better than the CPU, since that is what they claim in their marketing. Using x87 definitely makes the GPU look better, since the CPU will perform worse than if the code were properly generated to use packed SSE instructions.
Analysis
Realistically, Nvidia could use packed, single precision SSE for PhysX, if they wanted to take advantage of the CPU. Each instruction would execute up to 4 SIMD operations per cycle, rather than just one scalar operation. In theory, this could quadruple the performance of PhysX on a CPU, but the reality is that the gains are probably in the neighborhood of 2X on the current Nehalem and Westmere generation of CPUs. That is still a hefty boost and could easily move some games from the unplayable <24 FPS zone to >30 FPS territory when using CPU based PhysX. To put that into context, here’s a quote from Nvidia’s marketing:
[In Cryostasis], with fine grained simulation of water, icicle destruction, and particle effects, the CPU shows itself as woefully inadequate for delivering playable framerates. GPUs that lack PhysX support become bottlenecked as a result, delivering the same level of performance irrespective of the hardware's graphics capability. GeForce GPUs with hardware physics support show a 2-4x performance gain, delivering great scalability across the GPU lineup.
That 2-4X performance gain sounds respectable on paper. In reality though, if the CPU could run 2X faster by using properly vectorized SSE code, the performance difference would drop substantially and in some cases disappear entirely. Unfortunately, it is hard to determine how much performance x87 costs. Without access to the source code for PhysX, we cannot do an apples-to-apples comparison that pits PhysX using x87 against PhysX using vectorized SSE. The closest comparison would be to compare the three leading physics packages (Havok from Intel, PhysX from Nvidia and the open source Bullet) on a given problem, running on the CPU. Havok is almost certain to be highly tuned for SSE vectors, given Intel’s internal resources and also their emphasis on using instruction set extensions like SSE and the upcoming AVX. Bullet is probably not quite as highly optimized as Havok, but it is available in source form, so a true x87 vs. vectorized SSE experiment is possible.
Not only would this physics solver comparison reveal the differences due to x87 vs. vectorized SSE, it would also show the impact of multi-threading. A review at the Tech Report already demonstrated that in some cases (e.g. Sacred II), PhysX will only use one of several available cores in a multi-core processor. Nvidia has clarified that CPU PhysX is single threaded by default, with multi-threading left to the developer. Nvidia has demonstrated that PhysX can be multi-threaded using CUDA on top of their GPUs. Clearly, with the proper coding and infrastructure, PhysX could take advantage of several cores in a modern CPU. For example, Westmere sports 6 cores, and using two cores for physics could easily yield a 2X performance gain. Combined with the benefits of vectorized SSE over x87, it is easy to see how a proper multi-core implementation using 2-3 cores could match the gains of PhysX on a GPU.
While as a buyer it may be frustrating to see PhysX hobbled on the CPU, it should not be surprising. Nvidia has no obligation to optimize for their competitor's products. PhysX does not run on top of AMD GPUs, and nobody reasonably expects that it will. This is not only because of the extra development and support costs, but also because AMD would never want to give Nvidia early developer versions of its products. Nvidia wants PhysX to be an exclusive, and it will likely stay that way. In the case of PhysX on the CPU, there are no significant extra costs (and frankly supporting SSE is easier than x87 anyway). For Nvidia, decreasing the baseline CPU performance by using x87 instructions and a single thread makes GPUs look better. This tactic calls into question the CPU vs. GPU comparisons made using PhysX; but the name of the game at Nvidia is making the GPU look good, and PhysX certainly fits the bill in its current incarnation.
The bottom line is that Nvidia is free to hobble PhysX on the CPU by using single threaded x87 code if they wish. That choice, however, does not benefit developers or consumers, and casts substantial doubt on the purported performance advantages of running PhysX on a GPU rather than a CPU. There is already a large and contentious debate concerning the advantages of GPUs over CPUs, and PhysX is another piece of that puzzle, but one that seems to create questions rather than answers.