Penryn's Enhanced Core Architecture
The quad-core version of Penryn contains 820 million transistors (Kentsfield has 582 million) in two very small dies of 107 mm² each. That makes each new die 25 percent smaller than the 143 mm² die used in Intel's current 65nm quad-core.
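To check the 25 percent figure, here is a quick back-of-the-envelope calculation. It assumes both area figures are per die and that each quad-core package holds two dies; the transistor density comparison is derived from the quoted numbers and is not something Intel stated.

```python
# Figures quoted in the article.
PENRYN_DIE_MM2 = 107.0          # per die, 45nm
KENTSFIELD_DIE_MM2 = 143.0      # per die, 65nm (assumed to be a per-die figure)
PENRYN_TRANSISTORS_M = 820.0    # millions, whole quad-core package
KENTSFIELD_TRANSISTORS_M = 582.0
DIES_PER_PACKAGE = 2            # assumption: two dual-core dies per quad-core


def area_reduction(new_mm2: float, old_mm2: float) -> float:
    """Fractional area saved when going from old_mm2 to new_mm2."""
    return 1.0 - new_mm2 / old_mm2


def density(transistors_m: float, die_mm2: float) -> float:
    """Millions of transistors per mm^2 across the whole package."""
    return transistors_m / (DIES_PER_PACKAGE * die_mm2)


reduction = area_reduction(PENRYN_DIE_MM2, KENTSFIELD_DIE_MM2)
density_ratio = (density(PENRYN_TRANSISTORS_M, PENRYN_DIE_MM2)
                 / density(KENTSFIELD_TRANSISTORS_M, KENTSFIELD_DIE_MM2))

print(f"Die area reduction: {reduction:.1%}")
print(f"Transistor density ratio (Penryn / Kentsfield): {density_ratio:.2f}x")
```

Under these assumptions, the area reduction comes out at roughly a quarter, while the density nearly doubles because the smaller dies also carry more transistors.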
The new Penryn CPU also brings yet another addition to the x86 ISA: Intel Streaming SIMD Extensions 4 (SSE4) instructions. It has also been confirmed that Penryn will deliver higher IPC and higher clock speeds. Intel would not commit to anything beyond "more than 3 GHz", but considering that the FSB is bumped up to 1600 MHz, 3.2 GHz is likely. However, several Intel people confirmed that if necessary ("depending on what the competition does"), the 45nm CPUs can go quite a bit higher (3.6 GHz is probably a safe estimate, considering how far current Core 2 CPUs are able to overclock).
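The 3.2 GHz guess follows from how the front-side bus is clocked: the quoted FSB figure is a quad-pumped transfer rate, so a 1600 MHz FSB runs on a 400 MHz base clock, and the core clock is that base clock times a multiplier. A rough sketch of the arithmetic follows; the multipliers (8 and 9) are assumptions for illustration, not figures Intel confirmed.

```python
FSB_PUMP_FACTOR = 4  # Intel's FSB transfers data four times per base clock


def core_clock_mhz(fsb_rate_mhz: float, multiplier: float) -> float:
    """Core clock in MHz from the quoted (quad-pumped) FSB rate and a multiplier."""
    base_clock = fsb_rate_mhz / FSB_PUMP_FACTOR
    return base_clock * multiplier


# Hypothetical multipliers chosen to match the clocks discussed in the text.
for fsb, mult in [(1333, 9), (1600, 8), (1600, 9)]:
    print(f"FSB {fsb} MHz x{mult}: {core_clock_mhz(fsb, mult) / 1000:.2f} GHz")
```

With a 400 MHz base clock, an 8x multiplier lands exactly on 3.2 GHz and a 9x multiplier on 3.6 GHz, which is consistent with both estimates above.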
With regard to power, Intel will be introducing what it is calling "Deep Power Down Technology", a new lower power state named C6. The new C6 state reduces core voltage to the absolute minimum for the given process technology, shuts down the core clock, and turns off all of the caches. It is the absolute lowest power state that can be attained and will be introduced on Mobile Penryn family processors.
Penryn family processors are supposed to be socket-compatible, meaning that on the desktop we will see them introduced as LGA-775 CPUs. We'd expect that Intel's new lineup of chipsets will be required, but we are not sure if the new chipsets will support the 1600MHz FSB out of the box or if a refresh will be required.
Penryn-based processors also have a much better divider unit, roughly doubling the divider speed using a faster divide technique called Radix 16. Also, the shuffle engine has been improved. Intel's "Super Shuffle Engine" is a 128-bit, single-pass shuffle unit that can perform full-width shuffles in a single cycle, improving performance for SSE2, SSE3 and SSE4 instructions that have shuffle-like operations such as pack, unpack and wider packed shifts.
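To see why a higher radix speeds up division, consider a digit-recurrence divider: each iteration produces one quotient digit, and a radix-16 digit carries four bits where a radix-4 digit carries two. The sketch below is a software model only. It assumes the older divider was radix 4 (the article does not say), and it picks each quotient digit with plain integer division, whereas real hardware uses lookup tables and approximations.

```python
def divide_radix(dividend: int, divisor: int, bits_per_step: int, width: int = 32):
    """Unsigned long division that retires `bits_per_step` quotient bits per step.

    Returns (quotient, remainder, steps). bits_per_step=2 models radix 4,
    bits_per_step=4 models radix 16.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")

    digit_mask = (1 << bits_per_step) - 1
    quotient = remainder = steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & digit_mask)
        # Select one quotient digit in the range 0 .. radix-1.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps


q4, r4, steps4 = divide_radix(1_000_003, 97, bits_per_step=2)    # radix 4
q16, r16, steps16 = divide_radix(1_000_003, 97, bits_per_step=4)  # radix 16
print(f"radix 4:  {steps4} iterations, radix 16: {steps16} iterations")
```

Both runs give the same quotient and remainder, but the radix-16 version needs half as many iterations for the same operand width, which matches the "roughly doubling" claim if each iteration costs about the same.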
The last improvement is the "Split Load Cache Enhancement", which lowers the penalty for loads that straddle a cache-line boundary. Such split loads seem to occur in some SSE-intensive imaging applications.
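A load is "split" when its bytes span two cache lines. As a small illustration, the helper below checks whether an access of a given size crosses a line boundary, assuming 64-byte cache lines; the function name and the sample addresses are made up for this example.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def crosses_cache_line(address: int, size: int, line: int = CACHE_LINE_BYTES) -> bool:
    """True if the access [address, address + size) touches two cache lines."""
    first_line = address // line
    last_line = (address + size - 1) // line
    return first_line != last_line


# A 16-byte (128-bit SSE) load at every possible offset within one line.
split_offsets = [off for off in range(CACHE_LINE_BYTES) if crosses_cache_line(off, 16)]
print(f"{len(split_offsets)} of {CACHE_LINE_BYTES} offsets cause a split load")
print(crosses_cache_line(0x1000, 16), crosses_cache_line(0x1038, 16))
```

Aligned 16-byte loads never split, but unaligned 128-bit loads can, which is why code that walks image data at arbitrary byte offsets would be the kind of workload that benefits.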
The quad-core desktop and quad-core Xeon products will have TDPs of 120W, 80W and 50W (LV), just like today. The dual-core products will get 40W, 65W and 80W TDPs.
Better Virtualization
Intel's hardware support for virtualization in the current Core architecture is lackluster, to say the least. To understand why, you must understand what happens in a "pure" software-based virtualization solution such as VMware ESX 2.5.3 running on older Intel CPUs.
A technique called "ring deprivileging" is used because the guest OS cannot be allowed to run in ring 0, the most privileged ring, where it normally runs; the Virtual Machine Manager (VMM) or hypervisor now runs there. That means that every time a guest application asks the guest OS for help, and the guest OS needs to run instructions that are only available in ring 0, the VMM must intercept that "SYSENTER" and emulate the normal execution. This is quite costly in performance terms.
Hardware assisted virtualization does not have that problem: both the OS and the VMM have their own ring 0. Despite this, Intel's HW assisted solutions didn't give any speed boost. It has not been discussed in detail, but Penryn speeds up virtual machine transition (entry/exit) times by 25% to 75%, and this requires no virtual machine software changes. This might be similar to AMD's nested page technology, although we don't have any clear details at present.
Last but not least, the dual core Penryn processors get a 6 MB shared cache and the quad versions get 12 MB cache. Both new designs will also come with a "higher degree of associativity". Considering the current designs are 16-way set associative, most likely the newer chips will feature a 24-way set associative L2 cache.
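The 24-way guess can be sanity-checked by counting cache sets. A set-associative cache has size / (ways × line size) sets, and designs normally keep the set count a power of two so that the index is a simple bit field. The sketch assumes 64-byte lines and a 4 MB, 16-way L2 on the current dual-core dies; neither number comes from Intel's Penryn disclosure.

```python
LINE_BYTES = 64  # assumed cache line size
MB = 1024 * 1024


def num_sets(cache_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache."""
    return cache_bytes // (ways * line_bytes)


def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


for label, size, ways in [("4 MB, 16-way", 4 * MB, 16),
                          ("6 MB, 16-way", 6 * MB, 16),
                          ("6 MB, 24-way", 6 * MB, 24)]:
    sets = num_sets(size, ways)
    print(f"{label}: {sets} sets, power of two: {is_power_of_two(sets)}")
```

Growing the cache from 4 MB to 6 MB while keeping 16 ways would give an awkward set count, whereas adding eight ways keeps the same number of sets as before, so the extra 2 MB can be seen as extra ways on an unchanged index.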
Intel EDAT: the End of the Multi-core Clock Speed Disadvantage?
Intel also talked about its "Enhanced Dynamic Acceleration Technology" which is effectively integrated overclocking based on load. If you are running a single threaded application (or a multi-threaded application that's predominantly using a single thread), Intel's EDAT can power down the second core and increase the frequency of the working core to maintain the same thermal envelope at all times.
Intel's EDAT could spell the end of the clock speed differential between single and multi-core processors. With all cores running workloads, the multi-core system would be clocked lower, but when some cores are idle the chip could potentially run at the same speed as a single core solution would. Single core designs have pretty much disappeared from roadmaps already, but considering there are still applications that are single threaded in nature and benefit more from clock speed improvements, future processors will offer both options in a single package.
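To get a feel for how much headroom an idle core could free up, here is a deliberately crude toy model. It assumes that an idle core draws no power, that voltage scales linearly with frequency so per-core power grows with the cube of the clock, and that the package power budget is fixed. None of these assumptions come from Intel, and the result is at best an optimistic upper bound, not a prediction of actual EDAT clocks.

```python
def boosted_frequency(base_ghz: float, total_cores: int, active_cores: int,
                      power_exponent: float = 3.0) -> float:
    """Clock each active core could reach within a fixed package power budget.

    Toy model: power per core ~ frequency ** power_exponent, idle cores draw
    nothing, and the budget equals all cores running at base_ghz.
    """
    if not 0 < active_cores <= total_cores:
        raise ValueError("active_cores must be between 1 and total_cores")
    power_per_active_core = total_cores / active_cores  # relative to base
    return base_ghz * power_per_active_core ** (1.0 / power_exponent)


base = 2.0  # hypothetical base clock in GHz
both_busy = boosted_frequency(base, total_cores=2, active_cores=2)
one_busy = boosted_frequency(base, total_cores=2, active_cores=1)
print(f"both cores busy: {both_busy:.2f} GHz, one core busy: {one_busy:.2f} GHz")
```

Even in this idealized model, doubling the power available to one core buys only about a quarter more clock speed, because power rises much faster than frequency.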
Performance
Intel hasn't revealed too much about the performance of Penryn but Pat did leave us with a few comments. We don't know anything more about the test conditions than what we are presenting, and we didn't do the measurements ourselves, so take it for what it's worth.
Comparing a 3.2GHz Penryn (1.6GHz FSB) to a 3.0GHz Conroe (1.33GHz FSB), Intel has measured more than 20% increase in gaming performance (with no code changes). For video encoding applications, if SSE4 is utilized, the same Penryn vs. Conroe comparison can offer more than a 40% increase in performance.
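Because the two chips in Intel's comparison run at different clocks, part of the quoted gain is simply frequency. A rough normalization is sketched below. It treats the "more than" figures as lower bounds, assumes performance scales linearly with core clock, and makes no attempt to separate out the faster FSB, so the per-clock numbers are only indicative.

```python
PENRYN_GHZ = 3.2
CONROE_GHZ = 3.0


def per_clock_gain(total_gain: float, new_ghz: float, old_ghz: float) -> float:
    """Speedup left over after dividing out the clock speed ratio.

    total_gain is a fraction (0.20 means 20% faster overall).
    """
    clock_ratio = new_ghz / old_ghz
    return (1.0 + total_gain) / clock_ratio - 1.0


gaming = per_clock_gain(0.20, PENRYN_GHZ, CONROE_GHZ)
encoding_sse4 = per_clock_gain(0.40, PENRYN_GHZ, CONROE_GHZ)
print(f"clock advantage: {PENRYN_GHZ / CONROE_GHZ - 1:.1%}")
print(f"gaming, per clock: at least {gaming:.1%}")
print(f"SSE4 video encoding, per clock: at least {encoding_sse4:.1%}")
```

The clock difference accounts for under 7 percentage points, so if Intel's numbers hold up, most of the claimed improvement would have to come from the architecture, the larger cache, the faster bus, or SSE4.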
Finally, Intel mentioned that in the server space, the fastest quad-core Penryn available (>3GHz) vs. a 2.67GHz quad-core Xeon resulted in a greater than 45% increase in performance in "bandwidth and FP intensive applications". It's incredibly vague (and oddly similar to AMD's claims of Barcelona vs. Xeon performance), but Pat mentioned that STREAM and certain benchmarks in SPECfp could be considered to be "bandwidth and FP intensive".
Again, we are just reporting what Intel told us. It will be a while before we can actually verify any of these claims or put them in the right context. Given the various enhancements that we've reported on, however, it's only reasonable to expect Penryn to be faster than Conroe, clock-for-clock. Whether that's 10% faster, 20% faster, or something else will be made clear in the future.