Knights Corner: Highly programmable while being optimized for power efficiency and highly parallel workloads.
The MIC architecture is specifically designed to provide the programmability of an SMP system while optimizing for power efficiency and highly parallel workloads. Knights Corner is the first product to use this architecture and deliver on this exciting vision. It’s an SMP on-a-chip, with the following key improvements:
We have scale and power benefits from using a small, simple micro-architecture. We extended an Intel® Pentium® processor design with 64-bit support and a few additional instructions (like CPUID), which are documented in Appendix B of the “Knights Corner Instruction Set Reference Manual.”
For high degrees of data parallelism, we added the Knights Corner vector capability not found in any other Intel processor.
We connect the many cores using a high performance on-chip interconnect design.
Knights Corner is designed to be a coprocessor that lives on the PCIe bus, and requires a host processor to boot the system.
This combination of Linux, 64-bit support, and new vector capabilities on an Intel® Pentium® processor-derived core means that Knights Corner is not completely binary compatible with any previous Intel processor. Because of its unique nature, you’ll see statements like this in our code: “Disclaimer: The codes contained in these modules may be specific to the Intel® Software Development Platform codenamed: Knights Ferry, and the Intel® product codenamed: Knights Corner, and are not backward compatible with other Intel® products. Additionally, Intel® makes no commitments for support of the code or instruction set in future products.” This notice speaks to low-level details and primarily affects tool vendors. Tools from Intel, GCC, and other vendors will support Knights Corner by following the Instruction Set Architecture (ISA) and Application Binary Interface (ABI) documents, so most developers can program at a completely portable level in their applications.
http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-micro-architecture-support/
http://www.brightsideofnews.com/news/2012/7/13/xeon-phi-lacks-binary-compatibility2c-breaks-amd64-conventions.aspx
To make matters worse, Intel breaks quite a few conventions related to AMD64. The x87 FPU has been deprecated by both AMD and Intel in favor of SSE2. At one point during the development of the 64-bit kernel of Windows XP, Microsoft even wanted to throw it out altogether, but kept it for backwards-compatibility reasons (it is still considered deprecated, meaning Microsoft does not recommend using it). AMD and Intel encourage developers to use the SSE (and now also AVX) instruction set extensions wherever possible. AMD mentions this in its software optimization guides for the K8 and up (page 83). Intel stresses on numerous occasions, in chapters 5 and 6 of its optimization reference manual, that SSE and AVX should be preferred over x87 whenever possible. Intel even states in its documentation (page 13): "All Intel® 64 architecture processors support Intel® SSE2." Larrabee invalidates that statement.
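For context on what that recommendation looks like in practice, here is a minimal, illustrative C sketch that does scalar math through SSE2 intrinsics rather than the x87 FPU (standard SSE2 header and intrinsics; note that Knights Corner itself does not execute SSE instructions, which is exactly the compatibility break being described):

    /* Minimal sketch: the same arithmetic expressed with SSE2 intrinsics
     * instead of the deprecated x87 FPU. On mainstream x86-64, compilers
     * already emit SSE2 for plain double math by default. */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        __m128d a = _mm_set1_pd(1.5);    /* {1.5, 1.5} in an XMM register */
        __m128d b = _mm_set1_pd(2.25);
        __m128d c = _mm_mul_pd(a, b);    /* SSE2 multiply, no x87 stack */

        double out[2];
        _mm_storeu_pd(out, c);
        printf("%f %f\n", out[0], out[1]);
        return 0;
    }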
Tailoring an application for Larrabee is not really less work than going the GPGPU route with CUDA or OpenCL. In either case, the code needs to be tailored to the specific hardware (even though OpenCL promises to be platform agnostic, in practice you need to optimize for specific architectures to get meaningful performance). Ignoring performance characteristics and availability, these technologies are roughly on an equal footing. If we factor in that CUDA, and to a lesser degree OpenCL, has a head start on the market, things start to look dire for Intel.
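As a small illustration of that point, consider a trivial kernel written in OpenCL C (a C dialect; this example is hypothetical): the source itself is portable, but the performance-critical decisions live outside it and must be re-made per architecture.

    /* Hypothetical OpenCL C kernel: the source compiles unchanged on a GPU,
     * a multicore CPU, or a MIC device, but the work-group size, vector
     * width, and memory layout chosen by the host code must be re-tuned
     * for each architecture to get meaningful performance. */
    __kernel void saxpy(const float a,
                        __global const float *x,
                        __global float *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }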
http://blogs.nvidia.com/2012/04/no-free-lunch-for-intel-mic-or-gpus/
Native Mode Complications
Functionally, a simple recompile may work, but I’m convinced it’s not practical for most HPC applications and doesn’t reflect the approach most people will need to take to get good performance on their MIC systems.
The idea of running flat MPI code (one rank per core) on a multi-node MIC system seems quite problematic. Whatever memory sits on the MIC PCIe card will be shared by more than 50 cores, leading to very small memory per core. From what I know of the MPI communication stack, that won’t leave much memory for the actual data – certainly far below the traditional 1-2 GB/core most HPC apps want. And 50+ cores all trying to send messages through the system interconnect NIC seems like a recipe for a network logjam. The other concern is the Amdahl’s Law bottleneck resulting from executing all the per-rank serial code on a lower-performance, Pentium-class scalar core.
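To put rough numbers on that concern (the card capacity and core count below are assumptions for illustration, not vendor specifications), a quick back-of-the-envelope calculation in C:

    /* Back-of-the-envelope sketch with assumed numbers: divide a coprocessor
     * card's memory evenly across one MPI rank per core and compare it with
     * the 1-2 GB/core that typical HPC applications expect. */
    #include <stdio.h>

    int main(void)
    {
        const double card_memory_gb = 8.0;  /* assumed on-card memory      */
        const int    ranks_per_card = 60;   /* flat MPI: one rank per core */

        double gb_per_rank = card_memory_gb / ranks_per_card;
        printf("~%.2f GB per rank, before the MPI stack takes its share\n",
               gb_per_rank);
        return 0;
    }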
The OpenMP approach seems only slightly better. You’d still have the very small per-core memory and the Amdahl’s Law bottleneck, but at least you’d have fewer threads trying to send messages out the NIC. Perhaps the biggest issue with this approach is that existing OpenMP codes, written for multi-core CPUs, are unlikely to have enough parallelism exposed to profitably occupy over 50 vector cores.
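To make that last point concrete, here is a hypothetical C/OpenMP sketch: a loop nest written with a handful of CPU cores in mind exposes too little parallelism for 50+ vector cores unless it is restructured, for example by collapsing the nest.

    #define NI 16     /* outer trip count sized for a multi-core CPU */
    #define NJ 1024

    /* Hypothetical sketch: parallelizing only the 16-iteration outer loop
     * leaves most of 50+ cores idle; collapse(2) merges the nest into one
     * 16*1024-iteration space so there is enough work to spread around
     * (vectorizing the inner loop also matters, and is not shown here). */
    void add_arrays(const double a[NI][NJ], const double b[NI][NJ],
                    double c[NI][NJ])
    {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                c[i][j] = a[i][j] + b[i][j];
    }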
No “Magic” Compiler
The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today’s programmers from the hard work of preparing their applications for the future.
With clock rates stalled, all future performance increases must come from increased parallelism, and power constraints will actually cause us to use simpler processors at lower clock rates for the majority of our work, further exacerbating this issue.
At the high end, an exaflop computer running at about 1 GHz will require approximately one billion-way parallelism, but the same logic will drive up required parallelism at all system scales. This means that all HPC codes will need to be cast as throughput problems with massive numbers of parallel threads. Exploiting locality will also become ever more important as the relative cost of data movement versus computation continues to rise. This will have a significant impact on the algorithms and data layouts used to solve many science problems, and is a fundamental issue not tied to any one processor architecture.