Matrix-vector multiplication is fundamental to a large number of algorithms in a variety of contexts, ranging from finite element modeling to image processing to machine learning. It's a simple computation to understand, but optimizing it can be instructive, since it requires attention to memory access patterns, data structures, and ALU utilization, all topics that come up in OpenCL™ programming in general.
Consider the following matrix-vector multiplication, shown in Figure 1:
Figure 1: Example Matrix Vector Multiplication
For each element of the result, we simply sum the element-wise product of the corresponding row from the matrix with the given vector.
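In code, this is just a row-by-row dot product. The following minimal C sketch (function and variable names are ours) shows the dense computation for an n x n matrix stored in row-major order:

    /* Dense matrix-vector multiply: y = A * x, with A stored row-major. */
    void matvec(int n, const float *A, const float *x, float *y)
    {
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < n; ++j)
                sum += A[i * n + j] * x[j];   /* one multiply-add per matrix element */
            y[i] = sum;
        }
    }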
Notice that most of the elements in the matrix we're multiplying are zero, so we can achieve better efficiency by treating it as a sparse matrix, avoiding lots of unnecessary multiplication by zero.
There are many kinds of sparse matrices, categorized by the structure of their non-zero elements. Here, we'll be looking at diagonal sparse matrices (often referred to as DIA), which can be represented efficiently as a set of vectors, one for each diagonal of the matrix that contains non-zero elements. Figure 2 shows the structure of a sample matrix derived from a regular 2-D grid, where every element of the grid is connected to its neighbors within a circular radius. The 2-D grid is flattened into a 1-D vector, and each non-zero entry in the matrix connects two elements of the grid - the elements represented by its row and column indices.
Figure 2: Sparse Matrix Derived From 2-D Grid
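As a concrete illustration (a made-up 4x4 example, not the matrix from the figure), a tridiagonal matrix has three non-zero diagonals, each identified by its offset from the main diagonal:

    /* Illustrative 4x4 tridiagonal matrix:
     *     [ 2 1 0 0 ]
     *     [ 1 2 1 0 ]
     *     [ 0 1 2 1 ]
     *     [ 0 0 1 2 ]
     * DIA representation: one offset per stored diagonal (column index minus
     * row index), plus the diagonal values, padded with a zero wherever the
     * diagonal falls outside the matrix. */
    const int   offsets[3]      = { -1, 0, +1 };   /* sub-, main, super-diagonal     */
    const float diagonals[3][4] = {
        { 0.0f, 1.0f, 1.0f, 1.0f },                /* offset -1: no entry for row 0  */
        { 2.0f, 2.0f, 2.0f, 2.0f },                /* offset  0: the main diagonal   */
        { 1.0f, 1.0f, 1.0f, 0.0f },                /* offset +1: no entry for row 3  */
    };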
OpenCL™ devices, whether from AMD, IBM, or others, prefer memory loads to be dense and aligned. This conserves memory bandwidth, which is often the most crucial resource we can optimize. To see why, let's look at what happens when we load data from memory.
Figure 3 illustrates a load from memory on an abstract processor. When we request a word of data, the processor loads several consecutive words from memory, which include the word we requested. The processor loads more words than we asked for because modern memory architectures provide peak bandwidth when they return several contiguous words. This works in our favor because data accesses are usually correlated in time and space, meaning that bringing in blocks of memory instead of just the word currently being accessed can serve as a prefetch, lowering overall effective memory latency. We'll call the set of words the processor transfers for a single memory load a cache line. In Figure 3, when we request word 0 from memory, the processor ends up transferring words 0-3 from off-chip memory.
Figure 3: Cache Lines
Due to the way memory is accessed, poorly structured memory access patterns can greatly decrease performance, especially for highly-parallel processors with large amounts of computational resources, like AMD GPUs. As Figure 4 shows, sparse memory accesses waste bandwidth, since the unused words the processor loads are immediately discarded, and unaligned accesses waste bandwidth because the processor must load twice as many words as necessary.
Figure 4: Sparse and Misaligned Memory Accesses
AMD CPUs such as the AMD Phenom II X4 965 CPU have a 64-byte cache line. AMD GPUs such as the ATI Radeon HD 5870 GPU have the equivalent of a 128-byte cache line. For optimal performance on both architectures, it's best to have 128-byte aligned data structures. We will examine the implications of this for diagonal sparse matrix-vector multiplication.
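In practice, this means padding each row of our data structures so that it starts on a 128-byte boundary. For 32-bit floats, that means rounding the row length up to a multiple of 32 elements (32 x 4 bytes = 128 bytes); a small helper like the following (the name is ours) captures the idea:

    /* Round a row length up so that, given a 128-byte-aligned base pointer,
     * every row of 4-byte float elements starts on a 128-byte boundary. */
    int padded_row_length(int n)
    {
        const int elems_per_line = 128 / sizeof(float);   /* 32 floats per 128 bytes */
        return (n + elems_per_line - 1) / elems_per_line * elems_per_line;
    }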
Now that we understand the need for aligned, dense memory accesses, let's design a data structure for DIA sparse matrices. To do this, first note that if we consider the matrix as a set of dense vectors, one for each diagonal of the matrix with non-zeros, we will have converted a sparse data structure into a dense data structure. For example, one such structure for our example matrix is shown in Figure 5. If you compare it with the original matrix, you'll see that we've simply taken each of the diagonals and flattened them into a vector, then stacked the vectors to create a dense matrix representation. We've also produced an array of offsets, with one entry per diagonal recording that diagonal's position; these offsets are used to index the matrix and the vector.
Figure 5: Flattened diagonals + Offset array
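Building this structure on the host is straightforward. The sketch below builds one such structure; we assume here that each flattened diagonal stores its entries at their column positions (which, as we'll see, matters for memory alignment). The function name and argument order are ours:

    /* Hypothetical host-side construction of the flattened-diagonal structure
     * from a dense n x n matrix A (row-major).  Each stored diagonal becomes
     * one row of 'data', with entries placed at their column positions, and
     * offsets[d] records (column - row) for diagonal d. */
    void build_dia(int n, const float *A,
                   int num_diags, const int *offsets,
                   float *data /* num_diags * n floats, zero-initialized */)
    {
        for (int d = 0; d < num_diags; ++d)
            for (int row = 0; row < n; ++row) {
                int col = row + offsets[d];
                if (col >= 0 && col < n)
                    data[d * n + col] = A[row * n + col];
            }
    }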
Let's consider how this structure performs when we write an OpenCL™ C kernel to do the multiplication. Parallelization of this computation is trivial: the computation of each element of the output vector is independent, so we simply instantiate a work-item for each output element. Each work-item then iterates through the set of diagonals, reading from the flattened diagonals and the vector at each iteration and accumulating the product. Since we know the work-items are executed in SIMD fashion on AMD GPUs, we can see that memory accesses from the work-items to the matrix will be dense, but they will not be aligned. A sketch of such a kernel is shown below.
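This is a minimal sketch of that kernel, assuming the column-position layout described above; the kernel name and argument order are ours, not taken from the original code:

    // Hypothetical naive DIA SpMV kernel: one work-item per output row.
    // Assumes each flattened diagonal occupies n consecutive floats in 'data',
    // with entries stored at their column positions.
    __kernel void dia_spmv_naive(int n,                         // matrix dimension
                                 int num_diags,                 // number of stored diagonals
                                 __global const int   *offsets, // diagonal offsets (column - row)
                                 __global const float *data,    // flattened diagonals, num_diags * n
                                 __global const float *x,       // input vector
                                 __global       float *y)       // output vector
    {
        int row = get_global_id(0);
        if (row >= n) return;

        float sum = 0.0f;
        for (int d = 0; d < num_diags; ++d) {
            int col = row + offsets[d];            // column this diagonal touches in this row
            if (col >= 0 && col < n)
                sum += data[d * n + col] * x[col]; // dense, but not aligned across diagonals
        }
        y[row] = sum;
    }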
We can see why the matrix accesses are misaligned: because of the way we've flattened the diagonals, the position at which the work-items start reading differs from one flattened diagonal to the next. Figure 6 shows an alternative way of flattening the diagonals to ensure aligned memory accesses from the matrix, given our parallelization.
Figure 6: Matrix Representation For Aligned Access
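With a layout like the one in Figure 6, where the entry of diagonal d needed by output row i is stored at position i within that diagonal's row of the flattened structure, only the matrix indexing in the kernel changes. The inner loop of the sketch above would become:

    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < n)
            sum += data[d * n + row] * x[col];   // consecutive work-items now read
                                                 // consecutive, row-aligned addresses
    }

Assuming each diagonal's row is padded to a multiple of the cache-line size, every group of work-items now starts its matrix load on an aligned boundary.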
Accesses to the vector are still unaligned, but that is unavoidable: the alignment depends on the placement of the original diagonals in the matrix, which arises from the problem we're trying to solve and is therefore much more difficult to change.
We benchmarked our SpMV code on a system with an AMD Phenom II X4 965 CPU, running at 3.4 GHz, with 21.3 GB/s of peak memory bandwidth, coupled to an ATI Radeon HD 5870 GPU, clocked at 850 MHz, with 153.6 GB/s of peak memory bandwidth, on an MSI K9A2 Platinum motherboard with 8 GB of DRAM. The system was running Ubuntu Linux® 9.10 (64-bit), the ATI Catalyst™ 10.4 Driver Suite, and the ATI Stream SDK v2.1.
Figure 7 shows the performance we achieved across a range of matrix sizes using this basic implementation. For context, a dual-socket, quad-core system using the AMD Opteron™ 2356 CPU, which has the same amount of peak memory bandwidth as our AMD Phenom II X4 965 CPU, achieves 3.5 DP GFLOP/s on similar matrices [1]. To make a fair comparison, we note that we are examining single precision floating-point, which saves us about a factor of 1.5 in performance, assuming we're bandwidth limited and have perfect caching of the vector. Additionally, we're using the specialized DIA matrix format, as opposed to the general CSR format used in [1], which saves us about another factor of 2, again assuming perfect caching for the vector and the offset array. A rough estimate for optimized CPU performance with our setup would then be about 10.5 SP GFLOP/s.
Our unoptimized GPU implementation reaches 13.1 SP GFLOP/s, which is respectable in comparison.
Sparse matrix-vector multiplication is simple enough that we can compute some bounds on performance to figure out how much performance headroom remains with our setup. First, we note that for every 2 floating-point operations in the inner-most loop, we have to load 12 bytes of data from off-chip memory: 4 bytes for the integer offset of the diagonal we're computing, 4 bytes for the entry in the matrix, and 4 bytes for the entry in the vector. The ATI Radeon HD 5870 GPU has 153.6 GB/s of memory bandwidth, so if we're perfectly bandwidth bound, we should be able to sustain 2 FLOP / 12 bytes * 153.6 GB/s = 25.6 GFLOP/s. This implementation is getting about 51% of our bound. Maybe we can improve on that.
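The same estimate, written out as a tiny program (numbers taken from the text above):

    #include <stdio.h>

    int main(void)
    {
        /* 2 FLOPs per 12 bytes loaded: a 4-byte diagonal offset, a 4-byte matrix
         * entry, and a 4-byte vector entry are read in each inner-loop iteration. */
        double flops_per_byte = 2.0 / 12.0;
        double peak_bw_gbs    = 153.6;                        /* ATI Radeon HD 5870 */
        double bound_gflops   = flops_per_byte * peak_bw_gbs; /* = 25.6 SP GFLOP/s  */
        printf("bound: %.1f GFLOP/s, achieved: %.0f%%\n",
               bound_gflops, 100.0 * 13.1 / bound_gflops);
        return 0;
    }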
Figure 7: Initial ATI Radeon HD 5870 GPU Performance
Figure 11: ATI Radeon HD 5870 GPU Performance Using Images
During the optimization process, we were able to improve the bound twice, by eliminating memory accesses to data structures that are relatively small and frequently used, but it's not possible to eliminate the final memory access, since each element of the matrix A is only used once, and the matrix itself is too large to fit into any structure we could use as an on-chip cache. At this point, we can be reasonably confident that our code is close to optimal. There may be other optimizations, such as loop unrolling, that we could try to apply, but since the compulsory loads from the matrix A will still need to be performed, such optimizations will only yield marginal benefits.
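For illustration, one way a small, frequently reused structure such as the offsets array can be taken out of the off-chip access stream is to place it in __constant memory, where it is served from the on-chip constant cache. This is a sketch of that idea only, not necessarily the exact approach used for the optimized results above:

    // Hypothetical variant: offsets live in __constant memory, so every work-item's
    // read of offsets[d] is served on-chip.  Matrix indexing uses the aligned layout.
    __kernel void dia_spmv_const_offsets(int n,
                                         int num_diags,
                                         __constant int      *offsets,
                                         __global const float *data,
                                         __global const float *x,
                                         __global       float *y)
    {
        int row = get_global_id(0);
        if (row >= n) return;

        float sum = 0.0f;
        for (int d = 0; d < num_diags; ++d) {
            int col = row + offsets[d];
            if (col >= 0 && col < n)
                sum += data[d * n + row] * x[col];
        }
        y[row] = sum;
    }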