Views: 6143 | Replies: 9

A look at AMD's official OpenCL diagonal sparse matrix vector multiplication article

1#
Posted 2010-05-14 08:55
http://developer.amd.com/Members ... umentation/articles

Diagonal Sparse Matrix Vector Multiplication

Matrix vector multiplication is fundamental to a large number of algorithms in a variety of contexts, ranging from finite element modeling to image processing to machine learning. It's a simple computation to understand, but optimizing matrix vector multiplication can be instructive, since it requires attention to memory access patterns, data structures, and ALU utilization, all topics that come up in OpenCL™ programming in general.


Consider the following matrix vector multiplication, shown in figure 1:



Figure 1: Example Matrix Vector Multiplication


For each element of the result, we simply sum the element-wise product of the corresponding row from the matrix with the given vector.
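In plain C, that definition is just two nested loops. Here is a dense reference sketch of our own (not code from the article), with A stored row-major:

/* Dense reference: y = A * x for a rows-by-cols matrix A, row-major. */
void dense_mv(const float *A, int rows, int cols,
              const float *x, float *y) {
  for (int row = 0; row < rows; row++) {
    float sum = 0.0f;
    for (int col = 0; col < cols; col++) {
      sum += A[row * cols + col] * x[col];  /* multiplies the zeros too */
    }
    y[row] = sum;
  }
}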

Notice that most of the elements in the matrix we're multiplying are zero, so we can achieve better efficiency by treating it as a sparse matrix, avoiding lots of unnecessary multiplication by zero.


There are many kinds of sparse matrices, categorized by the structure of their non-zero elements. Here, we'll be looking at diagonal sparse matrices (often referred to as DIA), which are efficiently represented as a set of vectors representing the diagonals of the matrix with non-zero elements. Figure 2 shows the structure of a sample matrix derived from a regular 2-D grid, where every element of the grid is connected to its neighbors within a circular radius. The 2-D grid is flattened into a 1-D vector, and each non-zero entry in the matrix connects two elements of the grid - the elements represented by its row and column indices.



Figure 2: Sparse Matrix Derived From 2-D Grid



Looking at the structure of the matrix, you can see multiple bands of diagonals, each with a different width. The different widths of the various diagonal bands arise from the rasterized circular neighborhood pattern which created this matrix; if the neighborhood pattern were rectangular, each of the diagonal bands would be the same width. In general, diagonal sparse matrices are useful for representing relationships between elements on regular grids. In this article, we'll be using matrices derived from an image segmentation problem, which is naturally expressed on a 2-D grid. Every element is related to all neighbors within a 5-element radius, which leads to a matrix with 11 bands of diagonals, 81 diagonals in total.
2#
(OP) Posted 2010-05-14 08:57
Memory Access Patterns

OpenCL™ devices, whether from AMD, IBM, or others, prefer memory loads to be dense and aligned. This conserves memory bandwidth, which is often the most crucial resource we can optimize. To see why, let's look at what happens when we load data from memory.

Figure 3 illustrates a load from memory on an abstract processor. When we request a word of data, the processor loads several consecutive words from memory, including the word we requested. The processor loads more words than we asked for because modern memory architectures deliver peak bandwidth when they return several contiguous words. This is fortuitous, because data accesses are usually correlated in time and space: bringing in a block of memory instead of just the word currently being accessed acts as a prefetch, lowering overall effective memory latency. We'll call the set of words the processor loads for a single memory access a cache line. In Figure 3, when we request word 0 from memory, the processor ends up transferring words 0-3 from off-chip memory.



Figure 3: Cache Lines


Due to the way memory is accessed, poorly structured memory access patterns can greatly decrease performance, especially for highly-parallel processors with large amounts of computational resources, like AMD GPUs. As Figure 4 shows, sparse memory accesses waste bandwidth, since the unused words the processor loads are immediately discarded, and unaligned accesses waste bandwidth because the processor must load twice as many words as necessary.



Figure 4: Sparse and Misaligned Memory Accesses


AMD CPUs such as the AMD Phenom II X4 965 CPU have a 64-byte cache line. AMD GPUs such as the ATI Radeon HD 5870 GPU have the equivalent of a 128-byte cache line. For optimal performance on both architectures, it's best to have 128-byte aligned data structures. We will examine the implications of these for Diagonal Sparse Matrix Vector Multiplication.
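As a concrete illustration (our own sketch, not from the article), one way to get 128-byte alignment for the flattened diagonals described in the next section is to round the per-diagonal stride up to a multiple of 32 floats:

/* Round a row length up to a multiple of 32 floats (32 * 4 B = 128 B),
   so each flattened diagonal starts on a 128-byte boundary. */
int padded_stride(int rows) {
  return (rows + 31) & ~31;
}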


3#
(OP) Posted 2010-05-14 08:58
Data Structure

Now that we understand the need for aligned, dense memory accesses, let's design a data structure for DIA sparse matrices. To do this, first note that if we consider the matrix as a set of dense vectors, one for each diagonal of the matrix with non-zeros, we will have converted a sparse data structure into a dense one. One such structure for our example matrix is shown in Figure 5. If you compare it with the original matrix, you'll see that we've simply taken each of the diagonals, flattened them into vectors, and stacked the vectors to create a dense matrix representation. We've also produced an array of offsets, with an entry for each diagonal, recording that diagonal's position; it is used to index into the matrix and the vector.



Figure 5: Flattened diagonals + Offset array


Let's consider how this structure performs when we write an OpenCL™ C kernel to do the multiplication. Parallelization of this computation is trivial: the computation for each element of the output vector is independent, so we simply instantiate a work-item per output element. Each work-item then iterates through the set of diagonals, reading from the flattened diagonals and the vector at each iteration and accumulating the product. Since we know the work-items are executed in SIMD fashion on AMD GPUs, we can see that memory accesses from the work-items to the matrix will be dense, but they will not be aligned.


We can see this because the position at which each work-item will read from the diagonals will change for each row of this flattened set of diagonals, due to the way we've flattened them. Figure 6 shows an alternative way of flattening the diagonals to ensure aligned memory accesses from the matrix, given our parallelization.



Figure 6: Matrix Representation For Aligned Access


Accesses to the vector are still unaligned, but that is unavoidable, since the alignment depends on the placement of the original diagonals in the matrix, which arises from the problem we're trying to solve, and is, therefore, much more difficult to change.
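To make the layout concrete, here is an illustrative builder (our own sketch, not code from the article) that packs a dense square matrix into the aligned representation of Figure 6, padding each diagonal to a full row of length rows so that work-item row reads A_dia[diag * rows + row]:

/* Pack a dense rows-by-rows matrix into the aligned DIA layout.
   offsets[diag] is the column offset of each non-zero diagonal. */
void dense_to_dia(const float *dense, int rows,
                  const int *offsets, int diags, float *A_dia) {
  for (int diag = 0; diag < diags; diag++) {
    for (int row = 0; row < rows; row++) {
      int col = row + offsets[diag];
      A_dia[diag * rows + row] =
          (col >= 0 && col < rows) ? dense[row * rows + col] : 0.0f;
    }
  }
}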


4#
(OP) Posted 2010-05-14 09:00
Serial Version of DIA Sparse Matrix Vector Multiply


After choosing the data structure to maximize our bandwidth usage, we can write a simple C kernel to prototype the computation we'll be performing.
void dia_spmv(float* A, int rows, int diags, int* offsets,
              float* x, float* y) {
  for (int row = 0; row < rows; row++) {      /* one output element per row */
    float accumulator = 0;
    for (int diag = 0; diag < diags; diag++) {
      int col = row + offsets[diag];          /* column this diagonal hits */
      if ((col >= 0) && (col < rows)) {       /* skip entries outside the matrix */
        accumulator += A[diag * rows + row] * x[col];
      }
    }
    y[row] = accumulator;
  }
}

The kernel is fairly simple, and all the iterations of the outer loop are independent, as we noted earlier. When translating this code to OpenCL™, we are going to instantiate a work-item for every iteration of the outermost loop, and then have each work-item perform the inner loop.

5#
(OP) Posted 2010-05-14 09:02
Initial OpenCL™ C Kernel

The translation to OpenCL™ C is then straightforward.
__kernel
void dia_spmv(__global float *A, __const int rows,
              __const int diags, __global int *offsets,
              __global float *x, __global float *y) {
  int row = get_global_id(0);        /* one work-item per output element */
  float accumulator = 0;
  for (int diag = 0; diag < diags; diag++) {
    int col = row + offsets[diag];
    if ((col >= 0) && (col < rows)) {
      float m = A[diag * rows + row];  /* aligned, dense matrix access */
      float v = x[col];                /* unaligned vector access */
      accumulator += m * v;
    }
  }
  y[row] = accumulator;
}
The OpenCL™ C code is fairly similar to the original C version, with just the addition of a few OpenCL™ C qualifiers, and with the outermost loop replaced by a parallel kernel.
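For completeness, here is a minimal host-side launch sketch (our own; error checks omitted, and it assumes the program object and the buffers d_A, d_offsets, d_x, d_y have already been created and filled):

#include <CL/cl.h>

void launch_dia_spmv(cl_command_queue queue, cl_program program,
                     cl_mem d_A, cl_int rows, cl_int diags,
                     cl_mem d_offsets, cl_mem d_x, cl_mem d_y) {
  cl_kernel kernel = clCreateKernel(program, "dia_spmv", NULL);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
  clSetKernelArg(kernel, 1, sizeof(cl_int), &rows);
  clSetKernelArg(kernel, 2, sizeof(cl_int), &diags);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), &d_offsets);
  clSetKernelArg(kernel, 4, sizeof(cl_mem), &d_x);
  clSetKernelArg(kernel, 5, sizeof(cl_mem), &d_y);
  size_t global = (size_t)rows;  /* one work-item per output element */
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                         0, NULL, NULL);
  clReleaseKernel(kernel);
}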

6#
(OP) Posted 2010-05-14 09:03
Initial Performance on ATI Radeon HD 5870 GPU

We benchmarked our SpMV code on a system with an AMD Phenom II X4 965 CPU, running at 3.4 GHz, with 21.3 GB/s of peak memory bandwidth, coupled to an ATI Radeon HD 5870 GPU, clocked at 850 MHz, with 153.6 GB/s of peak memory bandwidth on an MSI K9A2 Platinum motherboard with 8GB of DRAM. The system was running Ubuntu Linux® 9.10 (64-bit), the ATI Catalyst™ 10.4 Driver Suite and the ATI Stream SDK v2.1.


Figure 7 shows the performance we achieved across a range of matrix sizes using this basic implementation. For context, a dual-socket, quad-core system using the AMD Opteron™ 2356 CPU, which has the same amount of peak memory bandwidth as our AMD Phenom II X4 965 CPU, achieves 3.5 DP GFLOP/s on similar matrices [1]. To make a fair comparison, we note that we are examining single precision floating-point, which saves us about a factor of 1.5 in performance, assuming we're bandwidth limited and have perfect caching of the vector. Additionally, we're using the specialized DIA matrix format, as opposed to the general CSR format used in [1], which saves us about another factor of 2, again assuming perfect caching for the vector and the offset array. A rough estimate for optimized CPU performance with our setup would then be about 10.5 SP GFLOP/s.


Our unoptimized GPU implementation reaches 13.1 SP GFLOP/s, which is respectable in comparison.


Sparse matrix vector multiplication is simple enough that we can compute some bounds on performance to figure out how much performance headroom remains with our setup. First, we note that for every 2 floating-point operations in the inner-most loop, we have to load 12 bytes of data from off-chip memory: 4 bytes for the integer offset of the diagonal we're computing, 4 bytes for the entry in the matrix, and 4 bytes for the entry in the vector. The ATI Radeon HD 5870 GPU has 153.6 GB/s of memory bandwidth, so if we're perfectly bandwidth bound, we should be able to sustain 2 FLOP/12 Bytes * 153.6 GBytes/s = 25.6 GFLOP/s. This implementation is getting about 51% of our bound. Maybe we can improve on that.



Figure 7: Initial ATI Radeon HD 5870 GPU Performance


7#
(OP) Posted 2010-05-14 09:19

The full article is long, so to wrap up, here are the results after the various optimization measures were applied:

Performance of OpenCL™ C Kernel Using Images

Figure 11 shows the resulting performance, which improves by another 76%, to reach an average of 49.2 GFLOP/s. Since we moved accesses of the vector to use the image hardware, we no longer need to perform so many accesses of the vector from off-chip memory. Assuming perfect caching of the vector, we need only read one float from memory for every 2 floating-point operations. This leads us to a bound of 2 FLOP/4 Bytes * 153.6 GB/s = 76.8 GFLOP/s, putting our final code variant at 64% of its bound.



Figure 11: ATI Radeon HD 5870 GPU Performance Using Images



During the optimization process, we were able to improve the bound twice, by eliminating memory accesses to data structures that are relatively small and frequently used, but it's not possible to eliminate the final memory access, since each element of the matrix A is only used once, and the matrix itself is too large to fit into any structure we could use as an on-chip cache. At this point, we can be reasonably confident that our code is close to optimal. There may be other optimizations, like loop unrolling, etc., that we could try to apply, but since the compulsory loads from the matrix A will still need to be performed, such optimizations will only yield marginal benefits.
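The thread doesn't reproduce the article's image-based kernel, but a minimal sketch of the idea might look like the following; the kernel name, the x_width argument, and the assumption that the vector is packed row-major into a single-channel (CL_R) float image are ours, not the article's:

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel
void dia_spmv_image(__global float *A, __const int rows,
                    __const int diags, __global int *offsets,
                    __read_only image2d_t x, __const int x_width,
                    __global float *y) {
  int row = get_global_id(0);
  float accumulator = 0;
  for (int diag = 0; diag < diags; diag++) {
    int col = row + offsets[diag];
    if ((col >= 0) && (col < rows)) {
      float m = A[diag * rows + row];
      /* Sample the vector through the texture hardware; the image
         cache absorbs the unaligned, reused accesses. */
      int2 coord = (int2)(col % x_width, col / x_width);
      float v = read_imagef(x, smp, coord).x;
      accumulator += m * v;
    }
  }
  y[row] = accumulator;
}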


8#
Posted 2010-10-21 21:51

A fairly ordinary matrix algorithm; I just don't see what the compiler did to optimize it so well. An Intel C++ user passing through.

9#
Posted 2010-10-22 15:33

I've only used CUDA; could someone explain what this "image" is?

10#
(OP) Posted 2010-10-22 19:54

In OpenCL, an image object is what graphics calls a texture.
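If you're coming from CUDA, it plays the same role as a texture there. A hypothetical host-side sketch (our own, using the pre-OpenCL-1.2 clCreateImage2D that the Stream SDK v2.1 era supported) packing a float vector into a single-channel image:

/* Pack the vector into a CL_R float image of width w; padded_x must
   hold w * h floats (the tail beyond rows padded with zeros). */
cl_image_format fmt = { CL_R, CL_FLOAT };
size_t w = 1024;                   /* our packing choice */
size_t h = (rows + w - 1) / w;
cl_mem img = clCreateImage2D(context,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             &fmt, w, h, 0, padded_x, NULL);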
