Data Structure

Now that we understand the need for aligned, dense memory accesses, let's design a data structure for DIA sparse matrices. To do this, first note that if we treat the matrix as a set of dense vectors, one for each diagonal that contains non-zeros, we will have converted a sparse data structure into a dense one. One such structure for our example matrix is shown in Figure 5. If you compare it with the original matrix, you'll see that we've simply taken each of the diagonals, flattened it into a vector, and then stacked the vectors to create a dense matrix representation. We've also produced an array of offsets, with one entry per diagonal recording that diagonal's position in the original matrix; these offsets are used to index into the matrix and the vector.
Figure 5: Flattened diagonals + Offset array
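To make the layout concrete, here is a minimal host-side sketch of this storage scheme in plain C. The struct and field names are illustrative rather than taken from any particular library, and a square n x n matrix is assumed.

/* Hedged sketch of DIA storage on the host (illustrative names, square matrix assumed).
 * offsets[d] records the position of diagonal d, i.e. (column - row) for its elements.
 * data holds num_diags dense vectors of length n, one per diagonal, zero-padded
 * where the diagonal runs off the edge of the matrix. */
typedef struct {
    int    n;          /* matrix dimension            */
    int    num_diags;  /* number of stored diagonals  */
    int   *offsets;    /* length num_diags            */
    float *data;       /* length num_diags * n        */
} dia_matrix;

An element (row, col) of the original matrix is non-zero only if some offsets[d] equals col - row; it then lives in diagonal d's dense vector, with the exact index depending on how the diagonals are flattened (Figure 5 versus Figure 6).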
Let's consider how this structure performs when we write an OpenCL™ C kernel to do the multiplication. Parallelizing this computation is trivial: the computation of each element of the output vector is independent, so we simply instantiate one work-item per output element. Each work-item then iterates through the set of diagonals, reading from the flattened diagonals and the vector at each iteration and accumulating the products. Since we know the work-items execute in SIMD fashion on AMD GPUs, we can see that the work-items' memory accesses to the matrix will be dense, but they will not be aligned.
We can see this because, due to the way we've flattened the diagonals, the position at which each work-item reads shifts from one row of the flattened structure to the next. Figure 6 shows an alternative way of flattening the diagonals that ensures aligned memory accesses to the matrix, given our parallelization.
Figure 6: Matrix Representation For Aligned Access
Accesses to the vector are still unaligned, but that is unavoidable: the alignment depends on where the original diagonals sit in the matrix, which is determined by the problem we're trying to solve and is therefore much more difficult to change.
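To tie the discussion together, here is a minimal sketch of such a kernel in OpenCL C. It assumes the Figure 6 layout stores element row of diagonal d at data[d * num_rows + row]; the kernel and parameter names are illustrative, not from any particular library.

// Hedged sketch of a DIA sparse matrix-vector multiply, assuming the aligned
// layout of Figure 6: element `row` of diagonal d lives at data[d * num_rows + row].
// One work-item computes one element of the output vector y.
__kernel void dia_spmv(__global const float *data,     // flattened diagonals
                       __global const int   *offsets,  // position of each diagonal
                       const int             num_rows,
                       const int             num_cols,
                       const int             num_diags,
                       __global const float *x,        // input vector
                       __global float       *y)        // output vector
{
    const int row = get_global_id(0);
    if (row >= num_rows)
        return;

    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        const int col = row + offsets[d];
        if (col >= 0 && col < num_cols) {
            // Matrix read: every work-item indexes its diagonal by `row`, so the
            // wavefront starts at the same position within each diagonal's segment
            // and consecutive work-items read consecutive addresses (dense, aligned).
            // Vector read: x[col] is dense but shifted by offsets[d], which is the
            // unavoidable unaligned access discussed above.
            sum += data[d * num_rows + row] * x[col];
        }
    }
    y[row] = sum;
}

Note that the only term in the matrix index that varies across a wavefront is row, which is why those reads stay aligned; the vector index also contains offsets[d], which shifts the wavefront's starting address for every diagonal.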