Matrix Acceleration using CUDA

SIROCCO can use CUDA/GPUs to accelerate solving linear systems using the cuSOLVER library, which is part of the NVIDIA CUDA Toolkit.

This pilot study into using GPUs in SIROCCO was conducted as an HPC RSE project at the University of Southampton.

When should you use CUDA?

Given the pilot study nature of this work, the current matrix acceleration implementation (September 2023) is simple. In most small to mid-sized models, using GPU acceleration will not improve model performance. As of writing, CUDA is only used to accelerate matrix calculations specifically, so there are no performance improvements other than in models which are bogged down by matrix calculations; e.g. such as when calculating the ionization state for a large number of ions, or models with lots of macro atom levels/emissitivies. Even so, there may only be modest improvements in model performance if the majority of time is spent transporting and scattering photons.

It is therefore only preferable to use CUDA when matrices are large enough to warrant GPU acceleration. The size of where this is important is tricky to know, as it is hardware dependent - both on your CPU and GPU. If you are using an old CPU, then you are likely to see improvements from matrices as small as circa 200 x 200. Otherwise, you may need to reach up to matrix sizes of 500 x 500 (or larger!) before there is any tangible benefit.

This situation will be improved with smarter memory management and further parallelisation. Most time is spent copying data back and forth between the GPU and CPU. As an example, consider the matrix ionization state calculation. Currently only the actual step to stop the linear system (to calculate the ionization state) has been ported to the GPU. This means each iteration toward a converged ionization state requires memory to be copied to and from the GPU, which slows things down quite a bit. If you could instead port the entire iterative procedure to the GPU (which is not that easy), there is no longer the need to make expensive memory copies each iteration which will significantly speed up the algorithm.

Requirements

To use the CUDA matrix acceleration in SIROCCO, your machine needs to have the following installed,

A CUDA-capable NVIDIA GPU
NVIDIA CUDA Toolkit
NVIDIA GPU drivers
A supported operating system (Windows or Linux) with a gcc compiler and toolchain

NVIDIA provides a list of CUDA-enabled GPUs here. Whilst the GeForce series of NVIDIA GPUs are more affordable and generally good enough, from a purely raw computation standpoint NVIDIA’s workstation and data center GPUs are more suited due differences (and additional) in hardware not included in the GeForce line of GPUs.

Installing the CUDA toolkit

The NVIDIA CUDA Toolkit is installed either through an installer downloaded from NVIDIA or can be installed via a package manager on Linux systems. It should be noted that the CUDA Toolkit does not come with NVIDIA drivers and need to be installed separately. The NVIDIA CUDA Toolkit is available at https://developer.nvidia.com/cuda-downloads and NVIDIA drivers at https://www.nvidia.co.uk/Download/index.aspx.

On Ubuntu 22.04, the toolkit and NVIDIA’s proprietary drivers are available through apt,

sudo apt install nvidia-cuda-toolkit nvidia-driver-535

How to Enable and Run CUDA

Compilation

CUDA is an additional acceleration method and is therefore not enabled by default. To enable CUDA, SIROCCO has to be compiled with the additional -DCUDA_ON flag and linked with the appropriate libraries using the NVIDIDA CUDA compiler (nvcc). There are several ways to enable the CUDA components of SIROCCO. The most simple is to run the configure script in the root SIROCCO directory with the arguments --with-cuda,

[$SIROCCO] $ ./configure --with-cuda

Configuring Makefile for SIROCCO radiative transfer code
Checking for mpicc...yes
Checking for gcc...yes
Checking for nvcc...yes
Preparing Makefile...Done.

If the NVIDIA CUDA Toolkit is found, you will see the output informing that the CUDA compiler nvcc was found.

What essentially happens when you run code is that a value for the variable NVCC is set in the Makefile in SIROCCO’s source directory. If you re-run configure without --with-cuda, then NVCC will be unset and CUDA will not be used. CUDA can be disabled or enabled “on the fly” by modifying this variable without running the configure script and by modifying the Makefile or passing the value of the variable when calling the Makefile, e.g.,

[$SIROCCO/source] $ make clean
[$SIROCCO/source] $ make sirocco NVCC=nvcc

make clean has to be run whenever CUDA is switched been enabled or disabled, due to code conditionally compiling depending on if CUDA is enabled or not.

Running

To run SIROCCO with CUDA, you run it in the exact way even parallel models running with MPI. On a HPC system the appropriate GPU resources will need to be requested in a job submission script. For example, on Iridis at the University of Southampton, a functional job submission script may look like this,

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --gpus-per-node=1
#SBATCH --time=06:00:00
#SBATCH --partition=gpu

module load openmpi/4.1.5/gcc

mpirun -n $SLURM_NTASKS py model.pf

If CUDA is enabled and no GPU resources are found, SIROCCO will exit early in the program with an appropriate error message. Note that a CUDA-aware MPI implementation is not required, as no data is communicated between GPUs.

Implementation

In this part of the documentation, we will cover the implementation details of cuSolver in SIROCCO. cuSolver is a matrix library within the NVIDIA CUDA ecosystem, designed to accelerate both dense and sparse linear algebra problems, including matrix factorisation, linear system solving and matrix inversion. To use cuSolver, very little GPU specific code needs to be written, other than code to allocate memory on the GPU. There are therefore a number of similarities between writing functions which use the cuSolver (and other CUDA mathematical libraries) and GSL libraries.

The CUDA parallel model

The main difference between CPU and GPU parallel programming is the number of (dumb) cores in a GPU. Whereas on a CPU where we divide work on a matrix into smaller chunks, on a GPU it is realistic to have each core of the GPU operate on a single element of the matrix whereas a CPU will likely have multiple elements. CUDA is a type of shared memory parallel programming, and at its core are kernels, which are specialised functions designed for massive parallelism. These kernels are executed by each thread (organized in blocks and grids), where thousands are launched and execute the code concurrently allowing for massive parallelism.

As an example, consider matrix multiplication. If the calculation is parallelised, each CPU core will likely need to calculate the matrix product for multiple elements of the matrix. On a GPU, each thread that is launched will calculate the product for only a single element. If there are enough GPU cores available, then the calculation can be done in effectively a single step which all threads calculating the product for each element at once.

A more detailed and thorough explanation of the CUDA programming model can be found in the CUDA documentation.

Basics

SIROCCO uses the dense matrix functions in cuSolver, which are accessed through the cusolverDn.h header file. To use cuSolver, it must first be initialised. To do so, we use cusolverDnCreate to create a cuSolverDnHandle_t variable which is used by cuSolver internally for resource and context management.

cuSolver is based on the Fortran library LAPACK and as such expects arrays to be ordered in column-major order like in Fortran. In C, arrays are typically ordered in row-major order and so arrays must be transposed into column-major ordering before being passed to cuSolver (an explanation of the differences between row and column major ordering can be found here). Matrices can be transposed either whilst still on the CPU, or on the GPU by using a CUDA kernel as shown in the example below,

A CUDA kernel to transpose a matrix from row to column major

__global__ void  /* __global__ is used by kernels, all of which return void */
transpose_row_to_column_major(double *row_major, double *column_major, int matrix_size)
{
    /* Determine the x and y coordinate for the thread -- these coords could be
       outside the matrix if enough threads are spawned */
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int idy = blockIdx.y * blockDim.y + threadIdx.y;

    /* Only transpose for threads inside the matrix */
    if (idx < matrix_size && idy < matrix_size) {
        column_major[idx * matrix_size + idy] = row_major[idy * matrix_size + idx];
    }
}

The syntax of the above is covered in detail in the CUDA documentation. The purpose of the kernel is take in a row major array and to transpose it to column major.

Every cuSolver (and CUDA) function returns an error status. To make code more readable, a macro is usually defined which checks the error status and raises an error message if the function does not execute successfully. This type of macro is used extensively throughout the implementation.

A useful macro for error checking cuSolver returns

#define CUSOLVER_CHECK(status)                                                                                     \
    do {                                                                                                           \
        cusolverStatus_t err = status;                                                                             \
        if (err != CUSOLVER_STATUS_SUCCESS) {                                                                      \
            Error("cuSolver Error (%d): %s (%s:%d)\n", err, cusolver_get_error_string(err), __FILE__, __LINE__);   \
            return err;                                                                                            \
        }                                                                                                          \
    } while (0)

/* Here is an example of using the macro to create a handle */
CUSOLVER_CHECK(cusolverDnCreate(&handle));

Structure

When writing CUDA C, it is convention to put the CUDA code into .cu files and the CPU code in .c files. Even when using a library like cuSolver, it is still convention to place that code into .cu files as we still need to access some CUDA library functions, such as cudaMalloc or cudaMemCpy.

The CUDA code associated with matrix parallelisation has been written in the file $SIROCCO/source/matrix_gpu.cu with the header file $SIROCCO/source/matrix_gpu.h which includes the function prototypes for the GPU matrix code. The GSL matrix code is kept in $SIROCCO/source/matrix_cpu.c with function prototypes in $SIROCCO/source/templates.h.

To be able to switch between the CUDA and GSL matrix implementations with the minimal amount of code changes, a solve_matrix wrapper function has been created. Either GSL or cuSolver is called within this wrapper, depending on if SIROCCO was compiled with the flag -DCUDA_ON as discussed earlier. This wrapper takes on the same name as the original GSL implementation, meaning no code changes have occurred in that regard.

The wrapper function which calls the appropriate matrix solver

#include "matrix_gpu.h"  /* The function prototype for gpu_solve_matrix is in here */

int
solve_matrix(double *a_matrix, double *b_vector, int matrix_size, double *x_vector)
{
    int error;

#ifdef CUDA_ON
    error = gpu_solve_matrix(...);  /* CUDA implementation */
#else
    error = cpu_solve_matrix(...);  /* GSL implementation */
#endif

    return error;
}

The following code exert is an example of using the wrapper function to solve a linear system.

The API to solve a linear system hasn’t changed

#include "sirocco.h"

double *populations = malloc(nions * sizeof(*populations));
double *ion_density = malloc(nions * sizeof(*populations));
double *rate_matrix = malloc(nions * nions * sizeof(*populations));

populate_matrices(rate_matrix, ion_density);

/* The wrapper function is named the same as the original GSL implementation
   and accepts the same arguments */
int error = solve_matrix(
    rate_matrix, ion_density, nions, populations, xplasma->nplasma
);

/* One user difference is that error handling is more robust now, and there
   is a function to convert error codes into error messages */
if (error != EXIT_SUCCESS) {
    Error(
        "Error whilst solving for ion populations: %d (%d)\n",
        get_matrix_error_string(error), error
    );
}

Here is an example of using a similar wrapper function to calculate the inverse of a matrix.

The API has changed slightly for calculating the inverse, now that it has a wrapper function

#include "sirocco.h"

double Q_matrix = malloc(matrix_size * matrix_size * sizeof(double));
double Q_inverse = malloc(matrix_size * matrix_size * sizeof(double));

populate_Q_matrix(Q_matrix);

/* The API is only different in the sense that a wrapper function now
   exists for matrix inversion */
int error = invert_matrix(
    Q_matrix, Q_inverse, matrix_size
);

if (error != EXIT_SUCCESS) {
    Error(
        "Error whilst solving for ion populations: %d (%d)\n",
        get_matrix_error_string(error), error
    );
}

To write the cuSolver implementation is similar to the GSL implementation, in that memory/resource are allocated for cuSolver and then the appropriate library functions are called. The code exert below shows an illustrated (and simplified) example of the cuSolver implementation to solve a linear system.

An illustrative example of using cuSolver to solve a linear system using LU decomposition

#include <cuSolverDn.h>

extern "C" int  /* extern "C" has to be used to make it available to the C run time */
gpu_solve_matrix(double *a_matrix, double *b_vector, int matrix_size, double *x_vector)
{
    /* First of all, allocate memory on the GPU and copy data from the CPU to the
       GPU. This uses the CUDA standard library functions, such as cudaMemCpy and
       cudaMalloc. This is part of the code is what takes the most time. */
    allocate_memory_for_gpu();
    copy_data_to_gpu();

    /* cuSolver and cuBLAS are both ports of Fortran libraries, which expect arrays to
    be in column-major format and we therefore need to transpose our row-major arrays */
    transpose_row_to_column_major<<<grid_dim, block_dim>>>(
        d_matrix_row, d_matrix_col, matrix_size
    );

    /* Perform LU decomposition. Variables prefixed with d_ are kept in GPU memory where we
    allocated space for them in `allocate_memory_for_gpu` */
    CUSOLVER_CHECK(cusolverDnDgetrf(
        CUSOLVER_HANDLE, matrix_size, matrix_size, d_matrix_col, matrix_size,
        d_workspace, d_pivot, d_info
    ));

    /* Solve the linear system A x = b. The final solution is returned in the
       variable d_v_vector */
    CUSOLVER_CHECK(cusolverDnDgetrs(
        CUSOLVER_HANDLE, CUSOLVER_OP_N, matrix_size, matrix_size, d_matrix_col,
        matrix_size, d_pivot,
        d_b_vector, matrix_size, d_info
    ));

    /* We now have to copy d_b_vector back to the CPU, so we can use that value in
    the rest of SIROCCO */
    copy_data_to_cpu();

    return EXIT_SUCCESS;
}

The naming conventions of cuSolver are discussed here. In the case above, cuSolverDnDgetrf corresponds to: cusolverDn = cuSolver Dense Matrix, D = double precision (double) and getrf = get right hand factorisation.

The most important thing to note, which may appear trivial, is the extern keyword. Without this, when the program is compiled the function gpu_solve_matrix will not be available to the C runtime. By labelling the function as extern "C", we make it clear that we want this function to be available to C source code. This only needs to be done at the function definition, and not the function prototype in, e.g., a header file.

Compiling and Linking

CUDA code is compiled using the NVIDIA CUDA Compiler nvcc. To combine both CPU and GPU code, the source must be compiled with the respective compilers (e.g. gcc/mpicc for C and nvcc for CUDA) to object code (.o files) and which are linked together using the C compiler with the appropriate library flags. In addition to needing to link the cuSolver library (-lcusolver) we also need to link the CUDA runtime library (-lcudart) when linking with the C compiler, which makes the standard CUDA library functions available to the C compiler and runtime.

The steps for compiling and link GPU and CPU code are outlined below in pseudo-Makefile code.

A brief overview on how to compile and link C and CUDA code

# Define compilers for C and CUDA. When creating a CUDA/MPI application, we can
# just as easily use mpicc for our C compiler. It makes no difference.
CC = mpicc
NVCC = nvcc

# Define C and CUDA libraries. We still include GSL as other GSL numerical routines are
# used in SIROCCO
C_LIBS = -lgsl -lgslcblas -lm
CUDA_LIBS = -lcudart -lcusolver

# Define flags for C and CUDA compilers. -DCUDA_ON is used to conditionally compile
# to use the CUDA wrappers and other things related to the CUDA build
C_FLAGS = -O3 -DCUDA_ON -DMPI_ON -I../includes -L../libs
CUDA_FLAGS = -O3 -DCUDA_ON

# Compile CUDA source to object code using the CUDA compiler
$(NVCC) $(CUDA_FLAGS) $(CUDA_SOURCE) -c -o $(CUDA_OBJECTS)

# Compile the C code using the C compiler
$(CC) $(C_FLAGS) $(C_SOURCE) -c -o $(C_OBJECTS)

# Link the CUDA and C object code and libraries together using the C compiler
$(CC) $(CUDA_OBJECTS) $(C_OBJECTS) -o sirocco $(CUDA_LIBS) $(C_LIBS)

These steps are effectively replicated in the Makefile $SIROCCO/source/Makefile, where a deconstructed example is shown below.

The variables and recipes associated with CUDA are all conditional on NVCC being defined

# If NVCC has been set in the Makefile, then we can define CUDA_FLAG = -DCUDA_ON,
# and the CUDA sources, which, at the moment, uses a wildcard to find all .cu files
ifneq ($(NVCC), )
    CUDA_FLAG = -DCUDA_ON
    CUDA_SOURCE = $(wildcard *.cu)
else
    CUDA_FLAG =
    CUDA_SOURCE =
endif

# Then the recipe to create CUDA object code looks like this. If NVCC is blank,
# nothing happens in the recipe
$(CUDA_OBJECTS): $(CUDA_SOURCE)
ifneq ($(CUDA_FLAG),)
    $(NVCC) $(NVCC_FLAGS) -DCUDA_ON -I$(INCLUDE) -c $< -o $@
endif

# So to compile SIROCCO, we have something which looks vaguely like this. Note that
# we use the CUDA_OBJECTS recipe as a requirement for the sirocco recipe. This CUSOLVER_STATUS_SUCCESS
# the CUDA source to be compiled to object code *if* NVCC is defined
sirocco: startup sirocco.o $(sirocco_objects) $(CUDA_OBJECTS)
    $(CC) $(CFLAGS) sirocco.o $(sirocco_objects) $(CUDA_OBJECTS) $(kpar_objects) $(LDFLAGS) -o sirocco