Kokkos

Author

Lucas Frérot, Lars Pastewka

Hardware abstraction

Kokkos is a hardware abstraction layer that uses templated C++. It can address NVIDIA, AMD and Intel hardware that all propagate their own programing models, in particular CUDA, HIP and SYCL, respectively. An overview of programing models can be found in Herten (2023).

Further Kokkos links:

Initialization

Kokkos needs to be initialized and finalized. You main code therefore should look like this:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Initialize Kokkos
  Kokkos::initialize(argc, argv);

  // Run the simulation
  run_simulation();

  // Finalize Kokkos
  Kokkos::finalize();
  return 0;
}

Multidimensional arrays

A memory space defines where data resides on your machine. The layout defines how the data is distributed in that memory space. These concepts are implemented in a Kokkos View is a reference-counted multi-dimensional array with compile time layouts and memory space. Its semantics are similar to that of std::shared_ptr. The View API is described here.

As a example, let us consider a one-dimensional field \(u(x)\) discretized on \(N\) equally-spaced points \(x_i\). We represent this in our by a View:

// Field is a one-dimensional array
using Field_t = Kokkos::View<double*>;

// Allocate memory
const int N{1000};
Field_t u("u", N);

Note that all Views have names (to ease debugging).

Multidimensional array implement operator() for random access. For example, we can initialize the field with

dx = 0.1; // Grid spacing
for (int i{0}; i < N; ++i) {
    u(i) = 0.1 * i * dx;
}

For multidimensional fields, e.g. the density field in the Lattice Boltzmann method, we can define a two-dimensional array and access it through both dimensions:

using Density_t = Kokkos::View<double**>

Density_t density("density", 10, 15);
density(2, 3) = 1.5;

The Kokkos documentation has more information on constructing Views.

Note

Kokkos Views are automatically reference counted, making them them similar to a std::shared_ptr. This means assigning a View to another one does not actually copy the data, e.g.

Density_t a("a", 10, 15);
Density_t b("b", 10, 15);
b = a;

does not copy anything. Both views a and b point to the same memory location after the assignment. Use deep_copy to actually copy the data. deep_copy can also be used to copy data from host to device and back.

Parallel loops

A companion to the memory space is the execution space which determines where calculations are carried out. Kokkos supports three execution patterns: Loops (parallel_for), reductions parallel_reduce and prefix sums parallel_scan. We here first discuss the loops.

The parallel loop executes part of the loop in parallel, either by vectorizing or distributing among processing units. The above initialization loop can be written with the Kokkos parallel_for pattern in the form

Kokkos::parallel_for(
    "initialize_u", N, KOKKOS_LAMBDA(const int i, {
      u(i) = 0.1 * i * dx;
    });

The macro KOKKOS_LAMBDA wrap a C++ lambda expression.

Multidimensional loops

Specifying the number of iterates N implicitly defines what Kokkos calls a range execution policy. For multidimensional loops, e.g. over a two-dimension grid, we need to explicitly define this policy:

Kokkos::parallel_for(
    "initialize_u", Kokkos::MDRangePolicy({0, 0}, {Nx, Ny}),
    KOKKOS_LAMBDA(const int i, const int j) {
        u(i, j) = 0.1 * i * dx + 0.2 * j * dy;
    });

The first and second argument of MDRangePolicy give the lower and upper bounds of the iteration.

Example: The wave equation

Kokkos Views and the parallel_for pattern are already sufficient to implement simple numerical methods. We will here illustrate the numerical solution of the wave equation \[\frac{\partial^2 u}{\partial t^2} = \frac{1}{c^2} \frac{\partial^2 u}{\partial x^2} \equiv a(x,t)\] as an example of how to use Kokkos. Here \(u\) is a displacement, e.g. of a string on which the wave propagates. The right hand side \(a(x,t)\) is the acceleration of each point \(x\) on the string.

We discretize the \(x\)-positions on an equidistant lattice with grid spacing \(\Delta x\), \(x_i=i\Delta x\) and \(u_i=u(x_i)\). The second derivative can then be approximated through the second-order central-differences approximation \[ \frac{\partial^2 u}{\partial x^2} \approx \frac{u(x_{i-1})-2u(x_i)+u(x_{i+1})}{\Delta x^2}. \]

A straightforward implementation of this expression in Kokkos looks as follows:

/// Compute the second derivative using finite differences
void second_derivative_fd(Field_t& u_second, Field_t& u, double dx) {
  // Select everything except edges
  auto n{u.extent(0) - 2};

  // Centered finite differences
  Kokkos::parallel_for(
      "second_derivative_fd", n, KOKKOS_LAMBDA(const int i) {
        u_second(i + 1) = (u(i) - 2 * u(i + 1) + u(i + 2)) / (dx * dx);
      });
}

Note that the boundary nodes \(u_0\) and \(u_{N-1}\) remain undefined.

For the propagation in time we need a time-stepping algorithm. We will here use the velocity Verlet algorithm. Since we are solving a second-order equation in time, we need two boundary values: The displacements \(u(x,t)\) and the velocities \(v(x,t)=\dot u(x, t)\), where the dot indicates the derivative with respect to time \(t\). The velocity Verlet algorithm is \[\begin{align} u(x, t+\Delta t) &= u(x,t) + v(x,t) \Delta t + \frac{a(x,t)}{2} \Delta t^2 \\ v(x, t+\Delta t) &= v(x,t) + \frac{a(x,t) + a(x,t+\Delta t)}{2} \Delta t \end{align}\] In order to avoid storing both the previous \(a(x,t)\) and current \(a(x,t+\Delta t)\) accelerations, the velocity Verlet algorithm is often formulated as a predictor-corrector algorithm. The predictor step is then \[\begin{align} v'(x, t+\Delta/2) &= v(x,t) + \frac{a(x,t)}{2} \Delta t \\ u(x, t+\Delta t) &= u(x,t) + v'(x,t+\Delta/2) \Delta t, \end{align}\] which is followed by a recalculation of the accelerations \(a(x,t+\Delta t)\) from the updated displacements \(u(x, t+\Delta t)\). The corrector steps then updates the velocities from the “predicted” velocities \(v'(x,t+\Delta/2)\): \[ v(x, t+\Delta t) = v'(x, t+\Delta/2) + \frac{a(t+\Delta t)}{2} \Delta t \] An implementation of the predictor and corrector steps of this algorithm in Kokkos looks like:

/// Verlet predictor step
void verlet_predictor(Field_t& u_second, Field_t& u_first, Field_t& u,
                      double dt) {
  auto n{u.extent(0)};
  Kokkos::parallel_for(
      "verlet_predictor", n, KOKKOS_LAMBDA(const int i) {
        u(i) += dt * u_first(i) + (0.5 * dt * dt) * u_second(i);
        u_first(i) += (dt * 0.5) * u_second(i);
      });
}

/// Verlet corrector step
void verlet_corrector(Field_t& u_second, Field_t& u_first, Field_t& u,
                      double dt) {
  auto n{u.extent(0)};
  Kokkos::parallel_for(
      "verlet_corrector", n, KOKKOS_LAMBDA(const int i) {
        u_first(i) += (dt * 0.5) * u_second(i);
      });
}

The full integration can then be carried out via

/// Verlet full step
void verlet(Field_t& u_second, Field_t& u_first, Field_t& u, double dt,
            double dx) {
  verlet_predictor(u_second, u_first, u, dt);
  second_derivative_fd(u_second, u, dx);
  verlet_corrector(u_second, u_first, u, dt);
}

The full code for the wave equation can be found in wave.cpp.

Running Kokkos code

Kokkos will run in parallel even if you have no accelerator. An example execution spaces is Kokkos::Threads. To enable threads, set Kokkos_ENABLE_THREADS to true in meson.build.

When executing the code, you need to specify the number of threads. To run our wave-equation code on 4 parallel threads,

./wave --kokkos-num-threads=4

You can get a help page by running the code with the --kokkos-help command line option.

Files

Here are the helper files for this section:

References

Herten, Andreas. 2023. “Many Cores, Many Models: GPU Programming Model Vs. Vendor Compatibility Overview.” arXiv [Cs.DC].