Kokkos
Hardware abstraction
Kokkos is a hardware abstraction layer based on templated C++. It can address NVIDIA, AMD and Intel hardware, which each promote their own programming models: CUDA, HIP and SYCL, respectively. An overview of programming models can be found in Herten (2023).
Initialization
Kokkos needs to be initialized and finalized. Your main code should therefore look like this:
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Initialize Kokkos
  Kokkos::initialize(argc, argv);

  // Run the simulation
  run_simulation();

  // Finalize Kokkos
  Kokkos::finalize();

  return 0;
}
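Alternatively, Kokkos provides a ScopeGuard that ties initialization and finalization to a scope. A minimal sketch, reusing the hypothetical run_simulation from above:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // The guard initializes Kokkos here and finalizes it automatically
  // when main returns
  Kokkos::ScopeGuard guard(argc, argv);

  // Run the simulation
  run_simulation();

  return 0;
}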
Multidimensional arrays
A memory space defines where data resides on your machine. The layout defines how the data is distributed in that memory space. These concepts are implemented in a Kokkos View: a reference-counted multidimensional array with compile-time layout and memory space. Its semantics are similar to those of std::shared_ptr. The View API is described here.
As an example, let us consider a one-dimensional field \(u(x)\) discretized on \(N\) equally-spaced points \(x_i\). We represent this in our code by a View:
// Field is a one-dimensional array
using Field_t = Kokkos::View<double*>;
// Allocate memory
const int N{1000};
Field_t u("u", N);
Note that all Views have names (to ease debugging).
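The name can be queried at runtime through the label method; a small sketch for debugging output:

#include <iostream>

// Print the name and size of the view
std::cout << u.label() << " has " << u.extent(0) << " entries" << std::endl;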
Multidimensional arrays implement operator() for random access. For example, we can initialize the field with:
double dx{0.1}; // Grid spacing
for (int i{0}; i < N; ++i) {
  u(i) = 0.1 * i * dx;
}
For multidimensional fields, e.g. the density field in the Lattice Boltzmann method, we can define a two-dimensional array and access it through both dimensions:
using Density_t = Kokkos::View<double**>;
Density_t density("density", 10, 15);
density(2, 3) = 1.5;
The Kokkos documentation has more information on constructing Views.
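The template arguments used above only fix the scalar type and the rank; the layout and memory space discussed earlier can also be specified explicitly. A sketch (the alias name is ours; the device memory space requires the corresponding backend to be enabled):

// Two-dimensional view with explicit layout and memory space;
// LayoutLeft stores the leftmost index contiguously
using DeviceDensity_t =
    Kokkos::View<double**, Kokkos::LayoutLeft,
                 Kokkos::DefaultExecutionSpace::memory_space>;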
Kokkos Views are automatically reference counted, making them similar to a std::shared_ptr. This means assigning a View to another one does not actually copy the data, e.g.
Density_t a("a", 10, 15);
Density_t b("b", 10, 15);
b = a;
does not copy anything. Both views a and b point to the same memory location after the assignment. Use deep_copy to actually copy the data. deep_copy can also be used to copy data from host to device and back.
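A minimal sketch of both uses, continuing the example above:

// Actually copy the contents of a into b
Kokkos::deep_copy(b, a);

// Copy between host and device via a host mirror view
auto a_host = Kokkos::create_mirror_view(a);
Kokkos::deep_copy(a_host, a); // device to host
Kokkos::deep_copy(a, a_host); // host to device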
Parallel loops
A companion to the memory space is the execution space, which determines where calculations are carried out. Kokkos supports three execution patterns: loops (parallel_for), reductions (parallel_reduce) and prefix sums (parallel_scan). We first discuss loops.
A parallel loop executes the loop iterations in parallel, either by vectorizing or by distributing them among processing units. The above initialization loop can be written with the Kokkos parallel_for pattern in the form:
Kokkos::parallel_for(
    "initialize_u", N, KOKKOS_LAMBDA(const int i) {
      u(i) = 0.1 * i * dx;
    });
The macro KOKKOS_LAMBDA wraps a C++ lambda expression so that it can also be compiled for device backends; with CUDA, for example, it expands to a capture-by-value lambda annotated for host and device execution.
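Reductions are not discussed further here, but for illustration, summing the entries of the field u with the parallel_reduce pattern could look like this (a sketch following the same conventions as above):

// Sum all entries of u; each thread accumulates into partial_sum,
// Kokkos combines the partial results into sum
double sum{0};
Kokkos::parallel_reduce(
    "sum_u", N,
    KOKKOS_LAMBDA(const int i, double& partial_sum) { partial_sum += u(i); },
    sum);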
Multidimensional loops
Specifying the number of iterations N implicitly defines what Kokkos calls a range execution policy. For multidimensional loops, e.g. over a two-dimensional grid, we need to define this policy explicitly:
Kokkos::parallel_for(
    "initialize_u", Kokkos::MDRangePolicy({0, 0}, {Nx, Ny}),
    KOKKOS_LAMBDA(const int i, const int j) {
      u(i, j) = 0.1 * i * dx + 0.2 * j * dy;
    });
The first and second arguments of MDRangePolicy give the lower and upper bounds of the iteration. Note that older Kokkos versions cannot deduce the rank of the policy and may require it as an explicit template parameter, i.e. Kokkos::MDRangePolicy<Kokkos::Rank<2>>.
Example: The wave equation
Kokkos Views and the parallel_for pattern are already sufficient to implement simple numerical methods. We will here illustrate the numerical solution of the wave equation \[\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2} \equiv a(x,t)\] as an example of how to use Kokkos. Here \(u\) is a displacement, e.g. of a string on which the wave propagates, and \(c\) is the wave speed. The right-hand side \(a(x,t)\) is the acceleration of each point \(x\) on the string.
We discretize the \(x\)-positions on an equidistant lattice with grid spacing \(\Delta x\), \(x_i=i\Delta x\) and \(u_i=u(x_i)\). The second derivative can then be approximated through the second-order central-difference approximation \[ \frac{\partial^2 u}{\partial x^2} \approx \frac{u(x_{i-1})-2u(x_i)+u(x_{i+1})}{\Delta x^2}. \]
A straightforward implementation of this expression in Kokkos looks as follows:
/// Compute the second derivative using finite differences
void second_derivative_fd(Field_t& u_second, Field_t& u, double dx) {
  // Select everything except edges
  auto n{u.extent(0) - 2};
  // Centered finite differences
  Kokkos::parallel_for(
      "second_derivative_fd", n, KOKKOS_LAMBDA(const int i) {
        u_second(i + 1) = (u(i) - 2 * u(i + 1) + u(i + 2)) / (dx * dx);
      });
}
Note that the second derivative remains undefined at the boundary nodes \(x_0\) and \(x_{N-1}\).
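How the boundary nodes are treated depends on the physical problem. As one possible choice (an assumption, not part of the original code), fixed ends can be modeled by zero acceleration at the boundaries, assuming a host-accessible memory space:

// Hypothetical fixed-end boundary condition: zero acceleration at both ends
u_second(0) = 0.0;
u_second(u_second.extent(0) - 1) = 0.0;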
For the propagation in time we need a time-stepping algorithm. We will here use the velocity Verlet algorithm. Since we are solving an equation that is second order in time, we need two initial values: the displacements \(u(x,t)\) and the velocities \(v(x,t)=\dot u(x, t)\), where the dot indicates the derivative with respect to time \(t\). The velocity Verlet algorithm is \[\begin{align} u(x, t+\Delta t) &= u(x,t) + v(x,t) \Delta t + \frac{a(x,t)}{2} \Delta t^2 \\ v(x, t+\Delta t) &= v(x,t) + \frac{a(x,t) + a(x,t+\Delta t)}{2} \Delta t \end{align}\] In order to avoid storing both the previous \(a(x,t)\) and current \(a(x,t+\Delta t)\) accelerations, the velocity Verlet algorithm is often formulated as a predictor-corrector algorithm. The predictor step is \[\begin{align} v'(x, t+\Delta t/2) &= v(x,t) + \frac{a(x,t)}{2} \Delta t \\ u(x, t+\Delta t) &= u(x,t) + v'(x,t+\Delta t/2) \Delta t, \end{align}\] which is followed by a recalculation of the accelerations \(a(x,t+\Delta t)\) from the updated displacements \(u(x, t+\Delta t)\). The corrector step then updates the velocities from the "predicted" velocities \(v'(x,t+\Delta t/2)\): \[ v(x, t+\Delta t) = v'(x, t+\Delta t/2) + \frac{a(x,t+\Delta t)}{2} \Delta t \] An implementation of the predictor and corrector steps of this algorithm in Kokkos looks like:
/// Verlet predictor step
void verlet_predictor(Field_t& u_second, Field_t& u_first, Field_t& u,
                      double dt) {
  auto n{u.extent(0)};
  Kokkos::parallel_for(
      "verlet_predictor", n, KOKKOS_LAMBDA(const int i) {
        u(i) += dt * u_first(i) + (0.5 * dt * dt) * u_second(i);
        u_first(i) += (dt * 0.5) * u_second(i);
      });
}
/// Verlet corrector step
void verlet_corrector(Field_t& u_second, Field_t& u_first, Field_t& u,
                      double dt) {
  auto n{u.extent(0)};
  Kokkos::parallel_for(
      "verlet_corrector", n, KOKKOS_LAMBDA(const int i) {
        u_first(i) += (dt * 0.5) * u_second(i);
      });
}
The full integration can then be carried out via
/// Verlet full step
void verlet(Field_t& u_second, Field_t& u_first, Field_t& u, double dt,
            double dx) {
  verlet_predictor(u_second, u_first, u, dt);
  second_derivative_fd(u_second, u, dx);
  verlet_corrector(u_second, u_first, u, dt);
}
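A minimal driver sketch for the time integration; the field size, grid spacing, time step and number of steps are illustrative assumptions:

// Allocate displacement, velocity and acceleration fields
const int N{1000};
const double dx{0.1}, dt{0.01};
Field_t u("u", N), u_first("u_first", N), u_second("u_second", N);

// Initial accelerations from the initial displacements
second_derivative_fd(u_second, u, dx);

// Time integration loop
const int nb_steps{1000};
for (int step{0}; step < nb_steps; ++step) {
  verlet(u_second, u_first, u, dt, dx);
}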
The full code for the wave equation can be found in wave.cpp.
Running Kokkos code
Kokkos will run in parallel even if you have no accelerator. An example execution space is Kokkos::Threads. To enable threads, set Kokkos_ENABLE_THREADS to true in meson.build.
When executing the code, you need to specify the number of threads. To run our wave-equation code on 4 parallel threads:
./wave --kokkos-num-threads=4
You can get a help page by running the code with the --kokkos-help command-line option.
Files
Here are the helper files for this section: