Compiling Kokkos for GPUs

Author

Lars Pastewka

Overview

Kokkos provides a unified programming model that works across different GPU architectures by abstracting the underlying programming models. NVIDIA GPUs use CUDA, while AMD GPUs use HIP (Heterogeneous-compute Interface for Portability). This chapter explains how to compile Kokkos code to target these different GPU backends.

Hardware Architectures

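Before compiling for a specific GPU, you need to know what hardware you’re targeting. Kokkos generates device code for the architecture you name at configure time, so the architecture flag must match your GPU: a mismatch can cost performance or prevent kernels from launching at all.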

NVIDIA GPU Architectures

NVIDIA GPUs have evolved through several architecture generations. Some common ones:

Architecture   Kokkos Flag            GPU Examples               Notes
Hopper         Kokkos_ARCH_HOPPER90   H100                       Current data center generation
Ada Lovelace   Kokkos_ARCH_ADA89      RTX 40 series              Consumer-grade Ada
Ampere         Kokkos_ARCH_AMPERE80   A100, A30                  Data center Ampere
Ampere         Kokkos_ARCH_AMPERE86   RTX 30 series, A10, A40    Consumer and workstation Ampere
Turing         Kokkos_ARCH_TURING75   RTX 20 series, T4          Older consumer and data center GPUs
Volta          Kokkos_ARCH_VOLTA70    V100, Titan V              Previous generation data center

To check your GPU, use:

nvidia-smi
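The compute capability reported by the driver maps directly onto the numeric suffix of the Kokkos flag (8.0 becomes AMPERE80, and so on). The following is a minimal sketch of that lookup. It assumes a driver recent enough that nvidia-smi supports the compute_cap query field; the helper name kokkos_arch_from_cc is hypothetical and covers only a handful of common architectures.

```shell
# Hypothetical helper: map a CUDA compute capability (e.g. "8.0")
# to the matching Kokkos architecture flag.
kokkos_arch_from_cc() {
    case "$1" in
        7.0) echo "Kokkos_ARCH_VOLTA70" ;;
        7.5) echo "Kokkos_ARCH_TURING75" ;;
        8.0) echo "Kokkos_ARCH_AMPERE80" ;;
        8.6) echo "Kokkos_ARCH_AMPERE86" ;;
        8.9) echo "Kokkos_ARCH_ADA89" ;;
        9.0) echo "Kokkos_ARCH_HOPPER90" ;;
        *)   echo "unknown compute capability: $1" >&2; return 1 ;;
    esac
}

# Query the first GPU, but only if nvidia-smi exists on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    kokkos_arch_from_cc "$cc" || true
fi
```

The printed name can be passed to CMake as -D<flag>=ON when configuring Kokkos.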

AMD GPU Architectures

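AMD's data center GPUs (the Instinct MI series) use the CDNA architecture family; RDNA is the consumer Radeon family. Recent Kokkos releases name the AMD architecture flags after the GPU's gfx target (older releases used Kokkos_ARCH_VEGA* names):

Architecture   Kokkos Flag              GPU Examples
CDNA3          Kokkos_ARCH_AMD_GFX942   MI300X
CDNA2          Kokkos_ARCH_AMD_GFX90A   MI200 series (MI210, MI250)
CDNA           Kokkos_ARCH_AMD_GFX908   MI100
Vega 20        Kokkos_ARCH_AMD_GFX906   MI50, MI60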

To check your AMD GPU and ROCm version, use:

rocm-smi
hipcc --version
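ROCm identifies each GPU by a gfx target name such as gfx90a, and recent Kokkos releases use that name in the architecture flag. The sketch below shows one way to do the lookup. It assumes the rocminfo tool from ROCm is on the PATH and that your Kokkos version uses the Kokkos_ARCH_AMD_GFX* flag names; the helper name kokkos_arch_from_gfx is hypothetical.

```shell
# Hypothetical helper: map a ROCm gfx target (e.g. "gfx90a")
# to the matching Kokkos architecture flag.
kokkos_arch_from_gfx() {
    case "$1" in
        gfx906) echo "Kokkos_ARCH_AMD_GFX906" ;;   # MI50, MI60
        gfx908) echo "Kokkos_ARCH_AMD_GFX908" ;;   # MI100
        gfx90a) echo "Kokkos_ARCH_AMD_GFX90A" ;;   # MI200 series
        gfx942) echo "Kokkos_ARCH_AMD_GFX942" ;;   # MI300X
        *)      echo "unknown gfx target: $1" >&2; return 1 ;;
    esac
}

# Take the first gfx target that rocminfo reports, if the tool is installed.
if command -v rocminfo >/dev/null 2>&1; then
    gfx=$(rocminfo | grep -o -m 1 'gfx[0-9a-f]*' || true)
    kokkos_arch_from_gfx "$gfx" || true
fi
```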

Compiling with CUDA (NVIDIA GPUs)

Prerequisites

Before compiling, ensure you have:

  1. NVIDIA CUDA Toolkit installed (version 11.8 or later recommended)
  2. NVIDIA GPU drivers that support your GPU
  3. CMake build tool (at least version 3.21)
  4. A host C++ compiler such as g++ (nvcc, the CUDA compiler driver, ships with the CUDA Toolkit but still needs a host compiler)

Check your CUDA installation:

nvcc --version
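To check all prerequisites in one go, a small loop over the required tools is enough. This is a minimal sketch; the helper name check_tools is hypothetical, and g++ stands in for whichever host compiler you use.

```shell
# Hypothetical helper: report which of the given tools are on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools needed for a CUDA build of Kokkos.
check_tools nvcc cmake g++ || echo "Install the missing tools before configuring."
```

For a HIP build, the same helper can be called with hipcc and cmake instead.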

CMake Configuration for CUDA

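First, install Kokkos with CUDA support. Kokkos does not accept nvcc itself as the C++ compiler; use the nvcc_wrapper script from the Kokkos source tree, which calls nvcc together with your host C++ compiler:

git clone https://github.com/kokkos/kokkos.git
cd kokkos
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DKokkos_ENABLE_SERIAL=ON \
      -DKokkos_ENABLE_THREADS=ON \
      -DKokkos_ENABLE_CUDA=ON \
      -DKokkos_ARCH_AMPERE80=ON \
      -DCMAKE_CXX_COMPILER="$PWD/../bin/nvcc_wrapper" \
      ..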
cmake --build . -j $(nproc)
cmake --install . --prefix ~/kokkos-install

Then use Kokkos in your project’s CMakeLists.txt:

cmake_minimum_required(VERSION 3.21)
project(WaveEquation LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(Kokkos REQUIRED)

add_executable(wave wave.cpp)
target_link_libraries(wave PRIVATE Kokkos::kokkos)

Build Commands

Set up and compile with CUDA:

# Create build directory
mkdir build
cd build

# Configure with Kokkos location (use $HOME: the shell does not expand ~ after -D...=)
cmake -DKokkos_ROOT=$HOME/kokkos-install ..

# Compile
cmake --build . -j $(nproc)

# Run on GPU (the CUDA backend was fixed when Kokkos was built)
./wave

Running on GPUs

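A CUDA-compiled Kokkos program runs on the GPU by default; the backend is fixed at build time and there is no runtime flag to select it. If the program passes argc and argv to Kokkos::initialize, the following runtime options are available:

# Run on the default GPU
./wave

# Use specific GPU (if you have multiple); older Kokkos releases used --kokkos-device
./wave --kokkos-device-id=0

# View available options
./wave --kokkos-help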

Advanced CUDA Options

For different NVIDIA architectures, change the Kokkos architecture flag when configuring Kokkos:

# For A100 (data center)
cmake -DKokkos_ARCH_AMPERE80=ON ..

# For RTX 3090 (consumer Ampere)
cmake -DKokkos_ARCH_AMPERE86=ON ..

# For RTX 4090 (consumer Ada Lovelace)
cmake -DKokkos_ARCH_ADA89=ON ..

# For V100
cmake -DKokkos_ARCH_VOLTA70=ON ..

# For T4
cmake -DKokkos_ARCH_TURING75=ON ..

Compiling with HIP (AMD GPUs)

Prerequisites

Before compiling for AMD GPUs, ensure you have:

  1. AMD ROCm installed (version 5.0 or later recommended)
  2. AMD GPU drivers for ROCm
  3. CMake build tool (at least version 3.21)
  4. hipcc compiler (comes with ROCm)

Check your ROCm installation:

hipcc --version
rocm-smi

CMake Configuration for HIP

First, install Kokkos with HIP support:

git clone https://github.com/kokkos/kokkos.git
cd kokkos
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DKokkos_ENABLE_SERIAL=ON \
      -DKokkos_ENABLE_THREADS=ON \
      -DKokkos_ENABLE_HIP=ON \
      -DKokkos_ARCH_AMD_GFX90A=ON \
      -DCMAKE_CXX_COMPILER=hipcc \
      ..
cmake --build . -j $(nproc)
cmake --install . --prefix ~/kokkos-install

Then use Kokkos in your project’s CMakeLists.txt:

cmake_minimum_required(VERSION 3.21)
project(WaveEquation LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(Kokkos REQUIRED)

add_executable(wave wave.cpp)
target_link_libraries(wave PRIVATE Kokkos::kokkos)

Build Commands

Set up and compile with HIP:

# Create build directory
mkdir build
cd build

# Configure with Kokkos location (use $HOME: the shell does not expand ~ after -D...=)
cmake -DKokkos_ROOT=$HOME/kokkos-install -DCMAKE_CXX_COMPILER=hipcc ..

# Compile
cmake --build . -j $(nproc)

# Run on GPU (the HIP backend was fixed when Kokkos was built)
./wave

Running on GPUs

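A HIP-compiled Kokkos program runs on the GPU by default; the backend is fixed at build time and there is no runtime flag to select it. If the program passes argc and argv to Kokkos::initialize, the following runtime options are available:

# Run on the default GPU
./wave

# Use specific GPU (if you have multiple); older Kokkos releases used --kokkos-device
./wave --kokkos-device-id=0

# View available options
./wave --kokkos-help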

Advanced HIP Options

For different AMD architectures, change the Kokkos architecture flag when configuring Kokkos:

# For MI300X
cmake -DKokkos_ARCH_AMD_GFX942=ON ..

# For MI200 series (MI210, MI250)
cmake -DKokkos_ARCH_AMD_GFX90A=ON ..

# For MI100
cmake -DKokkos_ARCH_AMD_GFX908=ON ..

Unified CMake Configuration

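A single Kokkos build can combine several backends: the Serial backend, one host-parallel backend (OpenMP or Threads, not both), and one device backend (CUDA or HIP, not both). Kokkos compiles all enabled backends into the same library:

# Configure Kokkos with multiple backend support
cmake -DCMAKE_BUILD_TYPE=Release \
      -DKokkos_ENABLE_SERIAL=ON \
      -DKokkos_ENABLE_OPENMP=ON \
      -DKokkos_ENABLE_CUDA=ON \
      -DKokkos_ARCH_AMPERE80=ON \
      -DKokkos_ENABLE_HIP=OFF \
      -DCMAKE_CXX_COMPILER="$PWD/../bin/nvcc_wrapper" \
      ..
cmake --build . -j $(nproc)
cmake --install . --prefix ~/kokkos-install

The backend is chosen in the code, not on the command line. Kokkos::DefaultExecutionSpace is the device backend (CUDA here) and Kokkos::DefaultHostExecutionSpace is the host-parallel backend (OpenMP here). To run a kernel on another enabled backend, name its execution space in the policy, for example Kokkos::RangePolicy<Kokkos::Serial>. At runtime you can only tune the backends that are enabled:

# Run with the default execution space (NVIDIA GPU)
./wave

# Select a GPU and limit the host backend to 4 threads
./wave --kokkos-device-id=0 --kokkos-num-threads=4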

Troubleshooting

CUDA/NVCC Compilation Errors

Problem: nvcc: not found
Solution: Add CUDA to your PATH. If using the default CUDA installation:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
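If you put the exports above in a shell startup file, PATH gains a duplicate entry every time the file is sourced. A guarded variant avoids that. This is a minimal sketch; the helper name prepend_path is hypothetical.

```shell
# Hypothetical helper: prepend a directory to PATH if it exists
# and is not already listed.
prepend_path() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    case ":$PATH:" in
        *":$dir:"*) ;;                        # already present, nothing to do
        *) PATH="$dir:$PATH"; export PATH ;;
    esac
}

prepend_path /usr/local/cuda/bin || echo "CUDA is not installed in the default location."
```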

Problem: Architecture mismatch errors
Solution: Verify your GPU architecture with nvidia-smi and set the correct Kokkos_ARCH_* flag when configuring Kokkos with CMake.

HIP/ROCm Compilation Errors

Problem: hipcc: not found
Solution: Load the ROCm module (on HPC clusters):

module load rocm

Or add to PATH if ROCm is installed locally:

export PATH=/opt/rocm/bin:$PATH

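Problem: “Could not find HIP”
Solution: Point CMake at the ROCm installation explicitly. Recent ROCm releases install the HIP CMake files under the ROCm root (assumed here to be /opt/rocm) rather than under a separate hip subdirectory:

export ROCM_PATH=/opt/rocm
cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..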

CMake Configuration Issues

Problem: “CMake version too old”
Solution: Upgrade CMake to at least version 3.21. See notes/CMake.qmd for details.
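To see whether the installed CMake meets the 3.21 minimum before configuring, compare version strings in version order rather than as plain text (as text, 3.9 would sort after 3.21). A minimal sketch, assuming a sort that supports -V as GNU coreutils does; the helper name version_ge is hypothetical.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) putting the smaller version first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed CMake against the minimum used in this chapter.
if command -v cmake >/dev/null 2>&1; then
    found=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$found" 3.21; then
        echo "CMake $found is new enough"
    else
        echo "CMake $found is older than 3.21"
    fi
fi
```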

Problem: CMake not found for Kokkos subproject
Solution: Ensure CMake is installed:

# Ubuntu/Debian
sudo apt install cmake

# macOS
brew install cmake

Performance Considerations

Device Selection

  • NVIDIA A100: Ampere data center GPU, widely deployed for general-purpose HPC, uses Kokkos_ARCH_AMPERE80
  • NVIDIA H100: Hopper data center GPU, uses Kokkos_ARCH_HOPPER90
  • AMD MI250: CDNA2 data center GPU, uses Kokkos_ARCH_AMD_GFX90A
  • AMD MI300X: CDNA3 data center GPU with the highest memory bandwidth of the four, uses Kokkos_ARCH_AMD_GFX942

Memory Management

Both CUDA and HIP support unified memory, but explicit device transfers via Kokkos::deep_copy often provide better performance:

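// Allocate on device (the default memory space); N is the number of elements
Kokkos::View<double*> device_view("device", N);

// Allocate a host view with the same shape and layout
auto host_view = Kokkos::create_mirror_view(device_view);

// Copy from host to device
Kokkos::deep_copy(device_view, host_view);

// Copy from device to host
Kokkos::deep_copy(host_view, device_view);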

Occupancy and Block Sizes

For optimal GPU utilization, consider:

  • GPU memory bandwidth vs. computation ratio
  • Thread block size (typically multiples of 32 for NVIDIA, 64 for AMD)
  • Register pressure and shared memory usage

Kokkos handles many of these automatically, but understanding these concepts helps optimize your kernels.

Further Resources