Memory Performance: Caching and Coalescence
Overview
Modern CPUs and GPUs are incredibly fast at computation, but memory access is often the bottleneck in real applications. A single floating-point operation might complete in 1-2 nanoseconds, but fetching data from main memory can take hundreds of nanoseconds. This lecture explains why memory layout matters for performance and how CPUs and GPUs handle memory access differently through caching and memory coalescing.
By understanding these concepts, you’ll be able to:
- Write code that’s cache-friendly on CPUs
- Write code that uses GPU memory efficiently (coalescence)
- Understand why the same algorithm can have vastly different performance on different hardware
Why Memory Matters
Let’s start with a concrete example. Consider this simple loop:
```cpp
// Bad: random memory access
for (int i = 0; i < 1000000; i++) {
    int idx = rand() % N;  // Random index
    sum += data[idx];
}

// Good: sequential memory access
for (int i = 0; i < N; i++) {
    sum += data[i];
}
```
Both loops do roughly the same amount of work (summing elements), but the second is typically 10-100x faster because of how its memory accesses are organized. This difference comes from caching and memory hierarchies.
Memory Hierarchy
Modern computers have multiple levels of memory, each with different sizes and speeds:
```
Tiny (Bytes)             Very Fast (1-2 ns)     CPU Registers
          ↓
Small (KBs)              Fast (3-5 ns)          L1 Cache
          ↓
Medium (100s of KB)      Medium (12-20 ns)      L2 Cache
          ↓
Large (MBs)              Slower (40-75 ns)      L3 Cache (Shared)
          ↓
Very Large (100s of GB)  Slow (100-300 ns)      Main Memory (RAM)
          ↓
Enormous (TBs)           Very Slow (ms)         Storage (Disk/SSD)
```
On CPUs, caching happens automatically. The hardware detects memory access patterns and keeps frequently accessed data in fast caches. But this automatic caching is limited—if you access data in a random pattern, the caches can’t help.
CPU Cache: Understanding Cache Lines
CPUs fetch memory in small chunks called cache lines (typically 64 bytes). When you access a single byte, the CPU fetches the entire cache line into the cache.
Here’s what happens:
```
Memory Layout: [data[0]][data[1]][data[2]][data[3]] ... [data[15]] ...
               ↑────────────── One cache line (64 bytes) ──────────↑
```
If you access data[0], you automatically get data[0] through data[15] (for 4-byte integers) in the cache.
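As a quick sanity check on that claim, here is the arithmetic as a minimal sketch (64-byte lines are typical for x86; the exact size is hardware-dependent):

```cpp
#include <cstddef>

// Typical cache line size; hardware-dependent (64 bytes on most x86 CPUs).
constexpr std::size_t cache_line_bytes = 64;

// With 4-byte ints, one line holds data[0] through data[15]:
constexpr std::size_t ints_per_line    = cache_line_bytes / sizeof(int);     // 16
constexpr std::size_t doubles_per_line = cache_line_bytes / sizeof(double);  // 8
```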
Sequential Access (Cache-Friendly)
```cpp
for (int i = 0; i < N; i++) {
    sum += data[i];
}
```
What happens:
1. Access data[0] → Load the entire cache line into the cache
2. Access data[1] through data[15] → Already in cache! Super fast (cache hit)
3. Access data[16] → Load a new cache line
4. Continue…
Result: Cache hit rate is ~95%. Most accesses are fast because neighbors are already loaded.
Random Access (Cache-Unfriendly)
```cpp
for (int i = 0; i < 1000000; i++) {
    int idx = random_index();
    sum += data[idx];
}
```
What happens:
1. Access a random location → Load its cache line
2. Access another random location → Usually NOT in cache; load a new cache line
3. Repeat…
Result: Cache miss rate is ~95%. Almost every access requires loading from main memory (slow!).
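If you want to observe the gap yourself, here is a minimal, self-contained benchmark sketch (the size `N` and the shuffled-index setup are choices made here, not part of the lecture; the exact ratio varies by machine and compiler flags):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;  // ~16M ints (~64 MB), larger than L3
    std::vector<int> data(N, 1);

    // Pre-generate a random permutation so index generation isn't timed.
    std::vector<std::size_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937{42});

    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) sum += data[i];       // sequential
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) sum += data[idx[i]];  // random
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sum=%lld sequential=%lld ms random=%lld ms\n", sum,
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}
```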
CPU Cache Optimization: Multi-Dimensional Arrays
For multi-dimensional arrays, the layout in memory determines performance:
Row-Major Layout (C/C++ default)
```cpp
// C++ array layout in memory (row-major):
// A[0][0], A[0][1], A[0][2], ..., A[1][0], A[1][1], ...
double A[M][N];

// GOOD: Access row-by-row (sequential in memory)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A[i][j];  // Sequential access!
    }
}

// BAD: Access column-by-column (non-sequential in memory)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += A[i][j];  // Jumps around memory
    }
}
```
The first version is often 10x faster because row-by-row traversal follows the memory layout. The second jumps by N * sizeof(double) bytes on each access, missing the cache repeatedly.
Actual Performance
On a modern CPU with a matrix of 10,000 × 10,000 doubles (800 MB):
- Row-major access: ~2 seconds
- Column-major access: ~20 seconds
- Difference: 10x speedup just from memory layout!
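Here is a sketch of how you might reproduce a comparison like this (the matrix is smaller here so it allocates quickly; absolute times are machine-dependent):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int M = 8192, N = 8192;  // 64M doubles = 512 MB
    std::vector<double> A((std::size_t)M * N, 1.0);
    double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < M; i++)    // row-major order: stride 1
        for (int j = 0; j < N; j++) sum += A[(std::size_t)i * N + j];
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < N; j++)    // column order: stride of N doubles
        for (int i = 0; i < M; i++) sum += A[(std::size_t)i * N + j];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sum=%.0f row-major=%lld ms column=%lld ms\n", sum,
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}
```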
GPU Memory: Coalescence
GPUs handle memory very differently from CPUs. GPUs do have caches, but they are small relative to the number of threads in flight and far less effective at hiding latency than CPU caches. Instead, GPUs rely primarily on memory coalescing.
What is Memory Coalescence?
When multiple GPU threads access memory simultaneously, the GPU groups these accesses together into a single transaction. This is efficient when: 1. Threads access consecutive memory addresses 2. The memory access pattern is predictable
Think of it like a bank teller: serving customers one-by-one is slow, but if multiple customers line up in order (addresses 0, 1, 2, … in sequence), the teller can process them all in one batch efficiently (coalesced access).
Coalesced Access (GPU-Friendly)
```cpp
// GPU kernel - all threads access consecutive memory
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    result[i] = data[i] * 2.0;  // Thread i accesses data[i]
});
```
What happens:
1. Thread 0 wants data[0]
2. Thread 1 wants data[1]
3. Thread 2 wants data[2]
4. … Thread 31 wants data[31]
5. The GPU groups these 32 requests into a single memory transaction containing all 32 values
6. Result: 32 memory accesses served in 1 transaction (32x efficiency!)
Non-Coalesced Access (GPU-Unfriendly)
```cpp
// AVOID: Threads access scattered memory
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    int idx = (i * STRIDE) % N;  // Large stride causes poor coalescence
    result[i] = data[idx] * 2.0;
});
```
What happens:
1. Thread 0 wants data[0]
2. Thread 1 wants data[STRIDE]
3. Thread 2 wants data[2*STRIDE]
4. … the addresses are far apart
5. The GPU can’t combine these into one transaction
6. Result: each access needs its own transaction; GPU memory bandwidth is wasted
Memory Layout for GPUs
Similar to CPUs, memory layout matters on GPUs:
Good (contiguous data). Note that summing is a reduction, so `parallel_reduce` is used rather than `parallel_for` to avoid a data race on `sum`:
```cpp
Kokkos::View<double*> data("data", N);  // Contiguous memory
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += data[i];  // Coalesced access
}, sum);
```
Bad (scattered data):
```cpp
std::vector<double*> pointers;  // Array of pointers to scattered memory
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += *pointers[i];  // Indirect access, hard to coalesce
    // (A std::vector of host pointers won't even work in GPU device code;
    //  it is shown here only to illustrate the scattered-access pattern.)
}, sum);
```
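If your data starts out behind pointers or in STL containers, a common remedy is to pack it once into a contiguous View on the host and copy it to the device in one transfer. The sketch below assumes the data can be gathered into a `std::vector` first; `create_mirror_view_and_copy` is standard Kokkos:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

double sum_on_device(const std::vector<double>& host_data) {
    // Pack into a contiguous host View...
    Kokkos::View<double*, Kokkos::HostSpace> h_view("h_view", host_data.size());
    for (std::size_t i = 0; i < host_data.size(); i++) h_view(i) = host_data[i];

    // ...then one bulk transfer to device memory.
    auto d_view = Kokkos::create_mirror_view_and_copy(
        Kokkos::DefaultExecutionSpace{}, h_view);

    double sum = 0.0;
    Kokkos::parallel_reduce(host_data.size(),
        KOKKOS_LAMBDA(int i, double& local_sum) {
            local_sum += d_view(i);  // Contiguous, coalesced reads
        }, sum);
    return sum;
}
```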
CPU vs. GPU Memory Requirements
Here’s a summary of how CPUs and GPUs differ:
| Aspect | CPU | GPU |
|---|---|---|
| Memory Access Latency | High (100-300 ns) | Very High (400-1000 ns) |
| Cache Strategy | Automatic caching (large caches) | Manual coalescing (minimal cache) |
| Optimal Pattern | Sequential or temporal locality | Coalesced (all threads access nearby addresses) |
| Array Layout | Row-major: Access rows sequentially | Coalesced: Threads map to contiguous memory |
| Stride Tolerance | Can handle large strides (cache takes care of it) | Stride of 1 is best; larger strides kill performance |
| Bandwidth | Moderate (50-200 GB/s) | Very High (500+ GB/s) when coalesced |
Practical Example: Matrix Multiplication
Let’s see how memory layout affects a real algorithm:
CPU-Optimized Version
```cpp
// CPU version: iterate in cache-friendly order
for (int i = 0; i < M; i++) {
    for (int j = 0; j < K; j++) {
        double a_ij = A[i][j];
        for (int k = 0; k < N; k++) {
            C[i][k] += a_ij * B[j][k];  // Accesses C and B sequentially
        }
    }
}
```
Why is this cache-friendly?
- The inner loop accesses B[j][k] and C[i][k] with stride 1 (sequential)
- Data from one cache line is reused multiple times
- The hardware prefetcher can fetch upcoming cache lines
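For contrast, here is the textbook ordering with the inner two loops swapped, so the innermost loop walks B down a column (stride N in memory); on large matrices this is typically several times slower:

```cpp
// Naive ordering: the inner loop strides through B column-wise
for (int i = 0; i < M; i++) {
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int j = 0; j < K; j++) {
            sum += A[i][j] * B[j][k];  // consecutive j: jumps N doubles apart in B
        }
        C[i][k] = sum;
    }
}
```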
GPU-Optimized Version
```cpp
// GPU version: organize computation for coalescence
Kokkos::View<double**> A("A", M, K);
Kokkos::View<double**> B("B", K, N);
Kokkos::View<double**> C("C", M, N);

// Tile-based approach for better memory access patterns
// (Note: Kokkos::Cuda here specifies an NVIDIA GPU; for AMD use Kokkos::HIP.
//  This sketch assumes M and N are multiples of tile_size; otherwise clamp
//  the tile bounds.)
int tile_size = 32;  // Typical block size
for (int i_tile = 0; i_tile < M; i_tile += tile_size) {
    for (int j_tile = 0; j_tile < N; j_tile += tile_size) {
        // Compute one tile of C with many threads
        Kokkos::parallel_for(
            Kokkos::MDRangePolicy<Kokkos::Cuda, Kokkos::Rank<2>>(
                {i_tile, j_tile}, {i_tile + tile_size, j_tile + tile_size}),
            KOKKOS_LAMBDA(int i, int j) {
                double sum = 0.0;
                for (int k = 0; k < K; k++) {
                    sum += A(i, k) * B(k, j);
                }
                C(i, j) += sum;
            });
    }
}
```
Why is this good for the GPU?
- Threads in a block compute nearby output elements
- All threads with the same i read row i of A, so those loads are shared across threads
- Threads with consecutive j read nearby elements of B (near-coalesced)
- Multiple threads reuse the same data, reducing bandwidth requirements
Memory Layout in Kokkos
Kokkos provides different layouts to control memory organization:
Layout Right (Row-Major, CPU Default)
```cpp
Kokkos::View<double**, Kokkos::LayoutRight> A("A", M, N);
// Memory layout: [A(0,0)][A(0,1)]...[A(0,N-1)][A(1,0)][A(1,1)]...
// Good for: sequential row access
// Note: LayoutRight is the default for Views in host (CPU) memory
```
Layout Left (Column-Major, GPU Default)
```cpp
Kokkos::View<double**, Kokkos::LayoutLeft> B("B", M, N);
// Memory layout: [B(0,0)][B(1,0)]...[B(M-1,0)][B(0,1)][B(1,1)]...
// Good for: sequential column access, Fortran compatibility
// Note: LayoutLeft is the default for Views in GPU memory, because mapping
// the thread index to the first dimension then gives coalesced access
```
Choosing the Right Layout
- In most cases, don’t specify a layout at all: Kokkos picks the default suited to each memory space (LayoutRight on CPU, LayoutLeft on GPU)
- Specify Layout Left explicitly if your algorithm naturally accesses columns (rare), or when interfacing with Fortran code
```cpp
// CPU-friendly: row-major traversal on the host
Kokkos::View<double**> A("A", M, N);
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A(i, j);
    }
}

// GPU-friendly: ensure coalesced access through kernel design
Kokkos::View<double**> B("B", M, N);
Kokkos::View<double**> result("result", M, N);
Kokkos::parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Cuda, Kokkos::Rank<2>>({0, 0}, {M, N}),
    KOKKOS_LAMBDA(int i, int j) {
        result(i, j) = B(i, j) * 2.0;
    });
```
Common Pitfalls
1. Column-Wise Access on CPU
```cpp
// ❌ WRONG: Column-wise access on CPU (cache misses)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += A[i][j];  // Jumps N*sizeof(double) bytes each time
    }
}

// ✓ CORRECT: Row-wise access on CPU
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A[i][j];  // Sequential memory access
    }
}
```
2. Large Strides on GPU
```cpp
// ❌ WRONG: Large strides prevent coalescing
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    int idx = i * large_stride;  // Thread i accesses far-apart addresses
    result[i] = data[idx];
});

// ✓ CORRECT: Unit stride for coalescing
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    result[i] = data[i];  // Thread i accesses data[i], perfectly coalesced
});
```
3. Indirect Memory Access
```cpp
// ❌ WRONG: Unpredictable access pattern
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    int idx = indices[i];    // idx values are effectively random
    local_sum += data[idx];  // Scattered reads: poor coalescing
}, sum);

// ✓ BETTER: Predictable access pattern, or restructure the data
// (sort the indices first, or use a data layout that matches the access pattern)
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += data[i];    // Sequential, coalesced access
}, sum);
```
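When the indirection can’t be removed, sorting the indices can still recover locality. For a reduction the visit order doesn’t matter, so a one-time host-side sort makes the device reads mostly sequential. This is a sketch; the function and variable names are illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <algorithm>
#include <vector>

double gather_sum(std::vector<int> indices, Kokkos::View<double*> data) {
    // One-time host-side cost; amortized if the gather runs many times.
    std::sort(indices.begin(), indices.end());

    Kokkos::View<int*, Kokkos::HostSpace> h_idx("h_idx", indices.size());
    for (std::size_t i = 0; i < indices.size(); i++) h_idx(i) = indices[i];
    auto d_idx = Kokkos::create_mirror_view_and_copy(
        Kokkos::DefaultExecutionSpace{}, h_idx);

    double sum = 0.0;
    Kokkos::parallel_reduce(indices.size(),
        KOKKOS_LAMBDA(int i, double& local_sum) {
            local_sum += data(d_idx(i));  // Mostly increasing addresses
        }, sum);
    return sum;
}
```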
Measuring Memory Performance
CPU: Cache Misses
Use profiling tools to measure cache behavior:
```bash
# Linux: Use perf to measure cache misses
perf stat -e cache-references,cache-misses ./program

# Output might show:
#   1,234,567  cache-references
#     123,456  cache-misses    # 10.0% of all cache refs
```
Lower cache-miss percentage is better. Aim for < 5%.
GPU: Memory Bandwidth
Check if your GPU code is bandwidth-bound:
```cpp
// Measure elapsed time and data transferred
// (the factor of 2 counts one read plus one write per element)
double bandwidth_gb_s = (2.0 * N * sizeof(double)) / (time_seconds * 1e9);

// Compare to peak memory bandwidth:
//   A100:     ~2 TB/s peak
//   RTX 3090: ~936 GB/s peak
// If measured << peak, your code is probably not coalescing well
```
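Here is a sketch of how to obtain `time_seconds` for the copy-scale kernel from earlier, using `Kokkos::Timer`. The `Kokkos::fence()` matters: without it you would time only the asynchronous kernel launch:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

void measure_bandwidth(int N) {
    Kokkos::View<double*> data("data", N), result("result", N);

    Kokkos::Timer timer;
    Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) { result(i) = data(i) * 2.0; });
    Kokkos::fence();  // Wait for the (asynchronous) kernel to finish
    double time_seconds = timer.seconds();

    // One 8-byte read + one 8-byte write per element
    double bandwidth_gb_s = (2.0 * N * sizeof(double)) / (time_seconds * 1e9);
    std::printf("Effective bandwidth: %.1f GB/s\n", bandwidth_gb_s);
}
```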
Guidelines for Writing Performant Code
These guidelines help you write code that performs well on real hardware. Following them can mean the difference between code that runs in seconds and code that runs in hours!
For CPUs
- Access data sequentially when possible - The cache is your friend for sequential patterns
- Iterate over fastest-changing dimension last (for row-major arrays) - This matches how data is stored in memory
- Avoid random memory access - Unpredictable patterns defeat the hardware’s prefetching mechanisms
- Consider blocking/tiling for large arrays - Keep frequently-used data in the fast L1/L2 caches (see the sketch after this list)
- Reuse data within inner loops - Each byte of data should be used multiple times after loading
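As an example of blocking, here is a sketch of a tiled matrix transpose: the naive version necessarily strides through one of the two matrices, while processing BLOCK × BLOCK tiles keeps the working set of both matrices in cache (BLOCK is a tunable assumption):

```cpp
#include <algorithm>
#include <cstddef>

constexpr int BLOCK = 64;  // Tune so 2 * BLOCK^2 doubles fit in L1/L2

// Transpose the M x N row-major matrix A into the N x M matrix At.
void transpose_blocked(const double* A, double* At, int M, int N) {
    for (int ii = 0; ii < M; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < std::min(ii + BLOCK, M); i++)
                for (int j = jj; j < std::min(jj + BLOCK, N); j++)
                    At[(std::size_t)j * M + i] = A[(std::size_t)i * N + j];
}
```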
For GPUs
- Ensure coalesced memory access - All threads in a warp should access nearby memory; this is critical for performance
- Use unit stride when possible (thread i accesses element i) - This is the natural coalescing pattern
- Avoid indirect access - If you must use it, try to restructure your data to make accesses more predictable
- Use shared memory (advanced) - Cache frequently accessed data locally to reduce global memory traffic
- Ensure sufficient parallelism - Thousands of threads allow the GPU to hide memory latency by context-switching
For Portable Code (Both CPU and GPU with Kokkos)
- Use contiguous data structures (Kokkos::View) instead of pointers or STL containers - Kokkos can manage memory layout optimally
- Let Kokkos choose memory layout - The default is chosen per memory space (LayoutRight on CPU, LayoutLeft on GPU) and works well on both; only override it if profiling shows it’s needed
- Write kernels that work well with both - Sequential-access CPU code often coalesces fine on GPU naturally
- Profile on both platforms - Performance characteristics can differ significantly; what’s fast on CPU may be slow on GPU
- Use Kokkos execution space abstractions - This lets the same code run on different hardware without modification, as in the sketch below
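As a closing sketch of that last point: the kernel below names no execution space at all, so the same source runs on whatever backend Kokkos was configured with (OpenMP, CUDA, HIP, …):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1000000;
        Kokkos::View<double*> data("data", N);
        Kokkos::View<double*> result("result", N);

        // No Kokkos::Cuda / Kokkos::OpenMP anywhere: this runs on the
        // DefaultExecutionSpace chosen when Kokkos was built.
        Kokkos::parallel_for("scale", N, KOKKOS_LAMBDA(int i) {
            result(i) = data(i) * 2.0;
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
}
```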