Memory Performance: Caching and Coalescence
Overview
Modern CPUs and GPUs are incredibly fast at computation, but memory access is often the bottleneck in real applications. A single floating-point operation might complete in 1-2 nanoseconds, but fetching data from main memory can take hundreds of nanoseconds. This lecture explains why memory layout matters for performance and how CPUs and GPUs handle memory access differently through caching and memory coalescing.
By understanding these concepts, you’ll be able to:
- Write code that’s cache-friendly on CPUs
- Write code that uses GPU memory efficiently (coalescence)
- Understand why the same algorithm can have vastly different performance on different hardware
Why Memory Matters
Let’s start with a concrete example. Consider this simple loop:
```cpp
// Bad: random memory access
for (int i = 0; i < 1000000; i++) {
    int idx = rand() % N;  // Random index
    sum += data[idx];
}

// Good: sequential memory access
for (int i = 0; i < N; i++) {
    sum += data[i];
}
```
Both loops do roughly the same amount of work (summing elements), but the second is typically 10-100x faster because of how its memory accesses are organized. This difference comes from caching and memory hierarchies.
Memory Hierarchy
Modern computers have multiple levels of memory, each with different sizes and speeds:
```
Tiny (Bytes)             Very Fast (1-2 ns)     CPU Registers
          ↓
Small (KBs)              Fast (3-5 ns)          L1 Cache
          ↓
Medium (100s of KB)      Medium (12-20 ns)      L2 Cache
          ↓
Large (MBs)              Slower (40-75 ns)      L3 Cache (Shared)
          ↓
Very Large (100s of GB)  Slow (100-300 ns)      Main Memory (RAM)
          ↓
Enormous (TBs)           Very Slow (ms)         Storage (Disk/SSD)
```
On CPUs, caching happens automatically. The hardware detects memory access patterns and keeps frequently accessed data in fast caches. But this automatic caching is limited—if you access data in a random pattern, the caches can’t help.
CPU Cache: Understanding Cache Lines
CPUs fetch memory in small chunks called cache lines (typically 64 bytes). When you access a single byte, the CPU fetches the entire cache line into the cache.
Here’s what happens:
```
Memory Layout: [data[0]][data[1]][data[2]][data[3]] ... [data[15]] ...
               ↑────────────── One cache line (64 bytes) ──────────↑
```
If you access data[0], you automatically get data[0] through data[15] (for 4-byte integers) in the cache.
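As a quick sanity check on that claim, here is the arithmetic as a minimal sketch (64-byte lines are typical for x86; the exact size is hardware-dependent):

```cpp
#include <cstddef>

// Typical cache line size; hardware-dependent (64 bytes on most x86 CPUs).
constexpr std::size_t cache_line_bytes = 64;

// With 4-byte ints, one line holds data[0] through data[15]:
constexpr std::size_t ints_per_line    = cache_line_bytes / sizeof(int);     // 16
constexpr std::size_t doubles_per_line = cache_line_bytes / sizeof(double);  // 8
```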
Sequential Access (Cache-Friendly)
```cpp
for (int i = 0; i < N; i++) {
    sum += data[i];
}
```
What happens:
1. Access data[0] → Load the entire cache line into the cache
2. Access data[1] through data[15] → Already in cache! Super fast (cache hit)
3. Access data[16] → Load a new cache line
4. Continue…
Result: Cache hit rate is ~95%. Most accesses are fast because neighbors are already loaded.
Random Access (Cache-Unfriendly)
```cpp
for (int i = 0; i < 1000000; i++) {
    int idx = random_index();
    sum += data[idx];
}
```
What happens:
1. Access a random location → Load its cache line
2. Access another random location → Usually NOT in cache; load a new cache line
3. Repeat…
Result: Cache miss rate is ~95%. Almost every access requires loading from main memory (slow!).
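If you want to observe the gap yourself, here is a minimal, self-contained benchmark sketch (the size `N` and the shuffled-index setup are choices made here, not part of the lecture; the exact ratio varies by machine and compiler flags):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;  // ~16M ints (~64 MB), larger than L3
    std::vector<int> data(N, 1);

    // Pre-generate a random permutation so index generation isn't timed.
    std::vector<std::size_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937{42});

    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) sum += data[i];       // sequential
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) sum += data[idx[i]];  // random
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sum=%lld sequential=%lld ms random=%lld ms\n", sum,
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}
```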
CPU Cache Optimization: Multi-Dimensional Arrays
For multi-dimensional arrays, the layout in memory determines performance:
Row-Major Layout (C/C++ default)
```cpp
// C++ array layout in memory (row-major):
// A[0][0], A[0][1], A[0][2], ..., A[1][0], A[1][1], ...
double A[M][N];

// GOOD: Access row-by-row (sequential in memory)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A[i][j];  // Sequential access!
    }
}

// BAD: Access column-by-column (non-sequential in memory)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += A[i][j];  // Jumps around memory
    }
}
```
The first version is often 10x faster because row-by-row traversal follows the memory layout. The second jumps by N * sizeof(double) bytes on each access, missing the cache repeatedly.
Actual Performance
On a modern CPU with a matrix of 10,000 × 10,000 doubles (800 MB):
- Row-major access: ~2 seconds
- Column-major access: ~20 seconds
- Difference: 10x speedup just from memory layout!
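Here is a sketch of how you might reproduce a comparison like this (the matrix is smaller here so it allocates quickly; absolute times are machine-dependent):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int M = 8192, N = 8192;  // 64M doubles = 512 MB
    std::vector<double> A((std::size_t)M * N, 1.0);
    double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < M; i++)    // row-major order: stride 1
        for (int j = 0; j < N; j++) sum += A[(std::size_t)i * N + j];
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < N; j++)    // column order: stride of N doubles
        for (int i = 0; i < M; i++) sum += A[(std::size_t)i * N + j];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sum=%.0f row-major=%lld ms column=%lld ms\n", sum,
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}
```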
GPU Memory: Coalescence
GPUs handle memory very differently from CPUs. GPUs do have caches, but they are small relative to the number of threads in flight and far less effective at hiding latency than CPU caches. Instead, GPUs rely primarily on memory coalescing.
What is Memory Coalescence?
When multiple GPU threads access memory simultaneously, the GPU groups these accesses together into a single transaction. This is efficient when: 1. Threads access consecutive memory addresses 2. The memory access pattern is predictable
Think of it like a bank teller: serving customers one-by-one is slow, but if multiple customers line up in order (addresses 0, 1, 2, … in sequence), the teller can process them all in one batch efficiently (coalesced access).
Coalesced Access (GPU-Friendly)
```cpp
// GPU kernel - all threads access consecutive memory
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    result[i] = data[i] * 2.0;  // Thread i accesses data[i]
});
```
What happens:
1. Thread 0 wants data[0]
2. Thread 1 wants data[1]
3. Thread 2 wants data[2]
4. … Thread 31 wants data[31]
5. The GPU groups these 32 requests into a single memory transaction containing all 32 values
6. Result: 32 memory accesses served in 1 transaction (32x efficiency!)
Non-Coalesced Access (GPU-Unfriendly)
```cpp
// AVOID: Threads access scattered memory
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    int idx = (i * STRIDE) % N;  // Large stride causes poor coalescence
    result[i] = data[idx] * 2.0;
});
```
What happens:
1. Thread 0 wants data[0]
2. Thread 1 wants data[STRIDE]
3. Thread 2 wants data[2*STRIDE]
4. … the addresses are far apart
5. The GPU can’t combine these into one transaction
6. Result: each access needs its own transaction; GPU memory bandwidth is wasted
Memory Layout for GPUs
Similar to CPUs, memory layout matters on GPUs:
Good (contiguous data). Note that summing is a reduction, so `parallel_reduce` is used rather than `parallel_for` to avoid a data race on `sum`:
```cpp
Kokkos::View<double*> data("data", N);  // Contiguous memory
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += data[i];  // Coalesced access
}, sum);
```
Bad (scattered data):
```cpp
std::vector<double*> pointers;  // Array of pointers to scattered memory
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += *pointers[i];  // Indirect access, hard to coalesce
    // (A std::vector of host pointers won't even work in GPU device code;
    //  it is shown here only to illustrate the scattered-access pattern.)
}, sum);
```
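If your data starts out behind pointers or in STL containers, a common remedy is to pack it once into a contiguous View on the host and copy it to the device in one transfer. The sketch below assumes the data can be gathered into a `std::vector` first; `create_mirror_view_and_copy` is standard Kokkos:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

double sum_on_device(const std::vector<double>& host_data) {
    // Pack into a contiguous host View...
    Kokkos::View<double*, Kokkos::HostSpace> h_view("h_view", host_data.size());
    for (std::size_t i = 0; i < host_data.size(); i++) h_view(i) = host_data[i];

    // ...then one bulk transfer to device memory.
    auto d_view = Kokkos::create_mirror_view_and_copy(
        Kokkos::DefaultExecutionSpace{}, h_view);

    double sum = 0.0;
    Kokkos::parallel_reduce(host_data.size(),
        KOKKOS_LAMBDA(int i, double& local_sum) {
            local_sum += d_view(i);  // Contiguous, coalesced reads
        }, sum);
    return sum;
}
```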
CPU vs. GPU Memory Requirements
Here’s a summary of how CPUs and GPUs differ:
| Aspect | CPU | GPU |
|---|---|---|
| Memory Access Latency | High (100-300 ns) | Very High (400-1000 ns) |
| Cache Strategy | Automatic caching (large caches) | Manual coalescing (minimal cache) |
| Optimal Pattern | Sequential or temporal locality | Coalesced (all threads access nearby addresses) |
| Array Layout | Row-major: Access rows sequentially | Coalesced: Threads map to contiguous memory |
| Stride Tolerance | Can handle large strides (cache takes care of it) | Stride of 1 is best; larger strides kill performance |
| Bandwidth | Moderate (50-200 GB/s) | Very High (500+ GB/s) when coalesced |
Practical Example: Matrix Multiplication
Let’s see how memory layout affects a real algorithm:
CPU-Optimized Version
```cpp
// CPU version: iterate in cache-friendly order
for (int i = 0; i < M; i++) {
    for (int j = 0; j < K; j++) {
        double a_ij = A[i][j];
        for (int k = 0; k < N; k++) {
            C[i][k] += a_ij * B[j][k];  // Accesses C and B sequentially
        }
    }
}
```
Why is this cache-friendly?
- The inner loop accesses B[j][k] and C[i][k] with stride 1 (sequential)
- Data from one cache line is reused multiple times
- The hardware prefetcher can fetch upcoming cache lines
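For contrast, here is the textbook ordering with the inner two loops swapped, so the innermost loop walks B down a column (stride N in memory); on large matrices this is typically several times slower:

```cpp
// Naive ordering: the inner loop strides through B column-wise
for (int i = 0; i < M; i++) {
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int j = 0; j < K; j++) {
            sum += A[i][j] * B[j][k];  // consecutive j: jumps N doubles apart in B
        }
        C[i][k] = sum;
    }
}
```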
GPU-Optimized Version
```cpp
// GPU version: organize computation for coalescence
Kokkos::View<double**> A("A", M, K);
Kokkos::View<double**> B("B", K, N);
Kokkos::View<double**> C("C", M, N);

// Tile-based approach for better memory access patterns
// (Note: Kokkos::Cuda here specifies an NVIDIA GPU; for AMD use Kokkos::HIP.
//  This sketch assumes M and N are multiples of tile_size; otherwise clamp
//  the tile bounds.)
int tile_size = 32;  // Typical block size
for (int i_tile = 0; i_tile < M; i_tile += tile_size) {
    for (int j_tile = 0; j_tile < N; j_tile += tile_size) {
        // Compute one tile of C with many threads
        Kokkos::parallel_for(
            Kokkos::MDRangePolicy<Kokkos::Cuda, Kokkos::Rank<2>>(
                {i_tile, j_tile}, {i_tile + tile_size, j_tile + tile_size}),
            KOKKOS_LAMBDA(int i, int j) {
                double sum = 0.0;
                for (int k = 0; k < K; k++) {
                    sum += A(i, k) * B(k, j);
                }
                C(i, j) += sum;
            });
    }
}
```
Why is this good for the GPU?
- Threads in a block compute nearby output elements
- All threads with the same i read row i of A, so those loads are shared across threads
- Threads with consecutive j read nearby elements of B (near-coalesced)
- Multiple threads reuse the same data, reducing bandwidth requirements
Memory Layout in Kokkos
Kokkos provides different layouts to control memory organization:
Layout Right (Row-Major, CPU Default)
```cpp
Kokkos::View<double**, Kokkos::LayoutRight> A("A", M, N);
// Memory layout: [A(0,0)][A(0,1)]...[A(0,N-1)][A(1,0)][A(1,1)]...
// Good for: sequential row access
// Note: LayoutRight is the default for Views in host (CPU) memory
```
Layout Left (Column-Major, GPU Default)
```cpp
Kokkos::View<double**, Kokkos::LayoutLeft> B("B", M, N);
// Memory layout: [B(0,0)][B(1,0)]...[B(M-1,0)][B(0,1)][B(1,1)]...
// Good for: sequential column access, Fortran compatibility
// Note: LayoutLeft is the default for Views in GPU memory, because mapping
// the thread index to the first dimension then gives coalesced access
```
Choosing the Right Layout
- In most cases, don’t specify a layout at all: Kokkos picks the default suited to each memory space (LayoutRight on CPU, LayoutLeft on GPU)
- Specify Layout Left explicitly if your algorithm naturally accesses columns (rare), or when interfacing with Fortran code
```cpp
// CPU-friendly: row-major traversal on the host
Kokkos::View<double**> A("A", M, N);
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A(i, j);
    }
}

// GPU-friendly: ensure coalesced access through kernel design
Kokkos::View<double**> B("B", M, N);
Kokkos::View<double**> result("result", M, N);
Kokkos::parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Cuda, Kokkos::Rank<2>>({0, 0}, {M, N}),
    KOKKOS_LAMBDA(int i, int j) {
        result(i, j) = B(i, j) * 2.0;
    });
```
Common Pitfalls
1. Column-Wise Access on CPU
```cpp
// ❌ WRONG: Column-wise access on CPU (cache misses)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += A[i][j];  // Jumps N*sizeof(double) bytes each time
    }
}

// ✓ CORRECT: Row-wise access on CPU
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += A[i][j];  // Sequential memory access
    }
}
```
2. Large Strides on GPU
```cpp
// ❌ WRONG: Large strides prevent coalescing
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    int idx = i * large_stride;  // Thread i accesses far-apart addresses
    result[i] = data[idx];
});

// ✓ CORRECT: Unit stride for coalescing
Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) {
    result[i] = data[i];  // Thread i accesses data[i], perfectly coalesced
});
```
3. Indirect Memory Access
```cpp
// ❌ WRONG: Unpredictable access pattern
double sum = 0.0;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    int idx = indices[i];    // idx values are effectively random
    local_sum += data[idx];  // Scattered reads: poor coalescing
}, sum);

// ✓ BETTER: Predictable access pattern, or restructure the data
// (sort the indices first, or use a data layout that matches the access pattern)
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(int i, double& local_sum) {
    local_sum += data[i];    // Sequential, coalesced access
}, sum);
```
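When the indirection can’t be removed, sorting the indices can still recover locality. For a reduction the visit order doesn’t matter, so a one-time host-side sort makes the device reads mostly sequential. This is a sketch; the function and variable names are illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <algorithm>
#include <vector>

double gather_sum(std::vector<int> indices, Kokkos::View<double*> data) {
    // One-time host-side cost; amortized if the gather runs many times.
    std::sort(indices.begin(), indices.end());

    Kokkos::View<int*, Kokkos::HostSpace> h_idx("h_idx", indices.size());
    for (std::size_t i = 0; i < indices.size(); i++) h_idx(i) = indices[i];
    auto d_idx = Kokkos::create_mirror_view_and_copy(
        Kokkos::DefaultExecutionSpace{}, h_idx);

    double sum = 0.0;
    Kokkos::parallel_reduce(indices.size(),
        KOKKOS_LAMBDA(int i, double& local_sum) {
            local_sum += data(d_idx(i));  // Mostly increasing addresses
        }, sum);
    return sum;
}
```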
Measuring Memory Performance
CPU: Cache Misses
Use profiling tools to measure cache behavior:
```bash
# Linux: Use perf to measure cache misses
perf stat -e cache-references,cache-misses ./program

# Output might show:
#   1,234,567  cache-references
#     123,456  cache-misses    # 10.0% of all cache refs
```
Lower cache-miss percentage is better. Aim for < 5%.
GPU: Memory Bandwidth
Check if your GPU code is bandwidth-bound:
```cpp
// Measure elapsed time and data transferred
// (the factor of 2 counts one read plus one write per element)
double bandwidth_gb_s = (2.0 * N * sizeof(double)) / (time_seconds * 1e9);

// Compare to peak memory bandwidth:
//   A100:     ~2 TB/s peak
//   RTX 3090: ~936 GB/s peak
// If measured << peak, your code is probably not coalescing well
```
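Here is a sketch of how to obtain `time_seconds` for the copy-scale kernel from earlier, using `Kokkos::Timer`. The `Kokkos::fence()` matters: without it you would time only the asynchronous kernel launch:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

void measure_bandwidth(int N) {
    Kokkos::View<double*> data("data", N), result("result", N);

    Kokkos::Timer timer;
    Kokkos::parallel_for(N, KOKKOS_LAMBDA(int i) { result(i) = data(i) * 2.0; });
    Kokkos::fence();  // Wait for the (asynchronous) kernel to finish
    double time_seconds = timer.seconds();

    // One 8-byte read + one 8-byte write per element
    double bandwidth_gb_s = (2.0 * N * sizeof(double)) / (time_seconds * 1e9);
    std::printf("Effective bandwidth: %.1f GB/s\n", bandwidth_gb_s);
}
```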
Guidelines for Writing Performant Code
These guidelines help you write code that performs well on real hardware. Following them can mean the difference between code that runs in seconds and code that runs in hours!
For CPUs
- Access data sequentially when possible - The cache is your friend for sequential patterns
- Iterate over fastest-changing dimension last (for row-major arrays) - This matches how data is stored in memory
- Avoid random memory access - Unpredictable patterns defeat the hardware’s prefetching mechanisms
- Consider blocking/tiling for large arrays - Keep frequently-used data in the fast L1/L2 caches (see the sketch after this list)
- Reuse data within inner loops - Each byte of data should be used multiple times after loading
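As an example of blocking, here is a sketch of a tiled matrix transpose: the naive version necessarily strides through one of the two matrices, while processing BLOCK × BLOCK tiles keeps the working set of both matrices in cache (BLOCK is a tunable assumption):

```cpp
#include <algorithm>
#include <cstddef>

constexpr int BLOCK = 64;  // Tune so 2 * BLOCK^2 doubles fit in L1/L2

// Transpose the M x N row-major matrix A into the N x M matrix At.
void transpose_blocked(const double* A, double* At, int M, int N) {
    for (int ii = 0; ii < M; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < std::min(ii + BLOCK, M); i++)
                for (int j = jj; j < std::min(jj + BLOCK, N); j++)
                    At[(std::size_t)j * M + i] = A[(std::size_t)i * N + j];
}
```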
For GPUs
- Ensure coalesced memory access - All threads in a warp should access nearby memory; this is critical for performance
- Use unit stride when possible (thread i accesses element i) - This is the natural coalescing pattern
- Avoid indirect access - If you must use it, try to restructure your data to make accesses more predictable
- Use shared memory (advanced) - Cache frequently accessed data locally to reduce global memory traffic
- Ensure sufficient parallelism - Thousands of threads allow the GPU to hide memory latency by context-switching
For Portable Code (Both CPU and GPU with Kokkos)
- Use contiguous data structures (Kokkos::View) instead of pointers or STL containers - Kokkos can manage memory layout optimally
- Let Kokkos choose memory layout - The default is chosen per memory space (LayoutRight on CPU, LayoutLeft on GPU) and works well on both; only override it if profiling shows it’s needed
- Write kernels that work well with both - Sequential-access CPU code often coalesces fine on GPU naturally
- Profile on both platforms - Performance characteristics can differ significantly; what’s fast on CPU may be slow on GPU
- Use Kokkos execution space abstractions - This lets the same code run on different hardware without modification, as in the sketch below
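As a closing sketch of that last point: the kernel below names no execution space at all, so the same source runs on whatever backend Kokkos was configured with (OpenMP, CUDA, HIP, …):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1000000;
        Kokkos::View<double*> data("data", N);
        Kokkos::View<double*> result("result", N);

        // No Kokkos::Cuda / Kokkos::OpenMP anywhere: this runs on the
        // DefaultExecutionSpace chosen when Kokkos was built.
        Kokkos::parallel_for("scale", N, KOKKOS_LAMBDA(int i) {
            result(i) = data(i) * 2.0;
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
}
```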