CUDA Learning Tutorial Series
A structured, hands-on curriculum for learning GPU programming with CUDA — from thread hierarchy to real ML kernel optimization. Built to fill the gap between documentation and production-grade GPU code.
Problem Statement
CUDA documentation is dense, and most tutorials leap from "hello world" to advanced kernels with little scaffolding. For ML engineers who need to write or optimize GPU kernels — not just call cuBLAS — there's a missing middle layer of practical, ML-relevant examples. This project builds that layer: a progressive series that teaches GPU programming through problems you'd actually encounter in ML systems work.
Why This Exists
- Working on ML systems at Freshriver.ai exposed how wide the gap is between CUDA documentation and production-grade GPU code.
- Access to NYU's 8/16-GPU cluster on repurposed Meta server hardware created a real environment to test and profile kernels at scale.
- Built the curriculum I wished existed when I started — progressive, ML-relevant, and grounded in actual profiling data from Nsight Compute.
Curriculum Structure
- Module 1 — Thread blocks, grids, and the SIMT execution model
- Module 2 — Memory hierarchy: global, shared memory, registers, L1/L2 cache
- Module 3 — Matrix multiply from scratch: naive global memory → tiled shared memory
- Module 4 — Warp divergence, occupancy analysis, and Nsight Compute profiling
- Module 5 — ML kernel optimization: attention from scratch and FlashAttention concepts
Key insight: moving data from global to shared memory is the primary optimization lever for memory-bound kernels.
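As a reference point for the Module 3 progression, the naive starting kernel reads every operand straight from global memory. A minimal sketch (kernel name and launch configuration are illustrative, not from the curriculum itself):

```cuda
// Naive matmul: each thread computes one element of C, reading A and B
// directly from global memory -- N global loads of A and B per output element.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
// Example launch for an N x N problem:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, N);
```

Profiling this kernel in Nsight Compute shows it is memory-bound, which motivates the tiled shared-memory version covered later in the module.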
Key Concepts Covered
- SIMT execution model — how 32 threads execute in lockstep as a warp
- Shared memory — bank conflicts, tiling strategies for matrix ops
- Warp divergence — how divergent branches serialize execution within a warp, and how to restructure code to avoid them
- Occupancy analysis — achieved vs theoretical occupancy, register pressure, shared memory limits
- CUDA streams and async execution — overlapping compute and memory transfers with cudaMemcpyAsync
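The streams bullet above can be sketched as a simple two-stream pipeline that overlaps host-to-device copies, kernel work, and device-to-host copies. This is a hedged sketch: `process`, `nChunks`, `chunk`, and the host/device pointers are hypothetical names, error checking is omitted, and true async copies require pinned host memory (`cudaMallocHost`):

```cuda
// Double-buffered pipeline: while one stream's kernel runs,
// the other stream's cudaMemcpyAsync transfers the next chunk.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2];  // alternate streams per chunk
    cudaMemcpyAsync(dIn + c * chunk, hIn + c * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<blocks, threads, 0, st>>>(dIn + c * chunk, dOut + c * chunk, chunk);
    cudaMemcpyAsync(hOut + c * chunk, dOut + c * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();  // wait for all streams to drain
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
```

Operations issued to the same stream execute in order, so the copy-kernel-copy dependency within each chunk is preserved while different chunks overlap across streams.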
```cuda
#define TILE 16  // tile width; launch with blockDim = (TILE, TILE)

// Each block cooperatively loads a TILE x TILE tile of A and B into
// shared memory, then accumulates partial dot products from shared
// memory instead of global memory. Assumes N is a multiple of TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE][TILE], tileB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait for all threads to finish loading the tile
        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // guard before the next tile load overwrites shared memory
    }
    C[row * N + col] = sum;
}
```