CUDA Learning Tutorial Series
A structured, hands-on curriculum for learning GPU programming with CUDA — from thread hierarchy to real ML kernel optimization. Built to fill the gap between documentation and production-grade GPU code.
Problem Statement
CUDA documentation is dense, and most tutorials leap from "hello world" to advanced kernels with little scaffolding. For ML engineers who need to write or optimize GPU kernels — not just call cuBLAS — there's a missing middle layer of practical, ML-relevant examples. This project builds that layer: a progressive series that teaches GPU programming through problems you'd actually encounter in ML systems work.
Why This Exists
- Working on ML systems at Freshriver.ai exposed how wide the gap is between CUDA documentation and production-grade GPU code.
- Access to NYU's 8/16-GPU cluster on repurposed Meta server hardware created a real environment to test and profile kernels at scale.
- Built the curriculum I wished existed when I started — progressive, ML-relevant, and grounded in actual profiling data from Nsight Compute.
Curriculum Structure
- Module 1 — Thread blocks, grids, and the SIMT execution model
- Module 2 — Memory hierarchy: global, shared memory, registers, L1/L2 cache
- Module 3 — Matrix multiply from scratch: naive global memory → tiled shared memory
- Module 4 — Warp divergence, occupancy analysis, and Nsight Compute profiling
- Module 5 — ML kernel optimization: attention from scratch and FlashAttention concepts
Key insight: moving data from global to shared memory is the primary optimization lever for memory-bound kernels.
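As a reference point for the Module 3 progression, the naive starting kernel reads every operand straight from global memory. A minimal sketch (kernel name and launch configuration are illustrative, not from the curriculum itself):

```cuda
// Naive matmul: each thread computes one element of C, reading A and B
// directly from global memory -- N global loads of A and B per output element.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
// Example launch for an N x N problem:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, N);
```

Profiling this kernel in Nsight Compute shows it is memory-bound, which motivates the tiled shared-memory version covered later in the module.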
Key Concepts Covered
- SIMT execution model — how 32 threads execute in lockstep as a warp
- Shared memory — bank conflicts, tiling strategies for matrix ops
- Warp divergence — how divergent branches serialize execution within a warp, and how to restructure code to avoid them
- Occupancy analysis — achieved vs theoretical occupancy, register pressure, shared memory limits
- CUDA streams and async execution — overlapping compute and memory transfers with cudaMemcpyAsync
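The streams bullet above can be sketched as a simple two-stream pipeline that overlaps host-to-device copies, kernel work, and device-to-host copies. This is a hedged sketch: `process`, `nChunks`, `chunk`, and the host/device pointers are hypothetical names, error checking is omitted, and true async copies require pinned host memory (`cudaMallocHost`):

```cuda
// Double-buffered pipeline: while one stream's kernel runs,
// the other stream's cudaMemcpyAsync transfers the next chunk.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2];  // alternate streams per chunk
    cudaMemcpyAsync(dIn + c * chunk, hIn + c * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<blocks, threads, 0, st>>>(dIn + c * chunk, dOut + c * chunk, chunk);
    cudaMemcpyAsync(hOut + c * chunk, dOut + c * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();  // wait for all streams to drain
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
```

Operations issued to the same stream execute in order, so the copy-kernel-copy dependency within each chunk is preserved while different chunks overlap across streams.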
```cuda
#define TILE 16  // tile width; launch with blockDim = (TILE, TILE)

// Each block cooperatively loads a TILE x TILE tile of A and B into
// shared memory, then accumulates partial dot products from shared
// memory instead of global memory. Assumes N is a multiple of TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE][TILE], tileB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait for all threads to finish loading the tile
        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // guard before the next tile load overwrites shared memory
    }
    C[row * N + col] = sum;
}
```