Projects
ML systems, GPU programming, and quantitative finance research.
Focused on performance, depth, and measurable results.
Microsecond-latency limit orderbook engine with an integrated ML signal layer, built in C++17. The matching engine and inference pipeline share the same memory space — no serialization overhead, no Python GIL.
A structured, hands-on GPU programming curriculum that bridges the gap between "hello world" and advanced ML kernel optimization. Covers SIMT execution, shared memory, warp divergence, and matrix multiply from scratch.
Enabled NYU access to 8- and 16-GPU AI training clusters built on repurposed Meta server hardware. Designed the workload-scheduling architecture and documented the infrastructure for student researchers.
Currently building
The C++ orderbook and CUDA tutorial series are both active. Case studies will be updated with benchmarks, architecture diagrams, and code walkthroughs as they ship.
View GitHub