
C++ Optimized ML Orderbook

High-performance limit orderbook engine with integrated ML signal layer — built in C++17 for microsecond-latency market simulation. The matching engine and ML inference live in the same memory space.

Stack: C++17, CUDA, Python, CMake
Status: In Progress

Problem Statement

Traditional backtesting frameworks are bottlenecked by Python's GIL and interpreter overhead. Real-world HFT systems operate at sub-millisecond latency, and a pure Python orderbook cannot reach that range. This project explores how far orderbook performance can be pushed when ML inference runs next to the matching engine in the same memory space, with no serialization boundary between them.

Architecture

System Architecture (diagram stages)
Order Ingestion: lock-free SPSC queue, atomic compare-and-swap
Matching Engine: price-level sorted maps; arena allocator, zero-heap hot path
ML Signal Layer: LGBM via ONNX Runtime; <2μs inference, same memory space
Decision Gate: ML signal gates order placement based on predicted short-horizon mid-price direction

  • Price-level sorted maps with custom arena allocators — zero heap allocation on the order book hot path
  • ML component: LightGBM model trained on L2 order book features, loaded and served via ONNX Runtime C++ API
  • ML signal feeds a pre-submission decision layer: predicted mid-price direction gates whether an order is placed or held
  • Benchmarked with micro-benchmarks timing individual match cycles; profiled with perf and Valgrind/callgrind
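
The price-level book and the pre-submission gate described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the names `Book`, `RestingOrder`, and `should_submit` are hypothetical, prices are integer ticks rather than `double`, the signal is assumed to be a direction score in [-1, 1], and standard containers with default allocators stand in for the arena-backed structures, so this version does allocate on the heap.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>

enum class Side : uint8_t { Buy, Sell };

// Hypothetical resting-order record (assumption: integer tick prices as map keys).
struct RestingOrder { uint64_t id; uint32_t qty; };

class Book {
  // Bids sorted high-to-low, asks low-to-high, so begin() is always the best price.
  std::map<int64_t, std::deque<RestingOrder>, std::greater<int64_t>> bids_;
  std::map<int64_t, std::deque<RestingOrder>> asks_;

  // Consumes liquidity from the best level while the incoming price crosses it.
  template <class Levels, class Crosses>
  uint32_t match(Levels& levels, uint32_t qty, Crosses crosses) {
    uint32_t filled = 0;
    while (qty > 0 && !levels.empty() && crosses(levels.begin()->first)) {
      auto& queue = levels.begin()->second;  // FIFO within a price level
      RestingOrder& top = queue.front();
      const uint32_t take = top.qty < qty ? top.qty : qty;
      top.qty -= take;
      qty -= take;
      filled += take;
      if (top.qty == 0) queue.pop_front();
      if (queue.empty()) levels.erase(levels.begin());
    }
    return filled;
  }

public:
  // Matches an incoming limit order against the opposite side, rests any
  // remainder at its limit price, and returns the filled quantity.
  uint32_t submit(uint64_t id, Side side, int64_t price, uint32_t qty) {
    assert(qty > 0);
    uint32_t filled = 0;
    if (side == Side::Buy) {
      filled = match(asks_, qty, [price](int64_t ask) { return ask <= price; });
      if (filled < qty) bids_[price].push_back({id, qty - filled});
    } else {
      filled = match(bids_, qty, [price](int64_t bid) { return bid >= price; });
      if (filled < qty) asks_[price].push_back({id, qty - filled});
    }
    return filled;
  }

  bool has_bid() const { return !bids_.empty(); }
  bool has_ask() const { return !asks_.empty(); }
  int64_t best_bid() const { return bids_.begin()->first; }  // requires has_bid()
  int64_t best_ask() const { return asks_.begin()->first; }  // requires has_ask()
};

// Hypothetical gate: a buy is placed only when the predicted move is up by at
// least `threshold`, a sell only when it is down by at least `threshold`.
inline bool should_submit(Side side, double signal, double threshold) {
  return side == Side::Buy ? signal >= threshold : signal <= -threshold;
}
```

A caller would check `should_submit(side, signal, threshold)` first and only then call `Book::submit`; an order that fails the gate is simply held.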

Key Design Decisions

  • C++17 over Rust or Go — STL ecosystem compatibility (std::map, STL algorithms) and seamless integration with ONNX Runtime's C++ API; no FFI overhead
  • Memory management — RAII throughout; an arena allocator for order objects and a custom pool allocator for price-level nodes eliminate allocation jitter
  • ML model choice — LGBM over neural nets: tree inference is deterministic and cache-friendly, consistently under 2μs, unlike transformer-style inference
  • Concurrency — lock-free SPSC queue for order ingestion feeds a single-writer order book; avoids mutex contention on the critical path
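
A bounded single-producer/single-consumer ring buffer of the kind named in the concurrency decision might look like the sketch below. It is an illustration, not the project's queue: `SpscQueue`, `try_push`, and `try_pop` are hypothetical names, and the element type is assumed to be trivially copyable. With exactly one producer and one consumer, acquire/release loads and stores on the two indices are sufficient, so this version uses no compare-and-swap.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <type_traits>

// Bounded SPSC ring buffer. Indices grow monotonically and are masked on access.
template <class T, std::size_t Capacity>
class SpscQueue {
  static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                "Capacity must be a power of two");
  static_assert(std::is_trivially_copyable_v<T>,
                "sketch assumes trivially copyable elements");

  std::array<T, Capacity> slots_{};
  alignas(64) std::atomic<std::size_t> head_{0};  // next slot to read; written by consumer
  alignas(64) std::atomic<std::size_t> tail_{0};  // next slot to write; written by producer

public:
  // Producer thread only. Returns false when the queue is full.
  bool try_push(const T& value) noexcept {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    assert(tail - head <= Capacity);
    if (tail - head == Capacity) return false;
    slots_[tail & (Capacity - 1)] = value;
    tail_.store(tail + 1, std::memory_order_release);  // publish the filled slot
    return true;
  }

  // Consumer thread only. Returns std::nullopt when the queue is empty.
  std::optional<T> try_pop() noexcept {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;
    T value = slots_[head & (Capacity - 1)];
    head_.store(head + 1, std::memory_order_release);  // hand the slot back
    return value;
  }
};
```

The two indices sit on separate 64-byte-aligned cache lines so the producer and consumer do not invalidate each other's line on every operation (64 bytes is an assumed cache-line size).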
Code Highlight — Arena Allocator Hot Path
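// Zero-allocation order insertion via a fixed-size ring arena
#include <cstddef>
#include <cstdint>
#include <new>

enum class Side : uint8_t { Buy, Sell };
struct Order { uint64_t id; double price; uint32_t qty; Side side; };

class ArenaAllocator {
  // Assumed sizing: slot count is a power of two so MASK wraps the index.
  static constexpr size_t SLOTS = 1u << 16;
  static constexpr size_t MASK = SLOTS - 1;
  static constexpr size_t POOL_SIZE = SLOTS * sizeof(Order);
  alignas(64) uint8_t buf_[POOL_SIZE];
  size_t offset_ = 0;
public:
  // Placement-new into the next slot; after SLOTS calls the oldest slot is reused.
  Order* alloc() noexcept {
    return new (buf_ + (offset_++ & MASK) * sizeof(Order)) Order{};
  }
};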
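
The ring arena above hands out slots in order and never takes one back explicitly. For the price-level nodes, which the design notes place in a separate pool allocator, a free-list pool is one common shape. The sketch below is an assumption about how such a pool could look, not the project's code; `Pool`, `LevelNode`, `create`, and `destroy` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>

// Hypothetical price-level node; the real node layout is not shown in the source.
struct LevelNode { int64_t price; uint32_t total_qty; };

// Fixed-capacity pool with an intrusive free list: no heap traffic after construction.
template <class T, std::size_t N>
class Pool {
  union Slot {
    Slot* next;                                   // used while the slot is free
    alignas(T) unsigned char storage[sizeof(T)];  // used while the slot holds a T
  };
  Slot slots_[N];
  Slot* free_ = nullptr;

public:
  Pool() noexcept {
    // Thread every slot onto the free list.
    for (std::size_t i = 0; i < N; ++i) {
      slots_[i].next = free_;
      free_ = &slots_[i];
    }
  }
  Pool(const Pool&) = delete;
  Pool& operator=(const Pool&) = delete;

  // Constructs a T in a free slot, or returns nullptr when the pool is exhausted.
  template <class... Args>
  T* create(Args&&... args) {
    if (free_ == nullptr) return nullptr;
    Slot* slot = free_;
    free_ = slot->next;
    return ::new (static_cast<void*>(slot->storage)) T{std::forward<Args>(args)...};
  }

  // Destroys the object and pushes its slot back onto the free list.
  void destroy(T* p) noexcept {
    assert(p != nullptr);
    p->~T();
    Slot* slot = reinterpret_cast<Slot*>(p);  // storage sits at offset 0 of Slot
    slot->next = free_;
    free_ = slot;
  }
};
```

Both `create` and `destroy` are a handful of pointer operations, which is what keeps allocation cost flat when price levels appear and disappear frequently.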

Results & Metrics

  • P99 match latency: in progress (benchmarking in progress)
  • Orders/sec: in progress (throughput test pending)
  • Speedup vs Python: in progress (baseline comparison pending)
  • Latency benchmarks against a pure Python orderbook are in progress; the working expectation is a 100–500× speedup, based on prior C++/Python comparisons rather than measurements from this project
  • ML signal directional accuracy and P&L simulation on held-out L2 data planned for next milestone
  • Memory footprint profiled with Valgrind massif; the arena allocator removes heap-allocation jitter (latency spikes of the kind a garbage collector causes in managed runtimes) from the hot path
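
For the pending P99 measurement, one simple approach is to time each operation individually and take a nearest-rank percentile over the samples. The helper names `time_each_ns` and `percentile_ns` below are hypothetical, and this is a sketch of the measurement idea rather than the project's benchmark harness; a real run would also need warm-up iterations, thread pinning, and care about timer overhead at this scale.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Times each call to `op` separately and returns the per-call durations in nanoseconds.
template <class Fn>
std::vector<int64_t> time_each_ns(Fn&& op, std::size_t iterations) {
  std::vector<int64_t> samples;
  samples.reserve(iterations);  // no reallocation inside the timed loop
  for (std::size_t i = 0; i < iterations; ++i) {
    const auto start = std::chrono::steady_clock::now();
    op();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
  }
  return samples;
}

// Nearest-rank percentile: the smallest sample such that at least `pct` percent
// of all samples are less than or equal to it. `pct` must be in 1..100.
inline int64_t percentile_ns(std::vector<int64_t> samples, unsigned pct) {
  assert(!samples.empty() && pct >= 1 && pct <= 100);
  std::sort(samples.begin(), samples.end());
  const std::size_t rank = (samples.size() * pct + 99) / 100;  // ceil(n * pct / 100)
  return samples[rank - 1];
}
```

Calling `percentile_ns(time_each_ns(match_once, n), 99)` with a hypothetical `match_once` callable would give the P99 match latency; orders per second follows from the total elapsed time over the same run.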

Tech Stack

C++17, CUDA, Python, CMake, ONNX Runtime, LightGBM, Custom Arena Allocator