TurboQuant
Applying the TurboQuant paper's rotation + Lloyd-Max framework to offline LLM weight compression — replacing QJL with multi-pass residual quantization and fused GPU kernels.
Paper → Practice
This project takes the core ideas from the TurboQuant paper and applies them to offline LLM weight compression, a use case distinct from the paper's primary focus.
📄 The Paper (Zandieh et al.)
Introduces online vector quantization with a near-optimal distortion-rate trade-off. The primary contribution is TurboQuant-prod, an unbiased inner-product estimator combining Lloyd-Max quantization with a 1-bit QJL correction for KV-cache attention.
⚡ This Project
Takes the paper's rotation + Lloyd-Max foundation and applies it to offline weight compression. Replaces QJL with multi-pass residual quantization using full Lloyd-Max codebooks, and adds fused GPU kernels for production inference.
No QJL
QJL's 1-bit unbiased estimator, designed for streaming KV-cache quantization, is unnecessary for offline weight compression.
Residual Instead
A second full Lloyd-Max pass quantizes the residual error: 4+4 bits achieves a KL divergence of 0.002, strictly dominating the 1-bit QJL correction.
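A minimal NumPy sketch of the two-pass residual idea. This is illustrative only, not the project's implementation: the quantile initialization, iteration count, and variable names are assumptions, and it assumes rotated weights are roughly Gaussian.

```python
import numpy as np

def lloyd_max(x, bits, iters=50):
    """Fit an optimal scalar codebook to x via Lloyd's algorithm."""
    k = 2 ** bits
    # initialize centroids on quantiles of the data (one common choice)
    codebook = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):           # skip empty cells
                codebook[j] = x[idx == j].mean()
    return codebook

def quantize(x, codebook):
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)            # stand-in for rotated weights

cb1 = lloyd_max(w, bits=4)                 # first 4-bit pass
_, w_hat = quantize(w, cb1)
res = w - w_hat                            # residual error

cb2 = lloyd_max(res, bits=4)               # second 4-bit pass on the residual
_, r_hat = quantize(res, cb2)

mse1 = np.mean((w - w_hat) ** 2)
mse2 = np.mean((w - (w_hat + r_hat)) ** 2)
assert mse2 < mse1                         # residual pass strictly reduces error
```

The second pass fits a fresh Lloyd-Max codebook to the residual distribution rather than reusing the first codebook, which is what makes 4+4 bits outperform a single 8-bit-equivalent correction of lower fidelity.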
Rotate Input
Pre-rotate the activation (B×d, cheap) instead of inverse-rotating the weight matrix (M×N, expensive).
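The identity behind this trick: if the weights are stored pre-rotated as W·R with R orthogonal, then rotating the activation by R at inference recovers the exact original product, since R·Rᵀ = I. A small NumPy sketch (a dense random orthogonal R here for clarity; a structured transform like a Hadamard would make the activation rotation even cheaper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, batch = 64, 128, 8

# random orthogonal rotation via QR of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

W = rng.standard_normal((n, d))    # original weight (n x d)
x = rng.standard_normal((batch, d))

W_rot = W @ R                      # rotated once, offline, then quantized

# inference: rotate the small activation (batch x d) instead of
# inverse-rotating the large weight matrix (n x d)
y = (x @ R) @ W_rot.T              # == x @ W.T because R @ R.T = I

assert np.allclose(y, x @ W.T)
```

The activation rotation costs O(B·d²) per batch (or O(B·d·log d) with a Hadamard transform), while un-rotating the weight matrix would touch every one of its M×N entries.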
Fused Kernels
CuTile/Triton kernels fuse unpack→lookup→matmul→rescale into a single pass. The 64-byte codebook lives in registers; no intermediate round-trips to global memory.
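A rough NumPy emulation of the four stages the fused kernel performs per tile (the real kernels are CuTile/Triton on GPU; the per-row scale and 4-bit packing layout here are assumptions for illustration). Sixteen fp32 codebook entries are exactly the 64 bytes that fit in registers:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 16
codebook = np.sort(rng.standard_normal(16)).astype(np.float32)  # 16 fp32 = 64 B
idx = rng.integers(0, 16, size=(m, n)).astype(np.uint8)         # 4-bit codes
scale = rng.uniform(0.5, 2.0, size=(m, 1)).astype(np.float32)   # per-row scale

# storage format: two 4-bit codes packed per byte
packed = (idx[:, 0::2] << 4) | idx[:, 1::2]

x = rng.standard_normal((4, n)).astype(np.float32)              # activations

# the four stages a fused kernel runs back-to-back, here in sequence:
hi, lo = packed >> 4, packed & 0xF                              # 1. unpack
unpacked = np.empty((m, n), dtype=np.uint8)
unpacked[:, 0::2], unpacked[:, 1::2] = hi, lo
w = codebook[unpacked]                                          # 2. lookup
y = x @ w.T                                                     # 3. matmul
y = y * scale.T                                                 # 4. rescale

# matches dequantizing the whole weight up front
ref = x @ (codebook[idx] * scale).T
assert np.allclose(y, ref)
```

In the fused GPU version these stages never materialize `w` in global memory: each tile is unpacked and looked up in registers, multiplied, and rescaled before the result is written out.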
Core Techniques
Six techniques combine to achieve near-information-theoretic-optimal weight compression. Click any card to explore in depth.