TurboQuant

Applying the TurboQuant paper's rotation + Lloyd-Max framework to offline LLM weight compression — replacing QJL with multi-pass residual quantization and fused GPU kernels.


Paper → Practice

This project takes the core ideas from the TurboQuant paper and applies them to offline LLM weight compression — a different use case than the paper's primary focus.

📄 The Paper (Zandieh et al.)

Introduces online vector quantization with near-optimal distortion rate. The primary contribution is TurboQuant-prod, an unbiased inner-product estimator combining Lloyd-Max quantization with a 1-bit QJL correction for KV-cache attention.

Online estimation · QJL 1-bit correction · Unbiased dot products · KV-cache focus
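
For reference, here is a minimal sketch of the 1-D Lloyd-Max quantizer that both the paper and this project build on: Lloyd's algorithm alternates nearest-centroid assignment with conditional-mean centroid updates until the codebook settles. Function names and the 4-bit default are illustrative, not taken from either codebase.

```python
import numpy as np

def lloyd_max_codebook(x, bits=4, iters=50):
    """Fit a 1-D Lloyd-Max codebook to samples x by alternating
    nearest-centroid assignment with conditional-mean centroid updates."""
    k = 1 << bits
    # Start centroids at evenly spaced quantiles of the empirical distribution.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # midpoint decision boundaries
        codes = np.digitize(x, edges)                  # nearest-centroid assignment
        for j in range(k):
            members = x[codes == j]
            if members.size:                           # centroid = mean of its cell
                centroids[j] = members.mean()
    return np.sort(centroids)

def quantize(x, centroids):
    """Map values to code indices and their dequantized values."""
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.digitize(x, edges)
    return codes, centroids[codes]
```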

⚡ This Project

Takes the paper's rotation + Lloyd-Max foundation and applies it to offline weight compression. Replaces QJL with multi-pass residual quantization using full Lloyd-Max codebooks, and adds fused GPU kernels for production inference.

Offline weight compression · No QJL · Residual quantization · Fused CuTile/Triton kernels
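
Roughly, the offline encode path looks like the sketch below: rotate the weight matrix, run a Lloyd-Max pass, then run further passes on whatever error is left. It reuses the helpers from the sketch above; `compress_weight`, `random_rotation`, and the QR-based rotation are assumptions for illustration, not the project's actual API.

```python
import numpy as np
# Reuses lloyd_max_codebook() and quantize() from the Lloyd-Max sketch above.

def random_rotation(d, seed=0):
    """Random orthogonal d×d matrix via QR of a Gaussian (a Hadamard transform works too)."""
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q

def compress_weight(w, bits=(4, 4)):
    """Offline encode: rotate, then run one Lloyd-Max pass per entry of `bits`,
    each pass quantizing the residual left by the previous ones."""
    rot = random_rotation(w.shape[1])
    residual = w @ rot                       # rotated weights, the first "residual"
    codes, books = [], []
    for b in bits:
        cb = lloyd_max_codebook(residual.ravel(), bits=b)
        c, deq = quantize(residual, cb)
        codes.append(c.astype(np.uint8))
        books.append(cb)
        residual = residual - deq            # the next pass sees what is left over
    return codes, books, rot

def decompress_weight(codes, books):
    """Reconstruct the rotated weight as a sum of per-pass codebook lookups.
    (Undo the rotation with rot.T, or pre-rotate activations at inference.)"""
    return sum(cb[c] for c, cb in zip(codes, books))
```
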
🚫 No QJL

QJL's 1-bit unbiased estimator, designed for streaming KV-cache, is unnecessary for offline weight compression.

🎯 Residual Instead

Full Lloyd-Max passes on the residual error replace the 1-bit correction. At 4+4 bits this reaches a KL divergence of 0.002, strictly dominating 1-bit QJL.
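
As a toy illustration of why the residual pass helps, the snippet below compares a single 4-bit Lloyd-Max pass with a 4+4 residual scheme on synthetic Gaussian data, reusing the helpers from the first sketch. The numbers it prints are illustrative only and unrelated to the PPL/KLD figures reported further down.

```python
import numpy as np
# Reuses lloyd_max_codebook() and quantize() from the Lloyd-Max sketch above.

w = np.random.default_rng(0).standard_normal(100_000)

cb1 = lloyd_max_codebook(w, bits=4)
_, deq1 = quantize(w, cb1)                 # single 4-bit pass
cb2 = lloyd_max_codebook(w - deq1, bits=4)
_, deq2 = quantize(w - deq1, cb2)          # second 4-bit pass on the residual

print("MSE, 4 bits:  ", np.mean((w - deq1) ** 2))
print("MSE, 4+4 bits:", np.mean((w - deq1 - deq2) ** 2))
```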

🔄 Rotate Input

Pre-rotate the activation (B×d, cheap) instead of inverse-rotating the weight matrix (M×N, expensive).
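
A minimal sanity check of that identity, assuming y = xW and an orthogonal rotation R: since R Rᵀ = I, (xR)(RᵀW) equals xW, so rotating each small activation batch at runtime undoes the rotation baked into the stored weights. Shapes and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, N = 8, 64, 128                                # illustrative shapes
x = rng.standard_normal((B, d))                     # activation batch
w = rng.standard_normal((d, N))                     # weight matrix
r, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthogonal rotation R

w_rot = r.T @ w            # what gets quantized and stored offline (d×N, done once)
y_ref = x @ w              # un-rotated reference matmul
y_rot = (x @ r) @ w_rot    # rotate only the small B×d activation at runtime

# R Rᵀ = I, so (x R)(Rᵀ W) == x W up to floating-point error.
assert np.allclose(y_ref, y_rot)
```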

🚀 Fused Kernels

CuTile/Triton kernels fuse unpack → lookup → matmul → rescale, keeping the 64-byte codebook in registers with zero round-trips to global memory.
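
For orientation, an unfused PyTorch reference for the four steps the kernel performs in a single pass. The packing layout (two 4-bit codes per byte), tensor names, and per-column scale are assumptions for illustration, not the project's actual kernel interface.

```python
import torch

def dequant_matmul_reference(x, packed, cb1, cb2, scale):
    """Unfused reference for unpack → lookup → matmul → rescale.

    packed : uint8 (d, N); high nibble = pass-1 code, low nibble = pass-2 code
    cb1/cb2: 16-entry Lloyd-Max codebooks (small enough to live in registers)
    scale  : per-output-column rescale factor
    """
    hi = (packed >> 4).long()        # unpack the two 4-bit codes per byte
    lo = (packed & 0xF).long()
    w = cb1[hi] + cb2[lo]            # codebook lookup: first pass + residual pass
    y = x @ w                        # matmul against the dequantized weight
    return y * scale                 # rescale
```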

Benchmark Highlights

- 4+4 residual PPL: 14.28 (vs 14.29 baseline)
- KL divergence: 0.002 nats (near-lossless)
- CuTile speedup: 3.98× vs PyTorch fallback