TurboQuant

Applying the TurboQuant paper's rotation + Lloyd-Max framework to offline LLM weight compression — replacing QJL with multi-pass residual quantization and fused GPU kernels.


Paper → Practice

This project takes the core ideas from the TurboQuant paper and applies them to offline LLM weight compression — a different use case than the paper's primary focus.

📄 The Paper (Zandieh et al.)

Introduces online vector quantization with near-optimal distortion rate. The primary contribution is TurboQuant-prod, an unbiased inner-product estimator combining Lloyd-Max quantization with a 1-bit QJL correction for KV-cache attention.

Online estimation · QJL 1-bit correction · Unbiased dot products · KV-cache focus
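
For reference, here is a minimal sketch of the 1-D Lloyd-Max quantizer that both the paper and this project build on: Lloyd's algorithm alternates nearest-centroid assignment with conditional-mean centroid updates until the codebook settles. Function names and the 4-bit default are illustrative, not taken from either codebase.

```python
import numpy as np

def lloyd_max_codebook(x, bits=4, iters=50):
    """Fit a 1-D Lloyd-Max codebook to samples x by alternating
    nearest-centroid assignment with conditional-mean centroid updates."""
    k = 1 << bits
    # Start centroids at evenly spaced quantiles of the empirical distribution.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # midpoint decision boundaries
        codes = np.digitize(x, edges)                  # nearest-centroid assignment
        for j in range(k):
            members = x[codes == j]
            if members.size:                           # centroid = mean of its cell
                centroids[j] = members.mean()
    return np.sort(centroids)

def quantize(x, centroids):
    """Map values to code indices and their dequantized values."""
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.digitize(x, edges)
    return codes, centroids[codes]
```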

⚡ This Project

Takes the paper's rotation + Lloyd-Max foundation and applies it to offline weight compression. Replaces QJL with multi-pass residual quantization using full Lloyd-Max codebooks, and adds fused GPU kernels for production inference.

Offline weight compression · No QJL · Residual quantization · Fused CuTile/Triton kernels
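
Roughly, the offline encode path looks like the sketch below: rotate the weight matrix, run a Lloyd-Max pass, then run further passes on whatever error is left. It reuses the helpers from the sketch above; `compress_weight`, `random_rotation`, and the QR-based rotation are assumptions for illustration, not the project's actual API.

```python
import numpy as np
# Reuses lloyd_max_codebook() and quantize() from the Lloyd-Max sketch above.

def random_rotation(d, seed=0):
    """Random orthogonal d×d matrix via QR of a Gaussian (a Hadamard transform works too)."""
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q

def compress_weight(w, bits=(4, 4)):
    """Offline encode: rotate, then run one Lloyd-Max pass per entry of `bits`,
    each pass quantizing the residual left by the previous ones."""
    rot = random_rotation(w.shape[1])
    residual = w @ rot                       # rotated weights, the first "residual"
    codes, books = [], []
    for b in bits:
        cb = lloyd_max_codebook(residual.ravel(), bits=b)
        c, deq = quantize(residual, cb)
        codes.append(c.astype(np.uint8))
        books.append(cb)
        residual = residual - deq            # the next pass sees what is left over
    return codes, books, rot

def decompress_weight(codes, books):
    """Reconstruct the rotated weight as a sum of per-pass codebook lookups.
    (Undo the rotation with rot.T, or pre-rotate activations at inference.)"""
    return sum(cb[c] for c, cb in zip(codes, books))
```
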
🚫 No QJL

QJL's 1-bit unbiased estimator, designed for streaming KV-cache, is unnecessary for offline weight compression.

🎯 Residual Instead

Full Lloyd-Max passes on the residual error replace the 1-bit correction. At 4+4 bits this reaches a KL divergence of 0.002, strictly dominating 1-bit QJL.
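
As a toy illustration of why the residual pass helps, the snippet below compares a single 4-bit Lloyd-Max pass with a 4+4 residual scheme on synthetic Gaussian data, reusing the helpers from the first sketch. The numbers it prints are illustrative only and unrelated to the PPL/KLD figures reported further down.

```python
import numpy as np
# Reuses lloyd_max_codebook() and quantize() from the Lloyd-Max sketch above.

w = np.random.default_rng(0).standard_normal(100_000)

cb1 = lloyd_max_codebook(w, bits=4)
_, deq1 = quantize(w, cb1)                 # single 4-bit pass
cb2 = lloyd_max_codebook(w - deq1, bits=4)
_, deq2 = quantize(w - deq1, cb2)          # second 4-bit pass on the residual

print("MSE, 4 bits:  ", np.mean((w - deq1) ** 2))
print("MSE, 4+4 bits:", np.mean((w - deq1 - deq2) ** 2))
```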

🔄 Rotate Input

Pre-rotate the activation (B×d, cheap) instead of inverse-rotating the weight matrix (M×N, expensive).
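
A minimal sanity check of that identity, assuming y = xW and an orthogonal rotation R: since R Rᵀ = I, (xR)(RᵀW) equals xW, so rotating each small activation batch at runtime undoes the rotation baked into the stored weights. Shapes and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, N = 8, 64, 128                                # illustrative shapes
x = rng.standard_normal((B, d))                     # activation batch
w = rng.standard_normal((d, N))                     # weight matrix
r, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthogonal rotation R

w_rot = r.T @ w            # what gets quantized and stored offline (d×N, done once)
y_ref = x @ w              # un-rotated reference matmul
y_rot = (x @ r) @ w_rot    # rotate only the small B×d activation at runtime

# R Rᵀ = I, so (x R)(Rᵀ W) == x W up to floating-point error.
assert np.allclose(y_ref, y_rot)
```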

🚀 Fused Kernels

CuTile/Triton kernels fuse unpack → lookup → matmul → rescale, keeping the 64-byte codebook in registers with zero round-trips to global memory.
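
For orientation, an unfused PyTorch reference for the four steps the kernel performs in a single pass. The packing layout (two 4-bit codes per byte), tensor names, and per-column scale are assumptions for illustration, not the project's actual kernel interface.

```python
import torch

def dequant_matmul_reference(x, packed, cb1, cb2, scale):
    """Unfused reference for unpack → lookup → matmul → rescale.

    packed : uint8 (d, N); high nibble = pass-1 code, low nibble = pass-2 code
    cb1/cb2: 16-entry Lloyd-Max codebooks (small enough to live in registers)
    scale  : per-output-column rescale factor
    """
    hi = (packed >> 4).long()        # unpack the two 4-bit codes per byte
    lo = (packed & 0xF).long()
    w = cb1[hi] + cb2[lo]            # codebook lookup: first pass + residual pass
    y = x @ w                        # matmul against the dequantized weight
    return y * scale                 # rescale
```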

Benchmark Highlights

- 4+4 residual PPL: 14.28 (vs 14.29 baseline)
- KL divergence: 0.002 nats (near-lossless)
- CuTile speedup: 3.98× vs PyTorch fallback