
Residual Quantization

Multi-pass quantization captures progressively finer detail. Each pass quantizes the error left by the previous pass.

Motivation

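Single-pass quantization at b bits has a fundamental error floor: the Lloyd-Max distortion of the source distribution, evaluated here at 4 bits. For LLMs, this translates to roughly 2 PPL of degradation on Qwen3.5-0.8B (16.58 for 4-bit single-pass vs. 14.29 for the bf16 baseline in the Results table). Residual quantization removes most of that gap.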
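To make the error floor concrete, here is a minimal sketch that fits a 4-bit Lloyd-Max codebook to samples and measures the resulting distortion. It assumes unit-Gaussian data, which the text above does not specify, and the helper names `lloyd_max` and `quantize` are hypothetical (they are not taken from residual.py).

```python
import numpy as np

def lloyd_max(samples, bits, iters=100):
    """Fit a 2**bits-level scalar codebook with Lloyd's algorithm (1-D k-means)."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles so every cell begins non-empty.
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Decision boundaries are the midpoints between adjacent centroids.
        edges = (codebook[:-1] + codebook[1:]) / 2
        idx = np.searchsorted(edges, samples)
        # Move each centroid to the mean of the samples assigned to it.
        sums = np.bincount(idx, weights=samples, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        codebook = np.where(counts > 0, sums / np.maximum(counts, 1), codebook)
    return codebook

def quantize(x, codebook):
    """Map each value to its nearest codebook entry."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, x)]

# Assumption: unit-Gaussian data stands in for the real weight distribution.
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
cb4 = lloyd_max(x, bits=4)
mse4 = float(np.mean((x - quantize(x, cb4)) ** 2))
print(f"4-bit Lloyd-Max MSE on unit-Gaussian samples: {mse4:.4f}")
```

No single-pass 4-bit scalar quantizer can do better than this distortion on the same data, which is why a second pass over the residual is needed to go lower.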

The Idea

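Pass 1 quantizes the input. Every later pass quantizes the residual: the input minus the sum of all earlier reconstructions. The residual after each pass has smaller magnitude than the signal that pass received, so even a coarse quantizer applied to it captures significant information.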
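The loop can be sketched in a few lines. This is an illustration only: it swaps in a plain uniform quantizer scaled to each pass's own input, not the quantizer used in residual.py, and the names `uniform_quantize` and `residual_quantize` are hypothetical.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Symmetric uniform (mid-rise) quantizer scaled to the input's max magnitude."""
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    if scale == 0:
        return np.zeros_like(x)
    step = 2 * scale / levels
    # Cell index k in [-levels/2, levels/2 - 1]; reconstruct at the cell center.
    k = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (k + 0.5) * step

def residual_quantize(x, bits_per_pass):
    """Quantize x, then repeatedly quantize whatever error is left over."""
    recon = np.zeros_like(x)
    residual = x.copy()
    for bits in bits_per_pass:
        q = uniform_quantize(residual, bits)  # rescaled to the current residual
        recon += q
        residual = residual - q               # what the next pass has to capture
    return recon

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
err_1 = float(np.mean((x - residual_quantize(x, [4])) ** 2))
err_2 = float(np.mean((x - residual_quantize(x, [4, 4])) ** 2))
print(f"MSE after one 4-bit pass:  {err_1:.6f}")
print(f"MSE after 4+4 residual:    {err_2:.6f}")
```

Because the second pass rescales to the residual's much smaller range, its 16 levels are spaced far more finely than the first pass's 16 levels.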

Animated: Two-Pass Residual

Watch the signal decompose: Pass 1 captures coarse structure, the residual contains the fine details, and Pass 2 captures most of them.

Panels: Original, Pass 1, Residual, Reconstructed

Why 4+4 Beats Single-Pass 8-bit

Single-pass 8-bit

Allocates 256 levels uniformly across the whole dynamic range. Most levels are wasted in low-density regions where few values fall.

4+4 residual ✨

Pass 1: 16 levels optimized for the original distribution. Pass 2: 16 levels optimized for the residual distribution. Each stage fits its levels to the data it actually sees, which spends the same 8-bit budget far more efficiently.
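One way to write the level budget down, as a sketch under the assumption that each pass uses a single 16-entry scalar codebook with its own scale (the symbols $s_p$ and $c^{(p)}$ are introduced here for illustration and do not come from the implementation):

```latex
\hat{x} \;=\; s_1\, c^{(1)}_{i} \;+\; s_2\, c^{(2)}_{j},
\qquad i, j \in \{0, \dots, 15\},
\qquad 16 \times 16 = 2^{8} = 256 \text{ index pairs}
```

The index budget matches single-pass 8-bit, but the 256 reconstruction points come from two fitted codebooks instead of one uniform grid.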

Results

Config	Bits	PPL	KLD
Baseline bf16	16	14.29	—
4+4 residual ✨	8	14.28	0.0020
4+2 residual	6	14.46	0.0159
3+2 residual	5	15.15	0.0545
4-bit single	4	16.58	0.1403
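For a quick read of the table, the snippet below computes each configuration's PPL change relative to the bf16 baseline. The numbers are copied from the table above; the variable names are illustrative.

```python
# (bits, PPL) per configuration, copied from the Results table.
results = {
    "Baseline bf16": (16, 14.29),
    "4+4 residual": (8, 14.28),
    "4+2 residual": (6, 14.46),
    "3+2 residual": (5, 15.15),
    "4-bit single": (4, 16.58),
}

baseline_ppl = results["Baseline bf16"][1]

# PPL change versus the baseline; positive means worse than bf16.
deltas = {
    name: round(ppl - baseline_ppl, 2)
    for name, (bits, ppl) in results.items()
}

for name, (bits, ppl) in results.items():
    print(f"{name:<14} {bits:>2} bits  PPL {ppl:.2f}  delta {deltas[name]:+.2f}")
```

Adding a 2-bit second pass to the 4-bit quantizer cuts the PPL gap from 2.29 to 0.17, and a 4-bit second pass closes it to within 0.01.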

Implementation

residual.py → residual_quantize_packed() (independent seeds)

residual.py → multi_residual_quantize_packed() (shared seed)
