
Residual Quantization

Multi-pass quantization captures progressively finer detail. Each pass quantizes the error left by the previous pass.

Motivation

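Single-pass quantization at a fixed bit width has a fundamental error floor: the Lloyd-Max distortion of the source distribution at that width (4 bits in the case considered here). For LLMs, this translates to roughly 2 PPL of degradation on Qwen3.5-0.8B (16.58 for 4-bit single-pass versus 14.29 for the bf16 baseline). Residual quantization closes most of this gap.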

The Idea

Each pass captures progressively finer detail. The residual left after each pass has a smaller magnitude than the input to that pass, so even a coarse quantizer applied to it captures significant information.
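The recurrence behind this idea can be written out as follows. The symbols here ($x$ for the input, $Q_k$ for the quantizer used in pass $k$, $r_k$ for the residual, $K$ for the number of passes) are notation chosen for illustration and are not taken from the source.

```latex
\begin{aligned}
r_0 &= x, \\
q_k &= Q_k(r_{k-1}), \qquad r_k = r_{k-1} - q_k, \qquad k = 1, \dots, K, \\
\hat{x} &= \sum_{k=1}^{K} q_k, \qquad x - \hat{x} = r_K .
\end{aligned}
```

The final reconstruction error is exactly the last residual, so each additional pass helps whenever its quantizer leaves a residual smaller than the one it received.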

Animated: Two-Pass Residual

Watch the signal decompose: Pass 1 captures coarse structure, the residual contains the fine details, and Pass 2 captures most of them.

[Animation panels: Original, Pass 1, Residual, Reconstructed]
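The decomposition shown in the animation can be reproduced numerically. The following is a minimal sketch, assuming a plain uniform quantizer as a stand-in for the Lloyd-Max quantizer the text refers to. The names `uniform_quantize` and `residual_quantize` are hypothetical and are not the functions in residual.py.

```python
import random

def uniform_quantize(values, bits):
    """Quantize values to 2**bits evenly spaced levels spanning their range."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) or 1.0  # guard against a constant input
    return [lo + round((v - lo) / step) * step for v in values]

def residual_quantize(values, bits_per_pass):
    """Run one pass per entry in bits_per_pass; each pass quantizes the previous error."""
    residual = list(values)
    reconstruction = [0.0] * len(values)
    for bits in bits_per_pass:
        approx = uniform_quantize(residual, bits)
        # Accumulate this pass's contribution and compute what is still missing.
        reconstruction = [r + a for r, a in zip(reconstruction, approx)]
        residual = [r - a for r, a in zip(residual, approx)]
    return reconstruction

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic test data

one_pass = residual_quantize(signal, [4])     # 4-bit single pass
two_pass = residual_quantize(signal, [4, 4])  # 4+4 residual

print(f"4-bit single MSE: {mse(signal, one_pass):.6f}")
print(f"4+4 residual MSE: {mse(signal, two_pass):.6f}")
```

Passing `[4, 2]` or `[3, 2]` as `bits_per_pass` gives the other bit splits. The MSE values printed here are on synthetic Gaussian data and are not comparable to the PPL and KLD figures reported for the model.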

Why 4+4 Beats Single-Pass 8-bit

Single-pass 8-bit

Allocates 256 levels uniformly across the whole dynamic range, so most levels are wasted in low-density regions.

4+4 residual ✨

Pass 1: 16 levels optimized for the original distribution. Pass 2: 16 levels optimized for the residual distribution. Each stage places its levels where its own input is concentrated, so the same 8 bits are spent far more efficiently.
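To make "levels optimized for the distribution" concrete, here is a minimal sketch that fits a 16-level codebook to the data with 1-D Lloyd iterations and then fits a second 16-level codebook to the residual. It assumes Lloyd's algorithm on empirical samples as an approximation of a Lloyd-Max quantizer. The helper names (`nearest`, `lloyd_codebook`, `quantize`) are hypothetical, and this is not the code in residual.py.

```python
import bisect
import random

def nearest(codebook, v):
    """Index of the entry in a sorted codebook that is closest to v."""
    j = bisect.bisect_left(codebook, v)
    if j == 0:
        return 0
    if j == len(codebook):
        return j - 1
    return j if codebook[j] - v < v - codebook[j - 1] else j - 1

def lloyd_codebook(samples, levels, iterations=25):
    """Fit `levels` reconstruction points to the samples with 1-D Lloyd iterations."""
    data = sorted(samples)
    # Start at evenly spaced quantiles so every level has data nearby.
    codebook = [data[(2 * i + 1) * len(data) // (2 * levels)] for i in range(levels)]
    for _ in range(iterations):
        sums = [0.0] * levels
        counts = [0] * levels
        for v in data:
            i = nearest(codebook, v)
            sums[i] += v
            counts[i] += 1
        # Move each level to the mean of its cell; an empty cell keeps its old level.
        codebook = sorted(s / c if c else old for s, c, old in zip(sums, counts, codebook))
    return codebook

def quantize(values, codebook):
    """Map every value to its nearest codebook entry."""
    return [codebook[nearest(codebook, v)] for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic test data

book1 = lloyd_codebook(signal, 16)            # pass 1: fitted to the original data
pass1 = quantize(signal, book1)
residual = [x - q for x, q in zip(signal, pass1)]
book2 = lloyd_codebook(residual, 16)          # pass 2: fitted to the residual
pass2 = quantize(residual, book2)
reconstructed = [a + b for a, b in zip(pass1, pass2)]

print(f"pass 1 only MSE: {mse(signal, pass1):.6f}")
print(f"pass 1 + 2 MSE:  {mse(signal, reconstructed):.6f}")
```

Each value is stored as two 4-bit indices, one per codebook, for 8 bits in total. The sketch only shows how the two codebooks are fitted; it does not reproduce the comparison against single-pass 8-bit or the model-level results.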

Results

Config            Bits   PPL     KLD
Baseline bf16     16     14.29   n/a
4+4 residual ✨   8      14.28   0.0020
4+2 residual      6      14.46   0.0159
3+2 residual      5      15.15   0.0545
4-bit single      4      16.58   0.1403

Implementation

residual.py → residual_quantize_packed() (independent seeds)
residual.py → multi_residual_quantize_packed() (shared seed)