Residual Quantization
Multi-pass quantization captures progressively finer detail. Each pass quantizes the error left by the previous pass.
Motivation
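Single-pass quantization at a fixed bit width has a fundamental error floor: the Lloyd-Max distortion of the source distribution (for example, at 4 bits). For LLMs, this translates to ~2 PPL degradation on Qwen3.5-0.8B. Residual quantization closes most of this gap: in the results table, 4+4 residual reaches 14.28 PPL against 16.58 for single-pass 4-bit.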
The Idea
Each pass captures progressively finer detail. The residual left after each pass has smaller magnitude than that pass's input, so even a coarse quantizer applied to it captures significant information.
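The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind the reported results: the function names `quantize_uniform` and `residual_quantize` are hypothetical, the input is synthetic Gaussian data, and each pass uses a simple max-scaled uniform quantizer as a stand-in for whatever quantizer is actually used.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric uniform quantizer scaled to max|x| (an assumed, simple choice)."""
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return np.zeros_like(x)
    step = 2.0 * scale / levels
    # Mid-rise grid: reconstruction points sit at (k + 0.5) * step.
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

def residual_quantize(x, bits_per_pass):
    """Quantize x in several passes; each pass quantizes the error left so far."""
    approx = np.zeros_like(x)
    residual = x.copy()
    rms_per_pass = []
    for bits in bits_per_pass:
        approx = approx + quantize_uniform(residual, bits)
        residual = x - approx  # error still left after this pass
        rms_per_pass.append(float(np.sqrt(np.mean(residual ** 2))))
    return approx, rms_per_pass

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)  # synthetic stand-in for real tensor values
approx, rms = residual_quantize(x, [4, 4])
print("RMS error after each pass:", rms)
```

The reconstruction is just the sum of the per-pass outputs, so decoding costs one extra addition per pass. Because the second pass rescales to the much smaller residual, its step size shrinks accordingly.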
Animated: Two-Pass Residual
Watch the signal decompose: Pass 1 captures coarse structure, the residual contains the fine details, and Pass 2 captures most of them.
Why 4+4 Beats Single-Pass 8-bit
Single-pass 8-bit
Allocates 256 levels uniformly across the whole dynamic range, so most levels are wasted in low-density regions.
4+4 residual ✨
Pass 1: 16 levels optimized for the original distribution. Pass 2: 16 levels optimized for the residual distribution. Each stage spends its levels where its own input has mass, which makes the two-stage allocation far more efficient.
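To make "optimized for the distribution" concrete, here is a hedged sketch that fits each stage's 16 levels with Lloyd's algorithm (1-D k-means) on samples. Several things here are assumptions rather than details from this page: the helper names `fit_codebook` and `quantize`, the use of sample-based Lloyd iterations as the optimizer, and the standard-normal test data. It measures MSE on a toy signal only and does not reproduce the perplexity numbers.

```python
import numpy as np

def fit_codebook(samples, bits, iters=100):
    """Fit 2**bits scalar levels with Lloyd's algorithm (1-D k-means)."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles so every level owns some samples.
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (codebook[:-1] + codebook[1:]) / 2.0  # nearest-level boundaries
        idx = np.searchsorted(edges, samples)
        sums = np.bincount(idx, weights=samples, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        occupied = counts > 0
        codebook[occupied] = sums[occupied] / counts[occupied]  # centroid update
    return codebook

def quantize(x, codebook):
    """Map each value to its nearest codebook level."""
    edges = (codebook[:-1] + codebook[1:]) / 2.0
    return codebook[np.searchsorted(edges, x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # synthetic stand-in for real tensor values

cb1 = fit_codebook(x, bits=4)          # pass 1: levels fitted to the original data
pass1 = quantize(x, cb1)
residual = x - pass1

cb2 = fit_codebook(residual, bits=4)   # pass 2: levels fitted to the residual
pass2 = quantize(residual, cb2)

mse_one_pass = np.mean((x - pass1) ** 2)
mse_two_pass = np.mean((x - pass1 - pass2) ** 2)
print(f"4-bit single MSE: {mse_one_pass:.5f}   4+4 residual MSE: {mse_two_pass:.6f}")
```

The second codebook is fitted to the residual rather than reusing the first one, which is the point of the two-stage allocation: the residual is far narrower than the original signal, so its 16 levels are packed much more tightly.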
Results
| Config | Bits | PPL | KLD |
|---|---|---|---|
| Baseline bf16 | 16 | 14.29 | n/a |
| 4+4 residual ✨ | 8 | 14.28 | 0.0020 |
| 4+2 residual | 6 | 14.46 | 0.0159 |
| 3+2 residual | 5 | 15.15 | 0.0545 |
| 4-bit single | 4 | 16.58 | 0.1403 |