Quantization Formulation
Mathematical foundations: from problem statement to near-optimal compression in b bits per weight.
Problem Statement
Given a pre-trained weight matrix $W \in \mathbb{R}^{M \times N}$, find a compressed representation $\hat{W}$ that minimizes the mean squared reconstruction error $\frac{1}{MN}\|W - \hat{W}\|_F^2$, subject to using only $b$ bits per element plus a small side-information budget (norms, codebook, seed).
Notation
| Symbol | Meaning |
|---|---|
| $W \in \mathbb{R}^{M \times N}$ | Full-precision weight matrix (M = out, N = in) |
| $b$ | Bit-width of the quantizer ($L = 2^b$ levels) |
| $d$ | Group size (columns processed together) |
| $G = N/d$ | Number of groups |
| $\Pi_g \in \mathbb{R}^{d \times d}$ | Random orthogonal rotation for group $g$ |
| $\alpha_{m,g}$ | Row norm of group $g$, row $m$ |
| $Q_b$ | Scalar quantizer mapping to $L$ centroids |
| $c_1, \dots, c_L$ | Lloyd-Max codebook (centroids) |
| $t_0, \dots, t_L$ | Decision boundaries ($t_0 = -\infty$, $t_L = +\infty$) |
Single-Pass Pipeline
Columns are partitioned into $G$ groups of $d$ columns each, and each group is quantized independently. For each group $g$ and each row $m$, the pipeline proceeds in five steps on the row segment $w_{m,g} \in \mathbb{R}^d$.
Row Normalization
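Extract the row norm of the segment $w_{m,g} \in \mathbb{R}^d$, $\alpha_{m,g} = \|w_{m,g}\|_2$, and normalize: $\tilde{w}_{m,g} = w_{m,g} / \alpha_{m,g}$. After this step, $\|\tilde{w}_{m,g}\|_2 = 1$ and each component has expected magnitude on the order of $1/\sqrt{d}$.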
Random Rotation
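Apply a random orthogonal transform $\Pi_g$ drawn from the Haar measure on the orthogonal group $O(d)$ to the normalized segment $\tilde{w}_{m,g}$: $y_{m,g} = \Pi_g \tilde{w}_{m,g}$. Because $\Pi_g$ is orthogonal, the norm is preserved ($\|y_{m,g}\|_2 = 1$) and each component is approximately distributed as $\mathcal{N}(0, 1/d)$.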
Variance Normalization
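Scale the rotated unit vector $y_{m,g}$ to unit variance, $z_{m,g} = \sqrt{d}\, y_{m,g}$, so each scalar is approximately $\mathcal{N}(0, 1)$, matching the distribution the Lloyd-Max codebook is designed for.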
Lloyd-Max Quantization
Each scalar $z_i$ is independently quantized using the optimal boundaries: $Q_b(z_i) = c_k$ when $t_{k-1} < z_i \le t_k$. At 4 bits: 16 centroids, 15 finite decision boundaries.
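The codebook and boundaries for a standard Gaussian can be computed with the classic Lloyd iteration, alternating midpoint boundaries and conditional-mean centroids. The following is a minimal sketch, not the project's implementation: the function name `lloyd_max_gaussian`, the uniform initialization on [-3, 3], and the fixed iteration count are illustrative assumptions.

```python
import math


def pdf(x):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(bits, iterations=5000):
    """Return (centroids, boundaries, distortion) for N(0, 1) at `bits` bits."""
    levels = 2 ** bits
    # Assumed starting point: centroids spread uniformly over [-3, 3].
    centroids = [-3.0 + 6.0 * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iterations):
        # Nearest-neighbour condition: boundaries are midpoints of centroids.
        inner = [(lo + hi) / 2.0 for lo, hi in zip(centroids, centroids[1:])]
        boundaries = [-math.inf] + inner + [math.inf]
        # Probability mass of each cell.
        probs = [cdf(hi) - cdf(lo) for lo, hi in zip(boundaries, boundaries[1:])]
        # Centroid condition: conditional mean of N(0, 1) on each cell.
        centroids = [
            (pdf(lo) - pdf(hi)) / p
            for lo, hi, p in zip(boundaries, boundaries[1:], probs)
        ]
    # With conditional-mean centroids, MSE = E[z^2] - sum(c_k^2 * p_k) = 1 - ...
    distortion = 1.0 - sum(c * c * p for c, p in zip(centroids, probs))
    return centroids, boundaries, distortion


for bits in (1, 2, 3, 4):
    _, _, dist = lloyd_max_gaussian(bits)
    print(f"b={bits}: D_b = {dist:.6f}")
```

Because the Gaussian density is log-concave, the iteration has a single fixed point, so the initialization mainly affects how many iterations are needed.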
Reconstruction
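Undo the variance scaling and the rotation, then rescale by the stored norm to obtain the quantized approximation in the original coordinate space: $\hat{w}_{m,g} = \frac{\alpha_{m,g}}{\sqrt{d}}\, \Pi_g^\top \hat{z}_{m,g}$, where $\hat{z}_{m,g}$ holds the quantized scalars.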
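The five steps can be sketched end to end for a single row segment. This is an illustrative NumPy sketch under stated assumptions: the helper names are hypothetical, the rotation is sampled by QR decomposition of a Gaussian matrix, and a hard-coded 2-bit codebook (rounded standard Lloyd-Max values for a unit Gaussian) stands in for whatever bit-width is actually used.

```python
import numpy as np

# 2-bit Lloyd-Max codebook for N(0, 1), rounded (illustrative choice of b = 2).
CENTROIDS = np.array([-1.5104, -0.4528, 0.4528, 1.5104])
INNER_BOUNDARIES = np.array([-0.9816, 0.0, 0.9816])


def haar_rotation(d, seed):
    """Orthogonal matrix from a seed: QR of a Gaussian matrix, signs corrected."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Multiplying column j by sign(r_jj) makes the result Haar-distributed.
    return q * np.sign(np.diag(r))


def quantize_segment(w, seed):
    """Steps 1-4: return the stored norm and the 2-bit indices."""
    d = w.shape[0]
    alpha = np.linalg.norm(w)                        # 1. row norm (assumed non-zero)
    y = haar_rotation(d, seed) @ (w / alpha)         # 2. random rotation
    z = np.sqrt(d) * y                               # 3. variance normalization
    indices = np.searchsorted(INNER_BOUNDARIES, z)   # 4. t_{k-1} < z <= t_k
    return alpha, indices


def dequantize_segment(alpha, indices, seed):
    """Step 5: centroid lookup, inverse rotation, norm rescaling."""
    d = indices.shape[0]
    z_hat = CENTROIDS[indices]
    # The rotation is regenerated from the seed rather than stored.
    return (alpha / np.sqrt(d)) * (haar_rotation(d, seed).T @ z_hat)


rng = np.random.default_rng(1)
w = rng.laplace(size=256)                            # deliberately non-Gaussian input
alpha, indices = quantize_segment(w, seed=7)
w_hat = dequantize_segment(alpha, indices, seed=7)
relative_error = np.sum((w - w_hat) ** 2) / np.sum(w ** 2)
print(f"relative squared error: {relative_error:.4f} (2-bit D_b is 0.1175)")
```

Only `alpha`, `indices`, and the seed need to be kept; the relative squared error should land near the 2-bit distortion even though the input is heavy-tailed.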
MSE Analysis
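Because orthogonal rotation preserves the Euclidean norm, the per-element reconstruction error of a row segment $w_{m,g}$ with norm $\alpha_{m,g}$ factors cleanly:

$$\frac{1}{d}\,\mathbb{E}\,\|w_{m,g} - \hat{w}_{m,g}\|_2^2 = \frac{\alpha_{m,g}^2}{d^2}\,\mathbb{E}\,\|z_{m,g} - \hat{z}_{m,g}\|_2^2 \approx \frac{\alpha_{m,g}^2}{d}\, D_b$$

where $z_{m,g}$ is the rotated, variance-normalized segment, $\hat{z}_{m,g}$ its quantized version, and $D_b$ is the distortion of the $b$-bit Lloyd-Max quantizer on $\mathcal{N}(0, 1)$. The overall MSE is:

$$\mathrm{MSE} = \frac{1}{MN} \sum_{m,g} \mathbb{E}\,\|w_{m,g} - \hat{w}_{m,g}\|_2^2 \approx D_b \cdot \overline{\alpha^2}, \qquad \overline{\alpha^2} = \frac{1}{MN} \sum_{m,g} \alpha_{m,g}^2 = \frac{\|W\|_F^2}{MN}$$

where $\overline{\alpha^2}$ is the average squared norm per weight element.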
Lloyd-Max Distortion Values
| b (bits) | L (levels) | $D_b$ | SNR (dB) |
|---|---|---|---|
| 1 | 2 | 0.3634 | 4.40 |
| 2 | 4 | 0.1175 | 9.30 |
| 3 | 8 | 0.03454 | 14.62 |
| 4 | 16 | 0.009497 | 20.22 |
| 5 | 32 | 0.002499 | 26.02 |
Each additional bit cuts the distortion by a factor of roughly 3 to 4, which is about 5 to 6 dB of SNR improvement.
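For a unit-variance source, the SNR column is $-10 \log_{10} D_b$. The short sketch below recomputes it from the tabulated distortions and prints the gain per extra bit; the dictionary simply copies the table, and the function name is illustrative.

```python
import math

# Distortion values copied from the table above (unit-variance Gaussian source).
DISTORTION = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}


def snr_db(distortion):
    """SNR in dB for a unit-variance signal with the given MSE distortion."""
    return -10.0 * math.log10(distortion)


snr = {bits: snr_db(d) for bits, d in DISTORTION.items()}
gains = [snr[b + 1] - snr[b] for b in range(1, 5)]

for bits in sorted(snr):
    print(f"b={bits}: SNR = {snr[bits]:.2f} dB")
print("gain per extra bit (dB):", [round(g, 2) for g in gains])
```

The per-bit gains come out between roughly 4.9 and 5.8 dB, growing with the bit-width.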
Near-Optimality
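The Shannon rate-distortion function for $\mathcal{N}(0, 1)$ at rate $b$ bits gives the minimum achievable distortion:

$$D^*(b) = 2^{-2b}$$

At $b = 4$ bits, $D^*(4) = 2^{-8} \approx 0.003906$. The Lloyd-Max quantizer achieves $D_4 = 0.009497$, giving a gap of only:

$$10 \log_{10} \frac{D_4}{D^*(4)} = 10 \log_{10} \frac{0.009497}{0.003906} \approx 3.86 \text{ dB}$$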
Why rotation makes this possible
Without rotation, trained neural network weights are correlated and non-Gaussian. Scalar quantization operating per-coordinate leaves inter-coordinate redundancy unexploited. The random rotation decorrelates the coordinates and makes them approximately i.i.d. Gaussian, reducing the problem to the case where scalar Lloyd-Max is near-optimal. At 4 bits the gap to the theoretical optimum is only ~3.9 dB, and at any bit-width it stays below the high-rate limit for fixed-rate scalar quantization of a Gaussian, $10 \log_{10}(\sqrt{3}\pi/2) \approx 4.35$ dB.
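The gap to the Shannon bound can be reproduced from the tabulated distortions. This is a small arithmetic sketch; the dictionary copies the distortion table, and the function names are illustrative.

```python
import math

# Lloyd-Max distortions for N(0, 1), copied from the distortion table.
DISTORTION = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}


def shannon_distortion(bits):
    """Rate-distortion bound for a unit-variance Gaussian at `bits` bits."""
    return 2.0 ** (-2 * bits)


def gap_db(bits):
    """Distance of the scalar Lloyd-Max quantizer from the bound, in dB."""
    return 10.0 * math.log10(DISTORTION[bits] / shannon_distortion(bits))


for bits in sorted(DISTORTION):
    print(f"b={bits}: bound = {shannon_distortion(bits):.6f}, gap = {gap_db(bits):.2f} dB")
```

At 4 bits this prints a gap of about 3.86 dB, matching the figure quoted above.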
Residual Quantization
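The single-pass error can be reduced by iteratively quantizing the reconstruction residual. Writing $\mathrm{TQ}_b(\cdot)$ for the single-pass quantize-and-reconstruct operator at $b$ bits, define the residual sequence:

$$R_0 = W, \qquad \hat{W}_k = \mathrm{TQ}_{b_k}(R_{k-1}), \qquad R_k = R_{k-1} - \hat{W}_k, \qquad k = 1, \dots, P$$

The final approximation is $\hat{W} = \sum_{k=1}^{P} \hat{W}_k$. After $P$ passes the total MSE is approximately:

$$\mathrm{MSE}_P \approx \frac{\|W\|_F^2}{MN} \prod_{k=1}^{P} D_{b_k}$$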
Effective Bit-Rate Configurations
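| Config | Passes | Bits / weight | Expected MSE ratio |
|---|---|---|---|
| 4-bit single | 1 | 4 | $D_4 \approx 9.5 \times 10^{-3}$ |
| 4+4 residual | 2 | 8 | $D_4^2 \approx 9.0 \times 10^{-5}$ |
| 4+2 residual | 2 | 6 | $D_4 D_2 \approx 1.1 \times 10^{-3}$ |
| 2+2+2+2 | 4 | 8 | $D_2^4 \approx 1.9 \times 10^{-4}$ |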
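The residual scheme can be sketched as a loop that quantizes whatever the previous passes left behind. The code below is an illustrative NumPy sketch, not the project's implementation: it treats each row as a single group, uses a hard-coded 2-bit codebook in every pass, draws a fresh rotation seed per pass, and all function names are hypothetical.

```python
import numpy as np

# 2-bit Lloyd-Max codebook for N(0, 1), rounded (used for every pass here).
CENTROIDS = np.array([-1.5104, -0.4528, 0.4528, 1.5104])
INNER_BOUNDARIES = np.array([-0.9816, 0.0, 0.9816])


def single_pass(rows, seed):
    """Quantize and reconstruct every row of `rows` (one group, d = row length)."""
    d = rows.shape[1]
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    rotation = q * np.sign(np.diag(r))               # Haar-distributed rotation
    alpha = np.linalg.norm(rows, axis=1, keepdims=True)
    z = np.sqrt(d) * (rows / alpha) @ rotation.T     # rotate rows, unit variance
    z_hat = CENTROIDS[np.searchsorted(INNER_BOUNDARIES, z)]
    return (alpha / np.sqrt(d)) * (z_hat @ rotation) # back to original coordinates


def residual_quantize(weights, seeds):
    """Run one pass per seed on the running residual; track the MSE after each."""
    residual = weights.copy()
    approx = np.zeros_like(weights)
    mse_per_pass = []
    for seed in seeds:
        w_hat = single_pass(residual, seed)
        approx += w_hat
        residual = residual - w_hat
        mse_per_pass.append(float(np.mean(residual ** 2)))
    return approx, mse_per_pass


rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 128))
approx, mse_per_pass = residual_quantize(weights, seeds=[11, 12])
print("MSE after each pass:", mse_per_pass)
```

With independent seeds, each pass should shrink the MSE by a factor close to the 2-bit distortion of 0.1175, which is the multiplicative behaviour the table relies on.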
Rotation Strategies
The choice of rotation seed(s) across residual passes has a significant impact on error decorrelation and inference efficiency.
| Strategy | Seeds | Advantage | Disadvantage |
|---|---|---|---|
| Independent | Distinct seed per pass | Errors are projected onto different subspaces, which maximizes error reduction per pass | Cannot merge passes in the rotated domain |
| Shared | One seed for all passes | Enables the fast-path merge, which reduces storage and inference cost to single-pass levels | Correlated errors across passes; reduced benefit per pass |
| Alternating | Two seeds, alternated between passes | Adjacent passes use different bases, giving near-independent error reduction with only two rotation matrices | Cannot use the fast-path merge |
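The three strategies differ only in which seed each pass receives. A minimal sketch of such a schedule follows; the function name, the `base_seed` parameter, and the choice of consecutive integers as seeds are illustrative assumptions, not part of the described system.

```python
def rotation_seeds(strategy, passes, base_seed=0):
    """Return one rotation seed per residual pass for the given strategy."""
    if strategy == "independent":
        # A new rotation for every pass.
        return [base_seed + k for k in range(passes)]
    if strategy == "shared":
        # The same rotation everywhere, so passes can be merged in the rotated domain.
        return [base_seed] * passes
    if strategy == "alternating":
        # Two rotations; adjacent passes never share a basis.
        return [base_seed + (k % 2) for k in range(passes)]
    raise ValueError(f"unknown strategy: {strategy!r}")


for name in ("independent", "shared", "alternating"):
    print(name, rotation_seeds(name, passes=4))
```

The fast-path merge for the shared strategy follows from linearity: with one rotation, the per-pass reconstructions can be summed before the inverse rotation is applied.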
Inference
Instead of materializing the full weight matrix, inference operates in the rotated domain using the pre-rotate input trick.
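For input $x \in \mathbb{R}^N$ and output $y = Wx \in \mathbb{R}^M$, substitute the reconstruction $\hat{W}$ for $W$, with $x_g \in \mathbb{R}^d$ the slice of $x$ belonging to group $g$:

$$y_m \approx \sum_{g=1}^{G} \frac{\alpha_{m,g}}{\sqrt{d}} \left(\Pi_g^\top \hat{z}_{m,g}\right)^\top x_g = \sum_{g=1}^{G} \frac{\alpha_{m,g}}{\sqrt{d}}\, \hat{z}_{m,g}^\top \left(\Pi_g x_g\right)$$

Pre-rotating the input once per group, $\tilde{x}_g = \Pi_g x_g$, reduces the inner product to a lookup + dot product:

$$y_m \approx \sum_{g=1}^{G} \frac{\alpha_{m,g}}{\sqrt{d}} \sum_{i=1}^{d} c_{k_{m,g,i}}\, \tilde{x}_{g,i}$$

where $k_{m,g,i}$ is the stored codebook index of element $i$ in row $m$, group $g$.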
| Operation | Cost | Comment |
|---|---|---|
| Input rotation | $O(Nd)$ or $O(N \log d)$ | QR vs Hadamard |
| Codebook lookup | $O(MN)$ | Index → centroid |
| Fused dot product | $O(MN)$ | Sum of products |
| Norm rescaling | $O(MN/d)$ | Multiply by $\alpha_{m,g} / \sqrt{d}$ |
Total: $O(MN)$ per forward pass (per input vector), the same asymptotic cost as a dense matmul, but with a much smaller memory footprint.
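The pre-rotate identity can be checked numerically: rotating the input once per group and taking dot products against looked-up centroids gives the same output as materializing the reconstructed matrix. The sketch below uses random indices, norms, and rotations purely as stand-ins for stored quantized data; the array names and layouts are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 8, 64, 16
G = N // d

codebook = np.array([-1.5104, -0.4528, 0.4528, 1.5104])   # 2-bit centroids

# Stand-ins for stored data: indices (M, G, d), norms (M, G), rotations (G, d, d).
indices = rng.integers(0, 4, size=(M, G, d))
norms = rng.uniform(0.5, 2.0, size=(M, G))
rotations = np.stack(
    [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(G)]
)
x = rng.standard_normal(N)

z_hat = codebook[indices]                 # codebook lookup, shape (M, G, d)
scale = norms / np.sqrt(d)                # per-row, per-group rescaling factor

# Reference path: materialize W_hat = scale * Pi_g^T z_hat, then multiply.
w_hat = np.einsum("mg,gei,mge->mgi", scale, rotations, z_hat).reshape(M, N)
y_reference = w_hat @ x

# Fast path: rotate each input slice once, then dot with the centroids.
x_rotated = np.einsum("gei,gi->ge", rotations, x.reshape(G, d))
y_fast = np.einsum("mg,mge,ge->m", scale, z_hat, x_rotated)

print("max abs difference:", np.abs(y_reference - y_fast).max())
```

The fast path never forms the dense matrix; it touches only the indices, the norms, and one rotated copy of the input.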
Compact Summary
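The TurboQuant quantization objective can be written compactly as:

$$\min_{\hat{W}}\; \frac{1}{MN}\, \mathbb{E}\,\|W - \hat{W}\|_F^2, \qquad \hat{W} = \sum_{k=1}^{P} \mathrm{TQ}_{b_k}(R_{k-1})$$

where $R_{k-1}$ is the residual left by the previous passes ($R_0 = W$) and the single-pass operator, applied to each row segment $w$ of each group $g$, is:

$$\mathrm{TQ}_b(w) = \frac{\alpha}{\sqrt{d}}\, \Pi_g^\top\, Q_b\!\left(\frac{\sqrt{d}}{\alpha}\, \Pi_g w\right), \qquad \alpha = \|w\|_2$$

Random rotation: decorrelates weight coordinates → i.i.d. approximate 𝒩(0, 1/d)
Variance normalization: matches the codebook's design distribution → 𝒩(0, 1)
Lloyd-Max quantization: optimal scalar quantizer for known distributions → minimal D_b
Residual quantization: exploits remaining structure → multiplicative MSE reduction
Pre-rotated inference: preserves the b-bit memory advantage at inference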
BPW Analysis
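Beyond MSE, bits per weight (BPW) determines model size on disk and memory footprint, so it is the second quantity to minimize. For the quantized layers, the total storage per weight element decomposes as:

$$\mathrm{BPW} = \sum_{k=1}^{P} b_k + P \cdot \frac{p_{\text{norm}}}{d} + \varepsilon$$

where the first term counts index bits, the second counts one set of norms per pass at $p_{\text{norm}}$ bits each, and $\varepsilon$ is the small shared overhead of codebooks and seeds.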
Variables That Affect BPW
| Variable | How it affects BPW | Default |
|---|---|---|
| Index bit-width (b) | Directly: BPW ∝ b | 4 |
| Residual passes (P) | Total index bits = Σ b_k | 1 or 2 |
| Group size (d) | Norm overhead = 32/d per norm set | 128 |
| Norm precision | Overhead = p_norm/d (currently 32-bit) | float32 |
| Number of norm sets | Each pass stores one set of norms | P |
| Non-quantized layers | Embeddings, LayerNorm, lm_head at full precision | model-dep |
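The per-weight budget of the quantized layers follows directly from the table's variables. A small sketch, assuming one norm per row per group per pass and ignoring codebooks, seeds, and non-quantized layers (so whole-model figures will be higher); the function name is illustrative.

```python
def bits_per_weight(index_bits, group_size=128, norm_bits=32):
    """BPW of quantized layers: index bits plus one norm set per pass.

    index_bits: list with the bit-width of each residual pass, e.g. [4, 2].
    """
    passes = len(index_bits)
    norm_overhead = passes * norm_bits / group_size
    return sum(index_bits) + norm_overhead


print("4-bit single :", bits_per_weight([4]))
print("4+2 residual :", bits_per_weight([4, 2]))
print("3+2, fp16 norms, d=256:", bits_per_weight([3, 2], group_size=256, norm_bits=16))
```

With the defaults from the table, the norm overhead is 0.25 BPW per pass, so a 4+2 configuration costs 6.5 BPW before any full-precision layers are counted.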
Budget Example: Qwen3.5-0.8B (4+2 residual)
Research Directions
Seven ideas to push effective BPW lower while preserving quality. Combined potential: with fp16 norms, a larger group size $d$, a 3+2 config, and entropy coding, effective BPW could reach ~4.5 at quality comparable to the current 4+2 config (7 BPW).
Store norms in float16/bfloat16 instead of float32, halving the overhead to $16/d$ BPW per norm set (0.125 at $d = 128$). Low risk for typical weight magnitudes.
Increase $d$ from 128 to 256 or 512. The Gaussian approximation improves with larger $d$ (better concentration of measure), and the per-weight norm overhead shrinks in proportion.
Use 3-bit for the primary pass and rely on residual passes for quality. A 3+2 config uses 5 index bits per weight with distortion comparable to 4-bit single (the multiplicative model gives $D_3 D_2 \approx 4.1 \times 10^{-3}$ versus $D_4 \approx 9.5 \times 10^{-3}$).
Assign higher bit-widths to sensitive layers (attention Q/K) and lower bit-widths to less sensitive ones (MLP). Under a total bit budget, the allocation is solvable via dynamic programming.
Lloyd-Max indices are non-uniformly distributed (inner levels are more probable). At 4 bits the Shannon entropy of the indices is ~3.76 bits vs 4 allocated. ANS or Huffman coding can recover much of the gap.
Residual norms are much smaller than pass-1 norms. Store them as ratios of the pass-1 norms, quantized to 8 bits.
Absorb the norm into a per-group scaled codebook: $c_k^{(g)} = s_g\, c_k$. This trades per-row norms for a single per-group scale factor $s_g$.
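The entropy figure quoted for the 4-bit indices can be checked by designing the Lloyd-Max cells for a unit Gaussian and taking the entropy of the cell probabilities. This is an illustrative sketch that assumes exactly Gaussian inputs; the function names, initialization, and iteration count are arbitrary choices.

```python
import math


def pdf(x):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_cell_probabilities(bits, iterations=5000):
    """Probability of each quantizer index for an N(0, 1) input."""
    levels = 2 ** bits
    centroids = [-3.0 + 6.0 * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iterations):
        inner = [(lo + hi) / 2.0 for lo, hi in zip(centroids, centroids[1:])]
        bounds = [-math.inf] + inner + [math.inf]
        probs = [cdf(hi) - cdf(lo) for lo, hi in zip(bounds, bounds[1:])]
        centroids = [
            (pdf(lo) - pdf(hi)) / p for lo, hi, p in zip(bounds, bounds[1:], probs)
        ]
    return probs


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs)


probs = lloyd_max_cell_probabilities(4)
index_entropy = entropy_bits(probs)
print(f"4-bit index entropy: {index_entropy:.3f} bits (4 allocated)")
print(f"innermost cell: {max(probs):.4f}, outermost cell: {min(probs):.4f}")
```

The difference between 4 bits and the printed entropy is the most an ideal entropy coder could save per index in the first pass.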