📐

Quantization Formulation

Mathematical foundations: from problem statement to near-optimal compression in b bits per weight.

Problem Statement

Given a pre-trained weight matrix $W \in \mathbb{R}^{M \times N}$, find a compressed representation $\hat{W}$ that minimizes the mean squared reconstruction error $\frac{1}{MN}\|W - \hat{W}\|_F^2$, subject to using only $b$ bits per element plus a small side-information budget (norms, codebook, seed).

Notation

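Symbol	Meaning
$W \in \mathbb{R}^{M \times N}$	Full-precision weight matrix (M = out, N = in)
$b$	Bit-width of the quantizer ($L = 2^b$ levels)
$d$	Group size (columns processed together)
$G = N/d$	Number of groups
$\Pi_g \in \mathbb{R}^{d \times d}$	Random orthogonal rotation for group g
$\alpha_{g,m}$	Row norm of group g, row m
$Q : \mathbb{R} \to \{c_1, \dots, c_L\}$	Scalar quantizer mapping to L centroids
$\{c_1, \dots, c_L\}$	Lloyd-Max codebook (centroids)
$t_0 < t_1 < \dots < t_L$	Decision boundaries ($t_0 = -\infty$, $t_L = +\infty$)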

Single-Pass Pipeline

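The $N$ columns are partitioned into $G = N/d$ groups of $d$ columns, and each group is quantized independently. For each group $g \in \{1, \dots, G\}$ and each row $m \in \{1, \dots, M\}$, the pipeline proceeds in five steps. Below, $w_{g,m} \in \mathbb{R}^d$ denotes the slice of row $m$ that falls in group $g$.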

Row Normalization

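Extract the row norm $\alpha_{g,m} = \|w_{g,m}\|_2$ and normalize: $\bar{w}_{g,m} = w_{g,m} / \alpha_{g,m}$. After this step, $\|\bar{w}_{g,m}\|_2 = 1$ and each component has root-mean-square magnitude $1/\sqrt{d}$.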

Random Rotation

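Apply a random orthogonal transform $\Pi_g$ drawn from the Haar measure on the orthogonal group $O(d)$: $u_{g,m} = \Pi_g \bar{w}_{g,m}$. Because $\Pi_g$ is orthogonal, the norm is preserved ($\|u_{g,m}\|_2 = 1$), and for large $d$ each component is approximately Gaussian: $u_i \approx \mathcal{N}(0, 1/d)$.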

Variance Normalization

Scale to unit variance, $z_{g,m} = \sqrt{d}\, u_{g,m}$, so that each scalar satisfies $z_i \approx \mathcal{N}(0, 1)$, which is the distribution the Lloyd-Max codebook is designed for.

Lloyd-Max Quantization

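Each scalar $z_i$ is quantized independently using the optimal boundaries: $Q(z_i) = c_k$ for the $k$ with $t_{k-1} \le z_i < t_k$, and only the index $k$ is stored. At 4 bits: 16 centroids, 15 finite decision boundaries.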

Reconstruction

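Map the stored indices back to centroids, undo the variance scaling and the rotation, and rescale by the stored norm to obtain the quantized approximation in the original coordinate space: $\hat{w}_{g,m} = \frac{\alpha_{g,m}}{\sqrt{d}}\, \Pi_g^\top Q(z_{g,m})$, with $Q$ applied elementwise.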
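The five steps above can be sketched for a single row-group as follows. This is a minimal, standard-library-only illustration, not the actual implementation. The function names (`random_rotation`, `quantize_row`, `dequantize_row`) are hypothetical. The rotation is built by Gram-Schmidt on Gaussian rows, which is one way to sample from the Haar measure. The hard-coded codebook is assumed to be the textbook 2-bit Lloyd-Max quantizer for a unit-variance Gaussian, rounded to about four significant digits.

```python
import bisect
import math
import random

# Textbook 2-bit Lloyd-Max quantizer for N(0, 1), rounded (assumed values).
CENTROIDS = [-1.510, -0.4528, 0.4528, 1.510]
BOUNDARIES = [-0.9816, 0.0, 0.9816]  # finite decision boundaries t_1..t_3


def random_rotation(d, rng):
    """Sample a d x d orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(2):  # second sweep improves numerical orthogonality
            for u in rows:
                dot = sum(a * b for a, b in zip(u, v))
                v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def quantize_row(w, rot):
    """Steps 1-4: return the stored norm and the list of codebook indices."""
    d = len(w)
    alpha = math.sqrt(sum(x * x for x in w))  # 1. row norm (assumed nonzero)
    unit = [x / alpha for x in w]
    u = [sum(r * x for r, x in zip(row, unit)) for row in rot]  # 2. rotate
    z = [math.sqrt(d) * x for x in u]  # 3. scale to unit variance
    idx = [bisect.bisect_right(BOUNDARIES, x) for x in z]  # 4. quantize
    return alpha, idx


def dequantize_row(alpha, idx, rot):
    """Step 5: centroid lookup, inverse rotation (transpose), rescaling."""
    d = len(idx)
    q = [CENTROIDS[k] for k in idx]
    back = [sum(rot[i][j] * q[i] for i in range(d)) for j in range(d)]
    return [alpha / math.sqrt(d) * x for x in back]


rng = random.Random(0)
d = 64
rot = random_rotation(d, rng)
w = [rng.gauss(0.0, 1.0) ** 3 for _ in range(d)]  # heavy-tailed toy weights
alpha, idx = quantize_row(w, rot)
w_hat = dequantize_row(alpha, idx, rot)
rel_err = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / sum(a * a for a in w)
print(f"relative squared error: {rel_err:.3f}")
```

With a 2-bit codebook the relative squared error should land near the 2-bit Lloyd-Max distortion (about 0.12 for a unit-variance Gaussian), with some scatter because a single group of 64 coordinates is a small sample.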

MSE Analysis

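Because orthogonal rotation preserves the Euclidean norm (and hence the Frobenius norm of the full matrix), the per-element reconstruction error factors cleanly:

$$\frac{1}{d}\,\mathbb{E}\,\|w_{g,m} - \hat{w}_{g,m}\|_2^2 = \frac{\alpha_{g,m}^2}{d^2}\,\mathbb{E}\,\|z_{g,m} - Q(z_{g,m})\|_2^2 \approx \frac{\alpha_{g,m}^2}{d}\, D_b$$

where $D_b = \mathbb{E}\big[(z - Q(z))^2\big]$ is the distortion of the $b$-bit Lloyd-Max quantizer on $z \sim \mathcal{N}(0, 1)$. The overall MSE is:

$$\mathrm{MSE} = \frac{1}{MN}\,\|W - \hat{W}\|_F^2 \approx \overline{\sigma^2}\, D_b$$

where $\overline{\sigma^2} = \frac{1}{MN}\sum_{g,m} \alpha_{g,m}^2 = \frac{1}{MN}\|W\|_F^2$ is the average squared norm per weight element.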

Lloyd-Max Distortion Values

b (bits)	L (levels)	D_b	SNR (dB)
1	2	0.3634	4.40
2	4	0.1175	9.30
3	8	0.03454	14.62
4	16	0.009497	20.22
5	32	0.002499	26.02

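Each additional bit cuts the distortion by a factor of roughly 3 to 4 (about 5 to 6 dB improvement), approaching the 6.02 dB-per-bit slope of the Shannon bound at higher bit-widths.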
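The $D_b$ column can be reproduced numerically. The sketch below runs the Lloyd-Max fixed-point iteration for a unit-variance Gaussian using only the Python standard library. `lloyd_max_gaussian` is a hypothetical helper name, and the quantile initialization and iteration count are arbitrary choices, not anything prescribed here.

```python
import math
from statistics import NormalDist


def _pdf(x):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * (x * x)) / math.sqrt(2 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))


def lloyd_max_gaussian(bits, iters=3000):
    """Return (centroids, boundaries, cell_probs, distortion) for N(0, 1)."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles (an arbitrary initialization).
    centroids = [NormalDist().inv_cdf((k + 0.5) / levels) for k in range(levels)]
    for _ in range(iters):
        # Boundaries are midpoints between neighbouring centroids.
        mids = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        bounds = [-math.inf] + mids + [math.inf]
        probs = [_cdf(hi) - _cdf(lo) for lo, hi in zip(bounds, bounds[1:])]
        # Each centroid becomes the conditional mean of its cell.
        centroids = [(_pdf(lo) - _pdf(hi)) / p
                     for lo, hi, p in zip(bounds, bounds[1:], probs)]
    # With centroids equal to the cell means, MSE = E[z^2] - sum(p * c^2).
    distortion = 1.0 - sum(p * c * c for p, c in zip(probs, centroids))
    return centroids, bounds, probs, distortion


for b in range(1, 6):
    d_b = lloyd_max_gaussian(b)[3]
    print(b, 2 ** b, f"{d_b:.4g}", f"{-10 * math.log10(d_b):.2f} dB")
```

The SNR column is simply $-10 \log_{10} D_b$, since the source has unit variance.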

[Figure: Lloyd-Max distortion $D_b$ compared with the Shannon bound $D^*(R) = 2^{-2R}$.]

Near-Optimality

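The Shannon rate-distortion function for $\mathcal{N}(0, 1)$ at rate $R$ bits is:

$$D^*(R) = 2^{-2R}$$

At $R = 4$ bits, $D^*(4) = 2^{-8} \approx 0.003906$. The Lloyd-Max quantizer achieves $D_4 = 0.009497$, giving a gap of only:

$$10 \log_{10} \frac{D_4}{D^*(4)} = 10 \log_{10} \frac{0.009497}{0.003906} \approx 3.9\ \text{dB}$$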

Why rotation makes this possible

Without rotation, trained neural network weights are correlated and non-Gaussian. Scalar quantization operating per-coordinate leaves inter-coordinate redundancy unexploited. The random rotation decorrelates the coordinates and makes them approximately i.i.d. Gaussian, which reduces the problem to the case where scalar Lloyd-Max is near-optimal. At 4 bits the gap to the theoretical optimum is only ~3.9 dB. Going by the $D_b$ values in the table above, the gap grows slowly with $b$, from about 1.6 dB at 1 bit to about 4.1 dB at 5 bits.
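The gap to the Shannon bound at each bit-width follows directly from the $D_b$ table. A short check using only the tabulated values (the name `shannon_gap_db` is illustrative):

```python
import math

# D_b values copied from the Lloyd-Max distortion table.
LLOYD_MAX_D = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}


def shannon_gap_db(bits):
    """Gap in dB between Lloyd-Max D_b and the Shannon bound 2^(-2b)."""
    shannon = 2.0 ** (-2 * bits)
    return 10.0 * math.log10(LLOYD_MAX_D[bits] / shannon)


for b in sorted(LLOYD_MAX_D):
    print(f"b={b}: gap = {shannon_gap_db(b):.2f} dB")
```

At $b = 4$ this evaluates to about 3.86 dB, which is the ~3.9 dB figure quoted above.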

Residual Quantization

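The single-pass error can be reduced by iteratively quantizing the reconstruction residual. Writing $\mathcal{Q}_b$ for the single-pass pipeline at $b$ bits, define the residual sequence:

$$R^{(0)} = W, \qquad \hat{W}^{(k)} = \mathcal{Q}_{b_k}\big(R^{(k-1)}\big), \qquad R^{(k)} = R^{(k-1)} - \hat{W}^{(k)}, \qquad k = 1, \dots, P$$

The final reconstruction is $\hat{W} = \sum_{k=1}^{P} \hat{W}^{(k)}$. After $P$ passes the total MSE is approximately:

$$\mathrm{MSE}_P \approx \overline{\sigma^2} \prod_{k=1}^{P} D_{b_k}$$

where $\overline{\sigma^2} = \frac{1}{MN}\|W\|_F^2$ and $b_k$ is the bit-width of pass $k$.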

Effective Bit-Rate Configurations

Config	Passes	Bits / weight
4-bit single	1	4
4+4 residual	2	8
4+2 residual	2	6
2+2+2+2	4	8
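If the MSE reduction of successive passes multiplies, the predicted relative MSE of each configuration is the product of the per-pass $D_b$ values. The snippet below evaluates this for the rows of the table. It assumes every pass behaves like an ideal Lloyd-Max quantizer on Gaussian data, which is an idealization, and the helper name `predicted_relative_mse` is hypothetical.

```python
import math

# D_b values copied from the Lloyd-Max distortion table.
LLOYD_MAX_D = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}

CONFIGS = {
    "4-bit single": [4],
    "4+4 residual": [4, 4],
    "4+2 residual": [4, 2],
    "2+2+2+2": [2, 2, 2, 2],
}


def predicted_relative_mse(bits_per_pass):
    """Product of per-pass distortions (multiplicative-MSE assumption)."""
    return math.prod(LLOYD_MAX_D[b] for b in bits_per_pass)


for name, passes in CONFIGS.items():
    mse = predicted_relative_mse(passes)
    print(f"{name:14s} bits={sum(passes)}  rel. MSE={mse:.3e}  "
          f"SNR={-10 * math.log10(mse):.1f} dB")
```

Under this model the two 8-bit configurations are not equivalent: 4+4 predicts a relative MSE of about $9.0 \times 10^{-5}$, while 2+2+2+2 predicts about $1.9 \times 10^{-4}$.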

Rotation Strategies

The choice of rotation seed(s) across residual passes has a significant impact on error decorrelation and inference efficiency.

Strategy	Seeds	Advantage	Disadvantage
Independent	$s_1, s_2, \dots, s_P$ (all distinct)	Errors are projected onto different subspaces, which maximizes error reduction per pass	Cannot merge passes in the rotated domain
Shared	$s_1 = s_2 = \dots = s_P$ (one seed)	Enables the fast-path merge, which reduces storage and inference cost to single-pass levels	Correlated errors across passes; reduced benefit per pass
Alternating	$s_1, s_2, s_1, \dots$ (two seeds)	Adjacent passes use different bases, giving near-independent error reduction with only two rotation matrices	Cannot use the fast-path merge
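One way to read the fast-path merge: consider a single row-group after $P$ passes, and write $\Pi^{(k)}$, $\alpha^{(k)}$ and $q^{(k)}$ for the rotation, stored norm and dequantized centroid vector of pass $k$ (notation introduced here for illustration only). The reconstruction is a sum of per-pass terms, and when all passes share one rotation $\Pi$ the transpose factors out of the sum:

$$\hat{w} = \sum_{k=1}^{P} \frac{\alpha^{(k)}}{\sqrt{d}}\, \Pi^{(k)\top} q^{(k)} \quad\xrightarrow{\ \Pi^{(k)} = \Pi\ }\quad \hat{w} = \Pi^{\top} \sum_{k=1}^{P} \frac{\alpha^{(k)}}{\sqrt{d}}\, q^{(k)}$$

The inner sum lives entirely in the rotated domain, so a single inverse rotation (or a single pre-rotation of the input) serves all passes. With independent or alternating seeds the $\Pi^{(k)}$ differ, so each distinct rotation has to be applied separately.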

Inference

Instead of materializing the full weight matrix, inference operates in the rotated domain using the pre-rotate input trick.

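For input $x \in \mathbb{R}^N$ and output $y = Wx \in \mathbb{R}^M$, substitute the reconstruction:

$$y_m \approx \sum_{g=1}^{G} \hat{w}_{g,m}^\top x_g = \sum_{g=1}^{G} \frac{\alpha_{g,m}}{\sqrt{d}} \left(\Pi_g^\top q_{g,m}\right)^\top x_g$$

where $x_g \in \mathbb{R}^d$ is the slice of $x$ in group $g$, $\alpha_{g,m}$ is the stored row norm, $\Pi_g$ is the group rotation, and $q_{g,m}$ is the vector of dequantized centroids. Pre-rotating the input reduces the inner product to a lookup + dot product:

$$y_m \approx \sum_{g=1}^{G} \frac{\alpha_{g,m}}{\sqrt{d}}\, q_{g,m}^\top \left(\Pi_g x_g\right)$$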

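Operation	Cost	Comment
Input rotation	$O(Nd)$ or $O(N \log d)$	Dense (QR-generated) matrix vs Hadamard transform
Codebook lookup	$O(MN)$	Index → centroid
Fused dot product	$O(MN)$	Sum of products
Norm rescaling	$O(MN/d)$	Multiply by $\alpha_{g,m}/\sqrt{d}$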

Total: $O(MN)$ per forward pass, the same asymptotic cost as a dense matmul, but with a much smaller memory footprint.
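The pre-rotate trick can be checked on a toy example. The sketch below uses a normalized fast Walsh-Hadamard transform as a stand-in for the group rotation. This is an assumption made for brevity: a plain Hadamard matrix is a fixed, symmetric orthogonal transform (so its transpose equals itself), not a Haar-random one. Random indices take the place of real quantized weights, and all function names are hypothetical. The rotated-domain product is compared against a reference that materializes each reconstructed row.

```python
import math
import random


def fwht(v):
    """Normalized fast Walsh-Hadamard transform, O(d log d); d a power of two."""
    v = list(v)
    d, h = len(v), 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / math.sqrt(d) for x in v]


def matvec_rotated(indices, norms, codebook, x, d):
    """y[m] = sum over groups of norm / sqrt(d) * <centroids, H x_g>."""
    x_rot = [fwht(x[g * d:(g + 1) * d]) for g in range(len(x) // d)]  # once
    y = []
    for row_idx, row_norms in zip(indices, norms):
        acc = 0.0
        for g, xr in enumerate(x_rot):
            dot = sum(codebook[k] * xv for k, xv in zip(row_idx[g], xr))
            acc += row_norms[g] / math.sqrt(d) * dot  # norm rescaling
        y.append(acc)
    return y


def matvec_reference(indices, norms, codebook, x, d):
    """Materialize w_hat = norm / sqrt(d) * H^T q per group, then dot with x."""
    y = []
    for row_idx, row_norms in zip(indices, norms):
        acc = 0.0
        for g, idx in enumerate(row_idx):
            w_rot = fwht([codebook[k] for k in idx])  # H^T = H for Hadamard
            xg = x[g * d:(g + 1) * d]
            dot = sum(a * b for a, b in zip(w_rot, xg))
            acc += row_norms[g] / math.sqrt(d) * dot
        y.append(acc)
    return y


rng = random.Random(0)
M, G, d = 3, 2, 8
codebook = [-1.510, -0.4528, 0.4528, 1.510]  # rounded 2-bit Lloyd-Max values
indices = [[[rng.randrange(4) for _ in range(d)] for _ in range(G)]
           for _ in range(M)]
norms = [[rng.uniform(0.5, 2.0) for _ in range(G)] for _ in range(M)]
x = [rng.gauss(0.0, 1.0) for _ in range(G * d)]
y_fast = matvec_rotated(indices, norms, codebook, x, d)
y_ref = matvec_reference(indices, norms, codebook, x, d)
print(max(abs(a - b) for a, b in zip(y_fast, y_ref)))
```

The two results agree up to floating-point rounding. The rotated-domain version transforms the input once per group and reuses it for every output row, whereas the reference applies a transform per row and group.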

Compact Summary

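The TurboQuant quantization objective can be written compactly as:

$$\min \ \frac{1}{MN} \Big\| W - \sum_{k=1}^{P} \mathcal{Q}_{b_k}\big(R^{(k-1)}\big) \Big\|_F^2, \qquad R^{(0)} = W, \quad R^{(k)} = R^{(k-1)} - \mathcal{Q}_{b_k}\big(R^{(k-1)}\big)$$

where the single-pass operator, applied to each row-group slice $w_{g,m}$ with norm $\alpha_{g,m} = \|w_{g,m}\|_2$ and group rotation $\Pi_g$, is:

$$\mathcal{Q}_b(w_{g,m}) = \frac{\alpha_{g,m}}{\sqrt{d}}\, \Pi_g^\top\, Q_b\!\left(\frac{\sqrt{d}}{\alpha_{g,m}}\, \Pi_g\, w_{g,m}\right)$$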

🔄

Rotation

Decorrelates weight coordinates → i.i.d. approximate 𝒩(0, 1/d)

📐

Normalization

Matches the codebook's design distribution → 𝒩(0, 1)

📊

Lloyd-Max

Optimal scalar quantizer for known distributions → minimal D_b

🎯

Residual passes

Exploit remaining structure → multiplicative MSE reduction

⚡

On-the-fly dequant

Preserves the b-bit memory advantage at inference

BPW Analysis

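Beyond MSE, the number of bits per weight (BPW) matters because it determines model size on disk and memory footprint. The total storage per weight element decomposes as:

$$\mathrm{BPW} = \underbrace{\sum_{k=1}^{P} b_k}_{\text{indices}} + \underbrace{P \cdot \frac{p_{\text{norm}}}{d}}_{\text{norms}} + \underbrace{\Delta_{\text{nq}}}_{\text{non-quantized layers}}$$

where $p_{\text{norm}}$ is the norm precision in bits, $d$ is the group size, and $\Delta_{\text{nq}}$ is the overhead of layers kept at full precision, amortized over all weights.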

Variables That Affect BPW

Variable	How it affects BPW	Default
Index bit-width (b)	Directly: BPW ∝ b	4
Residual passes (P)	Total index bits = Σ bₖ	1 or 2
Group size (d)	Norm overhead = 32/d per norm set	128
Norm precision	Overhead = p_norm/d (currently 32-bit)	float32
Number of norm sets	Each pass stores one set of norms	P
Non-quantized layers	Embeddings, LayerNorm, lm_head at full precision	model-dep

Budget Example: Qwen3.5-0.8B (4+2 residual)

Qwen3.5-0.8B · 4+2 residual · d=128 · ~7.02 BPW

Component	BPW	Share
Pass 1 indices	4	57%
Pass 2 indices	2	28.5%
Pass 1 norms	0.25	3.6%
Pass 2 norms	0.25	3.6%
Non-quantized	0.52	7.3%
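The budget above follows from simple arithmetic: index bits per pass, plus one norm of `norm_bits` bits per group of `group_size` weights per pass, plus the non-quantized overhead. A small helper makes the effect of each variable easy to explore. The name `bits_per_weight` is hypothetical, and the 0.52 BPW non-quantized overhead is taken from the example as a given.

```python
def bits_per_weight(index_bits, group_size=128, norm_bits=32, non_quantized=0.0):
    """Effective BPW: index bits + per-pass norm overhead + fixed overhead."""
    indices = sum(index_bits)
    norms = len(index_bits) * norm_bits / group_size  # one norm set per pass
    return indices + norms + non_quantized


current = bits_per_weight([4, 2], non_quantized=0.52)
fp16_norms = bits_per_weight([4, 2], norm_bits=16, non_quantized=0.52)
print(f"4+2, fp32 norms, d=128: {current:.2f} BPW")     # 7.02
print(f"4+2, fp16 norms, d=128: {fp16_norms:.2f} BPW")  # 6.77
```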

Research Directions

Seven ideas to push effective BPW lower while preserving quality. Combined potential: with fp16 norms, a larger group size $d$, a 3+2 config, and entropy coding, effective BPW could reach ~4.5 at quality comparable to the current 4+2 config (~7 BPW).

🔢

Reduce Norm Precision

−0.25 BPW

Store norms in float16/bfloat16 instead of float32, halving the overhead to $16/d$ BPW per norm set (0.125 at $d = 128$). Low risk for typical weight magnitudes.

📏

Larger Group Size

−0.13 BPW / pass

Increase $d$ from 128 to 256 or 512. The Gaussian approximation improves with larger $d$ (better concentration of measure).

⬇️

Sub-4-bit Primary

−1.0 BPW

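Use 3-bit for the primary pass and rely on residual passes for quality. A 3+2 config uses 5 index bits per weight, and under the multiplicative MSE model its distortion ($D_3 D_2 \approx 0.0041$) is comparable to or better than that of 4-bit single ($D_4 \approx 0.0095$).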

🎚️

Non-Uniform Bit Allocation

−0.5–1.0 BPW

Assign more bits to sensitive layers (attention Q/K) and fewer bits to less sensitive ones (MLP). Solvable via dynamic programming.
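One way to set this up as a dynamic program is sketched below, under assumptions that are not part of the text above: each layer has a parameter count `sizes[i]` and a scalar sensitivity `sens[i]`, its loss contribution is modeled as `sens[i] * D_b`, and the total index-bit budget is an integer. The function name `allocate_bits`, the cost model, and the toy numbers are all hypothetical.

```python
# D_b values copied from the Lloyd-Max distortion table.
LLOYD_MAX_D = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}


def allocate_bits(sizes, sens, budget):
    """Minimize sum(sens[i] * D[b_i]) subject to sum(sizes[i] * b_i) <= budget.

    Returns (cost, bits_per_layer), or None if even 1 bit per weight in
    every layer exceeds the budget.
    """
    # best[used] = (cost, allocation): cheapest way to spend exactly `used`
    # bits on the layers processed so far.
    best = {0: (0.0, [])}
    for size, s in zip(sizes, sens):
        nxt = {}
        for used, (cost, alloc) in best.items():
            for b, dist in LLOYD_MAX_D.items():
                u = used + size * b
                if u > budget:
                    continue
                cand = (cost + s * dist, alloc + [b])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        best = nxt
    return min(best.values(), key=lambda t: t[0]) if best else None


# Toy example: four layers (sizes in arbitrary units), average budget 3 bits.
sizes = [4, 2, 3, 1]
sens = [5.0, 1.0, 2.0, 0.5]
cost, bits = allocate_bits(sizes, sens, budget=3 * sum(sizes))
print(bits, f"cost={cost:.4f}")
```

The state is the number of bits spent so far, so the table size is bounded by the budget. For real layer sizes the budget would need to be expressed in coarser units (for example, bits per weight times a common size divisor) to keep the table small.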

🗜️

Entropy Coding

−0.24 BPW

Lloyd-Max indices are not uniformly distributed (inner levels are more probable). At 4 bits, the Shannon entropy of the indices is ~3.76 bits vs 4 allocated. ANS or Huffman coding recovers most of the gap.
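The ~3.76-bit figure can be checked by designing the 4-bit Lloyd-Max quantizer for a unit-variance Gaussian and computing the entropy of its cell probabilities. A minimal sketch, with a hypothetical helper name, standard library only, and an arbitrary initialization and iteration count:

```python
import math
from statistics import NormalDist


def _pdf(x):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * (x * x)) / math.sqrt(2 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))


def lloyd_max_cell_probs(bits, iters=3000):
    """Cell probabilities of the Lloyd-Max quantizer for N(0, 1)."""
    levels = 2 ** bits
    cents = [NormalDist().inv_cdf((k + 0.5) / levels) for k in range(levels)]
    for _ in range(iters):
        mids = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        bounds = [-math.inf] + mids + [math.inf]
        probs = [_cdf(hi) - _cdf(lo) for lo, hi in zip(bounds, bounds[1:])]
        # Move each centroid to the conditional mean of its cell.
        cents = [(_pdf(lo) - _pdf(hi)) / p
                 for lo, hi, p in zip(bounds, bounds[1:], probs)]
    return probs


probs = lloyd_max_cell_probs(4)
entropy = -sum(p * math.log2(p) for p in probs)
print(f"index entropy at 4 bits: {entropy:.3f} bits (vs 4 allocated)")
```

The difference between this entropy and the 4 allocated bits is what an entropy coder can recover. Huffman coding gets close but is limited to an integer number of bits per symbol, while ANS can approach the entropy more tightly.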

📉

Delta Norms

−0.19 BPW

Residual norms are much smaller. Store them as ratios of pass-1 norms, quantized to 8-bit.

🚫

Norm-Free Quantization

variable

Absorb the norm into a per-group scaled codebook: $c_k^{(g)} = s_g\, c_k$. Trades per-row norms for a single per-group scale factor $s_g$.
