
Quantization Formulation

Mathematical foundations: from problem statement to near-optimal compression in b bits per weight.

Problem Statement

Given a pre-trained weight matrix $W \in \mathbb{R}^{M \times N}$, find a compressed representation $\hat{W}$ that minimizes the mean squared reconstruction error, subject to using only $b$ bits per element plus a small side-information budget (norms, codebook, seed).

Notation

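| Symbol | Meaning |
|---|---|
| $W \in \mathbb{R}^{M \times N}$ | Full-precision weight matrix (M = out, N = in) |
| $b$ | Bit-width of the quantizer ($L = 2^b$ levels) |
| $d$ | Group size (columns processed together) |
| $G = N/d$ | Number of groups |
| $\Pi_g \in \mathbb{R}^{d \times d}$ | Random orthogonal rotation for group $g$ |
| $\alpha_{g,m}$ | Row norm of group $g$, row $m$ |
| $Q$ | Scalar quantizer mapping $\mathbb{R}$ to $L$ centroids |
| $\{c_1, \dots, c_L\}$ | Lloyd-Max codebook (centroids) |
| $\{t_0, \dots, t_L\}$ | Decision boundaries ($t_0 = -\infty$, $t_L = +\infty$) |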

Single-Pass Pipeline

For each group $g \in \{1, \dots, G\}$ and each row $m \in \{1, \dots, M\}$, the pipeline proceeds in five steps. The $N$ columns are partitioned into $G = N/d$ groups of size $d$, and each group is quantized independently.

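1. Row Normalization

Extract the row norm $\alpha_{g,m} = \|w_{g,m}\|_2$ and normalize, $u_{g,m} = w_{g,m} / \alpha_{g,m}$, where $w_{g,m} \in \mathbb{R}^d$ is the slice of row $m$ that falls in group $g$. After this step, $\|u_{g,m}\|_2 = 1$ and each component has typical magnitude $1/\sqrt{d}$.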

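2. Random Rotation

Apply a random orthogonal transform $\Pi_g$ drawn from the Haar measure on the orthogonal group $O(d)$ to the normalized row: $y_{g,m} = \Pi_g u_{g,m}$. Because $\Pi_g$ is orthogonal, the norm is preserved ($\|y_{g,m}\|_2 = 1$) and each component approximately satisfies $y_i \sim \mathcal{N}(0, 1/d)$.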

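3. Variance Normalization

Scale the rotated vector to unit variance, $z_{g,m} = \sqrt{d}\, y_{g,m}$, so each scalar approximately satisfies $z_i \sim \mathcal{N}(0, 1)$, matching the distribution the Lloyd-Max codebook is designed for.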

4. Lloyd-Max Quantization

Each scalar $z_i$ is independently quantized using the optimal boundaries: $Q(z_i) = c_k$ when $t_{k-1} < z_i \le t_k$, and only the index $k$ is stored. At 4 bits: 16 centroids, 15 finite decision boundaries.

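5. Reconstruction

Undo the rotation and rescale by the stored norm to obtain the quantized approximation in the original coordinate space: $\hat{w}_{g,m} = \frac{\alpha_{g,m}}{\sqrt{d}}\, \Pi_g^\top Q(z_{g,m})$, with $Q$ applied elementwise.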
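The five steps can be sketched for a single group-row in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the rotation is sampled by Gram-Schmidt on Gaussian vectors (one way to obtain a random orthogonal matrix), and the codebook is the standard tabulated 2-bit Lloyd-Max quantizer for N(0, 1), chosen only to keep the table short.

```python
import math
import random

# Tabulated 2-bit Lloyd-Max quantizer for N(0, 1) (assumed here for brevity).
CENTROIDS = [-1.510, -0.4528, 0.4528, 1.510]
BOUNDARIES = [-0.9816, 0.0, 0.9816]  # finite boundaries t_1..t_3


def random_rotation(d, rng):
    """Sample a d x d orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:  # remove the components along earlier rows
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def quantize_row(w, rot):
    """Steps 1-4: return the stored norm and the 2-bit indices."""
    d = len(w)
    alpha = math.sqrt(sum(x * x for x in w))                 # 1. row norm
    u = [x / alpha for x in w]                                #    unit vector
    y = [sum(r * x for r, x in zip(row, u)) for row in rot]   # 2. y = Pi u
    z = [math.sqrt(d) * v for v in y]                         # 3. unit variance
    idx = [sum(v > t for t in BOUNDARIES) for v in z]         # 4. cell index
    return alpha, idx


def dequantize_row(alpha, idx, rot):
    """Step 5: w_hat = (alpha / sqrt(d)) * Pi^T q."""
    d = len(idx)
    q = [CENTROIDS[k] for k in idx]
    back = [sum(rot[i][j] * q[i] for i in range(d)) for j in range(d)]
    return [alpha / math.sqrt(d) * v for v in back]


rng = random.Random(0)
d = 64
w = [rng.gauss(0.0, 0.02) for _ in range(d)]
rot = random_rotation(d, rng)
alpha, idx = quantize_row(w, rot)
w_hat = dequantize_row(alpha, idx, rot)
rel_err = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / sum(a * a for a in w)
print(f"relative squared error: {rel_err:.3f}")
```

With a 2-bit codebook the relative squared error should land near the tabulated 2-bit distortion, up to sampling noise at this small group size.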

MSE Analysis

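Because orthogonal rotation preserves norms, the per-element reconstruction error of each group-row factors cleanly:

$$\frac{1}{d}\, \mathbb{E}\, \|w_{g,m} - \hat{w}_{g,m}\|_2^2 = \frac{\alpha_{g,m}^2}{d} \cdot D_b, \qquad D_b = \mathbb{E}\big[(Z - Q(Z))^2\big], \quad Z \sim \mathcal{N}(0, 1)$$

where $\alpha_{g,m}$ is the stored row norm and $D_b$ is the distortion of the $b$-bit Lloyd-Max quantizer on $\mathcal{N}(0, 1)$. The overall MSE is:

$$\mathrm{MSE} = \frac{1}{MN}\, \mathbb{E}\, \|W - \hat{W}\|_F^2 \approx D_b \cdot \bar{\sigma}^2, \qquad \bar{\sigma}^2 = \frac{1}{MN} \sum_{g,m} \alpha_{g,m}^2 = \frac{\|W\|_F^2}{MN}$$

where $\bar{\sigma}^2$ is the average squared norm per weight element.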

Lloyd-Max Distortion Values

| b (bits) | L (levels) | D_b | SNR (dB) |
|---|---|---|---|
| 1 | 2 | 0.3634 | 4.40 |
| 2 | 4 | 0.1175 | 9.30 |
| 3 | 8 | 0.03454 | 14.62 |
| 4 | 16 | 0.009497 | 20.22 |
| 5 | 32 | 0.002499 | 26.02 |

Each additional bit reduces the distortion by a factor of roughly 3 to 4 (about 5 to 6 dB), approaching the asymptotic factor of 4 (6.02 dB) per bit.
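The table values can be reproduced with a short Lloyd iteration on the standard normal density. The sketch below is an illustration rather than the project's codebook generator: the function name, the uniform initialization on [-3, 3], and the iteration count are assumptions. It uses the closed-form conditional mean of a Gaussian on an interval, so no sampling is involved.

```python
import math


def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(bits, iters=2000):
    """Return (centroids, boundaries, distortion) for N(0, 1)."""
    levels = 2 ** bits
    c = [-3.0 + 6.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Boundaries are midpoints between neighbouring centroids.
        t = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        # Centroids are conditional means: E[Z | lo < Z < hi].
        c = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)) for lo, hi in zip(t, t[1:])]
    probs = [cdf(hi) - cdf(lo) for lo, hi in zip(t, t[1:])]
    # With conditional-mean centroids, D = E[Z^2] - sum_k p_k * c_k^2.
    distortion = 1.0 - sum(p * ck * ck for p, ck in zip(probs, c))
    return c, t, distortion


for b in range(1, 6):
    _, _, D = lloyd_max_gaussian(b)
    print(b, 2 ** b, f"{D:.6f}", f"{10 * math.log10(1 / D):.2f} dB")
```

The SNR column is simply $10 \log_{10}(1/D_b)$, since the source has unit variance.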

[Figure: distortion $D_b$ (log scale, 1e-3 to 0.3) versus bits $b$ (1 to 5), comparing the Lloyd-Max $D_b$ with the Shannon bound $D^*(R) = 2^{-2R}$.]

Near-Optimality

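The Shannon rate-distortion function for $\mathcal{N}(0, 1)$ at rate $R$ bits is:

$$D^*(R) = 2^{-2R}$$

At $R = 4$ bits, $D^*(4) = 2^{-8} \approx 0.00391$. The Lloyd-Max quantizer achieves $D_4 = 0.009497$, giving a gap of only:

$$10 \log_{10} \frac{D_4}{D^*(4)} = 10 \log_{10} \frac{0.009497}{0.003906} \approx 3.9\ \text{dB}$$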

Why rotation makes this possible

Without rotation, trained neural network weights are correlated and non-Gaussian, and scalar quantization operating per coordinate leaves inter-coordinate redundancy unexploited. The random rotation decorrelates the coordinates and makes them approximately i.i.d. Gaussian, reducing the problem to the case where scalar Lloyd-Max is near-optimal. At 4 bits the gap is only ~3.9 dB from the theoretical optimum. By the distortion table, the gap is smaller at lower bit-widths (about 1.6 dB at $b = 1$) and grows slowly with $b$, staying below the high-resolution limit of about 4.35 dB for fixed-rate scalar quantization of a Gaussian source.
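The gap to the Shannon bound can be checked directly from the tabulated distortions. The snippet below is a small arithmetic sketch; the dictionary simply copies the values from the Lloyd-Max table, and the helper names are illustrative.

```python
import math

# Distortion values copied from the Lloyd-Max table in this document.
LLOYD_MAX_D = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}


def shannon_distortion(rate_bits):
    """Rate-distortion bound for a unit-variance Gaussian: D*(R) = 2^(-2R)."""
    return 2.0 ** (-2 * rate_bits)


def gap_db(bits):
    """Gap between Lloyd-Max and the Shannon bound, in dB."""
    return 10.0 * math.log10(LLOYD_MAX_D[bits] / shannon_distortion(bits))


gaps = {b: gap_db(b) for b in LLOYD_MAX_D}
for b, g in gaps.items():
    print(f"b={b}: D_b={LLOYD_MAX_D[b]:.6f}  D*={shannon_distortion(b):.6f}  gap={g:.2f} dB")
```

Printing the gap for every bit-width makes it easy to read off how it varies with $b$.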

Residual Quantization

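The single-pass error can be reduced by iteratively quantizing the reconstruction residual. Define the residual sequence:

$$E^{(0)} = W, \qquad \hat{W}^{(k)} = \mathcal{T}_{b_k}\big(E^{(k-1)}\big), \qquad E^{(k)} = E^{(k-1)} - \hat{W}^{(k)}, \qquad k = 1, \dots, P$$

where $\mathcal{T}_{b}$ denotes one run of the single-pass pipeline at $b$ bits, and the final reconstruction is $\hat{W} = \sum_{k=1}^{P} \hat{W}^{(k)}$. After $P$ passes the total MSE is approximately:

$$\mathrm{MSE}_P \approx \bar{\sigma}^2 \prod_{k=1}^{P} D_{b_k}$$

where $\bar{\sigma}^2 = \|W\|_F^2 / (MN)$ is the average squared weight and $D_{b_k}$ is the Lloyd-Max distortion at $b_k$ bits. The approximation assumes each pass sees a rotated residual that is again close to Gaussian.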

Effective Bit-Rate Configurations

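| Config | Passes | Bits / weight | Expected MSE ratio |
|---|---|---|---|
| 4-bit single | 1 | 4 | $D_4 \approx 9.5 \times 10^{-3}$ |
| 4+4 residual | 2 | 8 | $D_4^2 \approx 9.0 \times 10^{-5}$ |
| 4+2 residual | 2 | 6 | $D_4 D_2 \approx 1.1 \times 10^{-3}$ |
| 2+2+2+2 | 4 | 8 | $D_2^4 \approx 1.9 \times 10^{-4}$ |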
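The bit-rate and expected MSE ratio of each configuration follow from the product rule for residual passes. The sketch below evaluates that rule using the distortion values from the Lloyd-Max table; the helper name is illustrative, and the numbers are predictions of the approximation, not measurements.

```python
import math

# Distortion values copied from the Lloyd-Max table in this document.
LLOYD_MAX_D = {1: 0.3634, 2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}

CONFIGS = {
    "4-bit single": [4],
    "4+4 residual": [4, 4],
    "4+2 residual": [4, 2],
    "2+2+2+2": [2, 2, 2, 2],
}


def expected_mse_ratio(bits_per_pass):
    """Product of per-pass distortions (MSE relative to the mean squared weight)."""
    return math.prod(LLOYD_MAX_D[b] for b in bits_per_pass)


rows = {
    name: (len(bits), sum(bits), expected_mse_ratio(bits))
    for name, bits in CONFIGS.items()
}
for name, (passes, total_bits, ratio) in rows.items():
    print(f"{name:14s} passes={passes} bits={total_bits} ratio={ratio:.3e}")
```

Under this approximation, two configurations with the same index bit budget (4+4 and 2+2+2+2) can still differ in expected error.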

Rotation Strategies

The choice of rotation seed(s) across residual passes has a significant impact on error decorrelation and inference efficiency.

| Strategy | Seeds | Advantage | Disadvantage |
|---|---|---|---|
| Independent | One distinct seed per pass ($s_1, \dots, s_P$) | Errors projected onto different subspaces; maximizes error reduction per pass | Cannot merge passes in the rotated domain |
| Shared | One seed for all passes ($s_1 = \dots = s_P$) | Enables fast-path merge; reduces storage and inference cost to single-pass levels | Correlated errors across passes; reduced benefit per pass |
| Alternating (two seeds) | $s_1, s_2, s_1, s_2, \dots$ | Adjacent passes use different bases; near-independent error reduction with only two rotation matrices | Cannot use the fast-path merge |
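The three strategies differ only in how a seed is assigned to each pass. As a purely illustrative sketch (the function names and the `base_seed + k` convention are hypothetical, not taken from any implementation), the assignment and the merge condition could look like this:

```python
def pass_seeds(strategy, num_passes, base_seed=0):
    """Return one rotation seed per residual pass (hypothetical helper)."""
    if strategy == "independent":
        return [base_seed + k for k in range(num_passes)]
    if strategy == "shared":
        return [base_seed] * num_passes
    if strategy == "alternating":
        return [base_seed + (k % 2) for k in range(num_passes)]
    raise ValueError(f"unknown strategy: {strategy}")


def can_merge_in_rotated_domain(seeds):
    """Passes can only be summed before un-rotating if they share one rotation."""
    return len(set(seeds)) == 1


for name in ("independent", "shared", "alternating"):
    seeds = pass_seeds(name, num_passes=4)
    print(name, seeds, can_merge_in_rotated_domain(seeds))
```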

Inference

Instead of materializing the full weight matrix, inference operates in the rotated domain using the pre-rotate input trick.

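For input $x \in \mathbb{R}^N$ and output $y = \hat{W} x \in \mathbb{R}^M$, substitute the reconstruction $\hat{w}_{g,m} = \frac{\alpha_{g,m}}{\sqrt{d}}\, \Pi_g^\top q_{g,m}$, where $q_{g,m}$ is the vector of looked-up centroids and $x_g \in \mathbb{R}^d$ is the slice of $x$ in group $g$:

$$y_m = \sum_{g=1}^{G} \hat{w}_{g,m}^\top x_g = \sum_{g=1}^{G} \frac{\alpha_{g,m}}{\sqrt{d}}\, q_{g,m}^\top\, \Pi_g x_g$$

Pre-rotating the input once per group, $\tilde{x}_g = \Pi_g x_g$, reduces the inner product to a lookup + dot product:

$$y_m = \sum_{g=1}^{G} \frac{\alpha_{g,m}}{\sqrt{d}} \sum_{i=1}^{d} c_{k_{g,m,i}}\, \tilde{x}_{g,i}$$

where $k_{g,m,i}$ is the stored $b$-bit index.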

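| Operation | Cost | Comment |
|---|---|---|
| Input rotation | $O(Nd)$ or $O(N \log d)$ | QR vs Hadamard |
| Codebook lookup | $O(MN)$ | Index → centroid |
| Fused dot product | $O(MN)$ | Sum of products |
| Norm rescaling | $O(MN/d)$ | Multiply by $\alpha_{g,m} / \sqrt{d}$ |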

Total: $O(MN)$ per forward pass (for $d \le M$), the same asymptotic cost as a dense matmul, but with a much smaller memory footprint.
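The pre-rotate input trick rests on the identity $(\Pi^\top q)^\top x = q^\top (\Pi x)$. The sketch below checks it numerically on a tiny example by comparing a reference path that materializes each reconstructed row against the rotated-domain path. Everything here is illustrative: the sizes, the random indices and norms, the Gram-Schmidt rotation, and the 2-bit centroid table are assumptions made for the demonstration.

```python
import math
import random

CENTROIDS = [-1.510, -0.4528, 0.4528, 1.510]  # tabulated 2-bit Lloyd-Max levels


def random_rotation(d, rng):
    """Sample a d x d orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


rng = random.Random(1)
M, d, G = 3, 8, 2  # output rows, group size, number of groups (N = G * d)
rots = [random_rotation(d, rng) for _ in range(G)]
idx = [[[rng.randrange(4) for _ in range(d)] for _ in range(G)] for _ in range(M)]
alpha = [[rng.uniform(0.5, 2.0) for _ in range(G)] for _ in range(M)]
x = [rng.gauss(0.0, 1.0) for _ in range(G * d)]
scale = 1.0 / math.sqrt(d)


def forward_materialized():
    """Reference: rebuild w_hat = (alpha / sqrt(d)) * Pi^T q, then dot with x."""
    y = []
    for m in range(M):
        acc = 0.0
        for g in range(G):
            q = [CENTROIDS[k] for k in idx[m][g]]
            w_hat = [alpha[m][g] * scale * sum(rots[g][i][j] * q[i] for i in range(d))
                     for j in range(d)]
            acc += dot(w_hat, x[g * d:(g + 1) * d])
        y.append(acc)
    return y


def forward_rotated():
    """Fast path: rotate the input once per group, then lookup + dot product."""
    x_rot = [[dot(row, x[g * d:(g + 1) * d]) for row in rots[g]] for g in range(G)]
    y = []
    for m in range(M):
        acc = 0.0
        for g in range(G):
            acc += alpha[m][g] * scale * sum(
                CENTROIDS[k] * xr for k, xr in zip(idx[m][g], x_rot[g]))
        y.append(acc)
    return y


y_ref = forward_materialized()
y_fast = forward_rotated()
print(max(abs(a - b) for a, b in zip(y_ref, y_fast)))
```

The rotated input is computed once and reused across all $M$ output rows, which is where the saving over materializing every row comes from.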

Compact Summary

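The TurboQuant quantization objective can be written compactly as:

$$\min_{\hat{W}}\ \frac{1}{MN} \|W - \hat{W}\|_F^2 \quad \text{with} \quad \hat{W} = \sum_{k=1}^{P} \mathcal{T}_{b_k}\big(E^{(k-1)}\big), \quad E^{(0)} = W, \quad E^{(k)} = E^{(k-1)} - \mathcal{T}_{b_k}\big(E^{(k-1)}\big)$$

where the single-pass operator is, for each group-row slice $w \in \mathbb{R}^d$ with rotation $\Pi$ and $b$-bit quantizer $Q_b$:

$$\mathcal{T}_b(w) = \frac{\|w\|_2}{\sqrt{d}}\, \Pi^\top Q_b\!\left(\sqrt{d}\, \Pi\, \frac{w}{\|w\|_2}\right)$$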

Rotation

Decorrelates weight coordinates → i.i.d. approximate 𝒩(0, 1/d)

Normalization

Matches the codebook's design distribution → 𝒩(0, 1)

Lloyd-Max

Optimal scalar quantizer for known distributions → minimal D_b

Residual passes

Exploit remaining structure → multiplicative MSE reduction

On-the-fly dequant

Preserves the b-bit memory advantage at inference

BPW Analysis

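Beyond MSE, bits per weight (BPW) determines model size on disk and memory footprint, so it is the second quantity to minimize. The total storage per weight element decomposes as:

$$\mathrm{BPW} = \underbrace{\sum_{k=1}^{P} b_k}_{\text{indices}} + \underbrace{P \cdot \frac{p_{\text{norm}}}{d}}_{\text{norms}} + \underbrace{\Delta_{\text{nq}}}_{\text{non-quantized layers}}$$

where $b_k$ is the index bit-width of pass $k$, $p_{\text{norm}}$ is the norm precision in bits, $d$ is the group size, and $\Delta_{\text{nq}}$ is the amortized overhead of layers kept at full precision.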

Variables That Affect BPW

| Variable | How it affects BPW | Default |
|---|---|---|
| Index bit-width (b) | Directly: BPW ∝ b | 4 |
| Residual passes (P) | Total index bits = Σ bₖ | 1 or 2 |
| Group size (d) | Norm overhead = 32/d per norm set | 128 |
| Norm precision | Overhead = p_norm/d (currently 32-bit) | float32 |
| Number of norm sets | Each pass stores one set of norms | P |
| Non-quantized layers | Embeddings, LayerNorm, lm_head at full precision | model-dependent |

Budget Example: Qwen3.5-0.8B (4+2 residual)

Qwen3.5-0.8B · 4+2 residual · d=128: ~7.02 BPW

| Component | BPW | Share |
|---|---|---|
| Pass 1 indices | 4 | 57% |
| Pass 2 indices | 2 | 28.5% |
| Pass 1 norms | 0.25 | 3.6% |
| Pass 2 norms | 0.25 | 3.6% |
| Non-quantized | 0.52 | 7.3% |
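The budget can be reproduced from the decomposition of BPW into index bits, norm overhead, and non-quantized overhead. The helper below is an illustrative sketch (its name and signature are assumptions); the 0.52 BPW for non-quantized layers is taken from the budget above, not computed.

```python
def bits_per_weight(index_bits, group_size=128, norm_bits=32, non_quantized=0.0):
    """index_bits: list with one index bit-width per residual pass."""
    norm_overhead = len(index_bits) * norm_bits / group_size  # one norm set per pass
    return sum(index_bits) + norm_overhead + non_quantized


current = bits_per_weight([4, 2], non_quantized=0.52)
fp16_norms = bits_per_weight([4, 2], norm_bits=16, non_quantized=0.52)
print(f"{current:.2f} {fp16_norms:.2f}")  # prints: 7.02 6.77
```

The second call shows the effect of changing only the norm precision while keeping everything else fixed.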

Research Directions

Seven ideas to push effective BPW lower while preserving quality. Combined potential: with fp16 norms, a larger group size $d$, a 3+2 config, and entropy coding, effective BPW could reach ~4.5 at quality comparable to the current 4+2 config (~7 BPW).

Reduce Norm Precision
−0.25 BPW

Store norms in float16/bfloat16 instead of float32, halving the overhead to $16/d$ BPW per norm set (0.125 at $d = 128$, so 0.25 saved across two passes). Low risk for typical weight magnitudes.

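Larger Group Size
−0.13 BPW / pass

Increase $d$ from 128 to 256 or 512. Going from 128 to 256 with float32 norms cuts the overhead from $32/128 = 0.25$ to $32/256 = 0.125$ BPW per pass. The Gaussian approximation also improves with larger $d$ (better concentration of measure).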

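Sub-4-bit Primary
−1.0 BPW

Use 3-bit for the primary pass and rely on residual passes for quality. A 3+2 config uses 5 index bits per weight, with an expected MSE ratio of $D_3 D_2 \approx 0.0041$ under the product approximation, compared with $D_4 \approx 0.0095$ for 4-bit single and $D_4 D_2 \approx 0.0011$ for 4+2.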

Non-Uniform Bit Allocation
−0.5 to 1.0 BPW

Assign more bits to sensitive layers (attention Q/K) and fewer bits to less sensitive ones (MLP). Solvable via dynamic programming.
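One way to set this up as a dynamic program is a knapsack over layers, sketched below. All inputs are hypothetical: the layer names, sizes, and sensitivity weights are made up for illustration, and the cost model (sensitivity × size × $D_b$, with $D_b$ from the Lloyd-Max table) is an assumption rather than a measured quality metric.

```python
# Distortion values copied from the Lloyd-Max table in this document.
LLOYD_MAX_D = {2: 0.1175, 3: 0.03454, 4: 0.009497, 5: 0.002499}

# Hypothetical layers: (name, size in arbitrary units, sensitivity weight).
LAYERS = [
    ("attn_q", 1, 10.0),
    ("attn_k", 1, 10.0),
    ("mlp_up", 4, 1.0),
    ("mlp_down", 4, 1.0),
]


def allocate_bits(layers, avg_bits):
    """Minimize sum(sensitivity * size * D_b) subject to a total bit budget."""
    budget = avg_bits * sum(size for _, size, _ in layers)
    # State: total bits used so far -> (best cost, bit-width chosen per layer).
    states = {0: (0.0, [])}
    for _, size, sens in layers:
        nxt = {}
        for used, (cost, choice) in states.items():
            for b, dist in LLOYD_MAX_D.items():
                new_used = used + size * b
                if new_used > budget:
                    continue  # over budget, prune
                cand = (cost + sens * size * dist, choice + [b])
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        states = nxt
    return min(states.values(), key=lambda s: s[0])


best_cost, best_bits = allocate_bits(LAYERS, avg_bits=3)
print(dict(zip([name for name, _, _ in LAYERS], best_bits)), round(best_cost, 4))
```

Keying the state on the exact number of bits used keeps the search exact for integer sizes; real layer sizes would likely need to be bucketed to keep the state space small.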

Entropy Coding
−0.24 BPW

Lloyd-Max indices are not uniformly distributed (inner levels are more probable). At 4 bits the Shannon entropy of the indices is ~3.76 bits versus 4 allocated. ANS or Huffman coding recovers most of the gap.
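The index entropy can be estimated by designing the 4-bit quantizer for N(0, 1) and summing over the cell probabilities. The sketch below does this with a plain Lloyd iteration; the function name, initialization, and iteration count are assumptions, and it models ideal Gaussian inputs rather than measured index statistics.

```python
import math


def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_cell_probs(bits, iters=2000):
    """Cell probabilities of the Lloyd-Max quantizer for N(0, 1)."""
    levels = 2 ** bits
    c = [-3.0 + 6.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        t = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        c = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)) for lo, hi in zip(t, t[1:])]
    return [cdf(hi) - cdf(lo) for lo, hi in zip(t, t[1:])]


probs = lloyd_max_cell_probs(4)
entropy = -sum(p * math.log2(p) for p in probs)
print(f"index entropy: {entropy:.3f} bits (4 allocated)")
print(f"innermost cell p={probs[7]:.4f}, outermost cell p={probs[0]:.4f}")
```

The difference between 4 bits and the computed entropy is the most an ideal entropy coder could save per index under this Gaussian model.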

Delta Norms
−0.19 BPW

Residual norms are much smaller than pass-1 norms. Store them as ratios to the pass-1 norms, quantized to 8 bits ($8/d$ instead of $32/d$ BPW, saving about 0.19 BPW at $d = 128$).

Norm-Free Quantization
variable

Absorb the norm into a per-group scaled codebook: $c_k^{(g)} = s_g \cdot c_k$, where $s_g$ is a scale shared by all rows in group $g$. Trades per-row norms for a single per-group scale factor.