⚙️ Quantization Pipeline

Each nn.Linear weight matrix W ∈ ℝᴹˣᴺ is compressed from bf16/fp32 to 4-bit packed indices in five steps.

Pipeline Overview

1. 📐 Normalize: W / ‖W‖₂ → unit norm
2. 🔄 Rotate: W·Πᵀ → 𝒩(0, 1/d)
3. 📏 Scale: × √d → 𝒩(0, 1)
4. 🎯 Quantize: Lloyd-Max → 4-bit idx
5. 📦 Pack: 2 indices → 1 byte

Step-by-Step

Step 1: Row Normalization

Each row of the group slice is divided by its ℓ₂-norm. The norm α is stored separately and applied during inference.
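The row-normalization step can be sketched as follows; `normalize_rows` is a hypothetical helper name, not an identifier from the codebase:

```python
import numpy as np

def normalize_rows(W: np.ndarray):
    # Divide each row by its l2-norm; return the norms (the alpha
    # values) so they can be re-applied at inference time.
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # shape (M, 1)
    return W / norms, norms.squeeze(1)

W = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
W_unit, alpha = normalize_rows(W)
# Every row of W_unit now has unit l2-norm; alpha recovers W exactly.
```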

Step 2: Random Rotation

A random orthogonal matrix Πg (one per group) decorrelates the weight coordinates. After rotation, each coordinate of a unit-norm row is approximately i.i.d. 𝒩(0, 1/d), where d is the group dimension.
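One standard way to generate such a matrix (an assumption here; the source does not specify how Πg is constructed) is QR decomposition of a Gaussian matrix, with a sign fix so the result is Haar-uniform:

```python
import numpy as np

def random_rotation(d: int, seed: int) -> np.ndarray:
    # QR-decompose a Gaussian matrix to obtain a random orthogonal
    # matrix; fixing the column signs makes it Haar-uniform.
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

d = 64
Pi = random_rotation(d, seed=0)
x = np.ones(d) / np.sqrt(d)   # a unit vector
y = Pi @ x                    # an orthogonal map preserves the l2-norm
```

Because Πg is regenerated from a stored seed, it never needs to be saved alongside the weights.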

Step 3: Scaling

Multiplying by √d brings each coordinate to unit variance, 𝒩(0, 1), matching the distribution the Lloyd-Max codebook is built for.
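A quick numeric check of this step: a unit-norm vector of dimension d has coordinates with variance 1/d, so scaling by √d makes the empirical standard deviation approximately 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
row = rng.standard_normal(d)
row /= np.linalg.norm(row)   # unit-norm row: coordinates ~ N(0, 1/d)
scaled = row * np.sqrt(d)    # coordinates now ~ N(0, 1)
# The squared norm of `scaled` is exactly d, so the empirical
# standard deviation of its coordinates is close to 1.
```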

Step 4: Scalar Quantization

Each scalar coordinate is independently quantized using the Lloyd-Max optimal boundaries for 𝒩(0,1). At 4 bits: 16 centroids, 15 decision boundaries.
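A minimal sketch of how such a codebook can be fit and applied, using Lloyd's algorithm on Gaussian samples (the actual codebook may be computed differently, e.g. from closed-form tables; `lloyd_max_codebook` is a hypothetical name):

```python
import numpy as np

def lloyd_max_codebook(samples: np.ndarray, bits: int = 4, iters: int = 50):
    # Lloyd's algorithm: alternate between setting decision boundaries
    # at centroid midpoints and moving centroids to cell means.
    k = 2 ** bits
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # init
    for _ in range(iters):
        bounds = (centroids[:-1] + centroids[1:]) / 2   # k-1 boundaries
        cells = np.searchsorted(bounds, samples)
        centroids = np.array([samples[cells == j].mean() for j in range(k)])
    return centroids, bounds

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)            # proxy for N(0, 1) coords
codebook, boundaries = lloyd_max_codebook(samples)
indices = np.searchsorted(boundaries, samples)    # 4-bit indices in [0, 15]
```

At 4 bits this yields 16 centroids and 15 boundaries, as stated above; quantization is a single `searchsorted` against the boundaries.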

Step 5: 4-bit Packing

Consecutive pairs of 4-bit indices are packed into a single uint8 byte, halving the storage for the index tensor.
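The packing step can be sketched with bitwise operations; placing the even-position index in the low nibble is an assumption (the codebase may use the opposite convention):

```python
import numpy as np

def pack_nibbles(idx: np.ndarray) -> np.ndarray:
    # Pack consecutive index pairs into one byte: even-position index
    # in the low nibble, odd-position index in the high nibble.
    idx = idx.astype(np.uint8)
    return idx[:, 0::2] | (idx[:, 1::2] << 4)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    # Invert the packing: interleave low and high nibbles.
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    out[:, 0::2] = packed & 0x0F
    out[:, 1::2] = packed >> 4
    return out

idx = np.random.default_rng(0).integers(0, 16, size=(4, 8), dtype=np.uint8)
packed = pack_nibbles(idx)   # shape (4, 4): half the bytes of idx
```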


Output Format

| Component | Shape | Dtype | Purpose |
|---|---|---|---|
| indices_packed | (M, N/2) | uint8 | Two 4-bit codebook indices per byte |
| weight_norms | (M,) or (M, G) | float32 | Row or group norms for rescaling |
| codebook | (2ᵇ,) | float32 | Lloyd-Max centroids (shared globally) |
| seed | scalar | int | Rotation seed for reproducibility |

Implementation Entry Points

- quantize.py: turboquant_quantize_packed() (standalone)
- model.py: quantize_model() (full model)