Quantization Pipeline
Each nn.Linear weight matrix W ∈ ℝᴹˣᴺ is compressed from bf16/fp32 to 4-bit packed indices in five steps.
Pipeline Overview
Row normalization → random rotation → √d scaling → Lloyd-Max scalar quantization → 4-bit packing.
Step-by-Step
Row Normalization
Each row of the group slice is divided by its ℓ₂-norm. The norm α is stored separately and applied during inference.
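A minimal NumPy sketch of this step (toy shapes; variable names are illustrative, not from the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)  # toy weight matrix, M=4, N=8

# Per-row l2 norms, kept in float32 and stored separately for inference.
alpha = np.linalg.norm(W, axis=1, keepdims=True)
W_normed = W / alpha  # each row now has unit l2 norm
```

At inference time the dequantized row is simply multiplied back by its stored α.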
Random Rotation
A random orthogonal matrix Πg decorrelates the weight coordinates. After rotation, the coordinates of a unit-norm row are approximately i.i.d. 𝒩(0, 1/d), where d is the dimension of the rotated slice.
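One common way to draw such a matrix is QR decomposition of a Gaussian matrix; the sketch below assumes that construction (the document does not specify how Πg is generated):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 64

# Random orthogonal matrix via QR of a Gaussian matrix; the sign fix
# makes the distribution uniform (Haar) over orthogonal matrices.
Q, R = np.linalg.qr(rng.normal(size=(d, d)))
Q *= np.sign(np.diag(R))

x = rng.normal(size=d)
x /= np.linalg.norm(x)  # unit-norm row from the previous step
y = x @ Q               # rotated coordinates, each approximately N(0, 1/d)
```

Because Q is orthogonal, the rotation preserves the row's norm, so dequantization only needs the seed to regenerate Q.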
Scaling
Multiplying by √d brings each coordinate to unit variance, 𝒩(0, 1), exactly matching the distribution the Lloyd-Max codebook is built for.
Scalar Quantization
Each scalar coordinate is independently quantized using the Lloyd-Max optimal boundaries for 𝒩(0,1). At 4 bits: 16 centroids, 15 decision boundaries.
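A sketch of the quantizer, assuming the codebook is obtained by running Lloyd's algorithm on a large 𝒩(0, 1) sample (the fixed point approximates the Lloyd-Max minimum-MSE quantizer; the real implementation may use precomputed centroids):

```python
import numpy as np

rng = np.random.default_rng(7)
bits = 4
k = 2 ** bits  # 16 centroids

# Approximate the Lloyd-Max codebook for N(0,1) via Lloyd's algorithm.
samples = rng.normal(size=200_000)
centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # initial guess
for _ in range(50):
    boundaries = (centroids[:-1] + centroids[1:]) / 2  # 15 decision boundaries
    idx = np.digitize(samples, boundaries)
    centroids = np.array([samples[idx == j].mean() for j in range(k)])

# Quantize scaled coordinates: nearest-centroid lookup via the boundaries.
x = rng.normal(size=32)
codes = np.digitize(x, (centroids[:-1] + centroids[1:]) / 2)
x_hat = centroids[codes]  # dequantized values
```

Each coordinate maps to a 4-bit index; only `codes` (plus the shared codebook) needs to be stored.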
4-bit Packing
Consecutive pairs of 4-bit indices are packed into a single uint8 byte, halving the storage for the index tensor.
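The packing step can be sketched with bitwise ops (the nibble order here — even column in the low nibble — is an assumption; the actual layout may differ):

```python
import numpy as np

codes = np.array([[3, 12, 0, 15],
                  [7, 7, 1, 8]], dtype=np.uint8)  # 4-bit indices, N even

# Pack consecutive pairs: even column -> low nibble, odd column -> high nibble.
packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)

# Unpack for inference.
unpacked = np.empty_like(codes)
unpacked[:, 0::2] = packed & 0x0F
unpacked[:, 1::2] = packed >> 4
```

The packed tensor has shape (M, N/2), halving index storage as stated above.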
Output Format
| Component | Shape | Dtype | Purpose |
|---|---|---|---|
| indices_packed | (M, N/2) | uint8 | Two 4-bit codebook indices per byte |
| weight_norms | (M,) or (M, G) | float32 | Row or group norms for rescaling |
| codebook | (2ᵇ,) | float32 | Lloyd-Max centroids (shared globally) |
| seed | scalar | int | Rotation seed for reproducibility |
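The output of the pipeline might be assembled like this (a hypothetical container mirroring the table; the key names are illustrative):

```python
import numpy as np

M, N, bits = 4, 8, 4
rng = np.random.default_rng(0)

# Hypothetical quantized-layer container matching the table above.
quantized = {
    "indices_packed": rng.integers(0, 256, size=(M, N // 2), dtype=np.uint8),
    "weight_norms": rng.random(M).astype(np.float32),            # per-row norms
    "codebook": np.sort(rng.normal(size=2 ** bits)).astype(np.float32),
    "seed": 42,  # regenerates the rotation matrix at load time
}

# Index storage is M * N / 2 bytes: half a byte per weight.
packed_bytes = quantized["indices_packed"].nbytes
```

Storing only the seed, rather than the rotation matrix itself, keeps the on-disk overhead per layer to a single integer.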