⚙️ Quantization Pipeline

Each nn.Linear weight matrix W ∈ ℝᴹˣᴺ is compressed from bf16/fp32 to 4-bit packed indices in five steps.

Pipeline Overview

1. 📐 Normalize: W / ‖W‖₂ → unit norm
2. 🔄 Rotate: W·Πᵀ → 𝒩(0, 1/d)
3. 📏 Scale: × √d → 𝒩(0, 1)
4. 🎯 Quantize: Lloyd-Max → 4-bit idx
5. 📦 Pack: 2 indices → 1 byte

Step-by-Step

Step 1: Row Normalization

Each row of the group slice is divided by its ℓ₂-norm. The norm α is stored separately and applied during inference.
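The row-normalization step can be sketched as follows; `normalize_rows` is a hypothetical helper name, not an identifier from the codebase:

```python
import numpy as np

def normalize_rows(W: np.ndarray):
    # Divide each row by its l2-norm; return the norms (the alpha
    # values) so they can be re-applied at inference time.
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # shape (M, 1)
    return W / norms, norms.squeeze(1)

W = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
W_unit, alpha = normalize_rows(W)
# Every row of W_unit now has unit l2-norm; alpha recovers W exactly.
```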

Step 2: Random Rotation

A random orthogonal matrix Πg (one per group) decorrelates the weight coordinates. After rotation, each coordinate of a unit-norm row is approximately i.i.d. 𝒩(0, 1/d), where d is the group dimension.
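One standard way to generate such a matrix (an assumption here; the source does not specify how Πg is constructed) is QR decomposition of a Gaussian matrix, with a sign fix so the result is Haar-uniform:

```python
import numpy as np

def random_rotation(d: int, seed: int) -> np.ndarray:
    # QR-decompose a Gaussian matrix to obtain a random orthogonal
    # matrix; fixing the column signs makes it Haar-uniform.
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

d = 64
Pi = random_rotation(d, seed=0)
x = np.ones(d) / np.sqrt(d)   # a unit vector
y = Pi @ x                    # an orthogonal map preserves the l2-norm
```

Because Πg is regenerated from a stored seed, it never needs to be saved alongside the weights.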

Step 3: Scaling

Multiplying by √d brings each coordinate to unit variance, 𝒩(0, 1), matching the distribution the Lloyd-Max codebook is built for.
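A quick numeric check of this step: a unit-norm vector of dimension d has coordinates with variance 1/d, so scaling by √d makes the empirical standard deviation approximately 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
row = rng.standard_normal(d)
row /= np.linalg.norm(row)   # unit-norm row: coordinates ~ N(0, 1/d)
scaled = row * np.sqrt(d)    # coordinates now ~ N(0, 1)
# The squared norm of `scaled` is exactly d, so the empirical
# standard deviation of its coordinates is close to 1.
```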

Step 4: Scalar Quantization

Each scalar coordinate is independently quantized using the Lloyd-Max optimal boundaries for 𝒩(0,1). At 4 bits: 16 centroids, 15 decision boundaries.
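A minimal sketch of how such a codebook can be fit and applied, using Lloyd's algorithm on Gaussian samples (the actual codebook may be computed differently, e.g. from closed-form tables; `lloyd_max_codebook` is a hypothetical name):

```python
import numpy as np

def lloyd_max_codebook(samples: np.ndarray, bits: int = 4, iters: int = 50):
    # Lloyd's algorithm: alternate between setting decision boundaries
    # at centroid midpoints and moving centroids to cell means.
    k = 2 ** bits
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # init
    for _ in range(iters):
        bounds = (centroids[:-1] + centroids[1:]) / 2   # k-1 boundaries
        cells = np.searchsorted(bounds, samples)
        centroids = np.array([samples[cells == j].mean() for j in range(k)])
    return centroids, bounds

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)            # proxy for N(0, 1) coords
codebook, boundaries = lloyd_max_codebook(samples)
indices = np.searchsorted(boundaries, samples)    # 4-bit indices in [0, 15]
```

At 4 bits this yields 16 centroids and 15 boundaries, as stated above; quantization is a single `searchsorted` against the boundaries.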

Step 5: 4-bit Packing

Consecutive pairs of 4-bit indices are packed into a single uint8 byte, halving the storage for the index tensor.
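The packing step can be sketched with bitwise operations; placing the even-position index in the low nibble is an assumption (the codebase may use the opposite convention):

```python
import numpy as np

def pack_nibbles(idx: np.ndarray) -> np.ndarray:
    # Pack consecutive index pairs into one byte: even-position index
    # in the low nibble, odd-position index in the high nibble.
    idx = idx.astype(np.uint8)
    return idx[:, 0::2] | (idx[:, 1::2] << 4)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    # Invert the packing: interleave low and high nibbles.
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    out[:, 0::2] = packed & 0x0F
    out[:, 1::2] = packed >> 4
    return out

idx = np.random.default_rng(0).integers(0, 16, size=(4, 8), dtype=np.uint8)
packed = pack_nibbles(idx)   # shape (4, 4): half the bytes of idx
```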


Output Format

| Component | Shape | Dtype | Purpose |
|---|---|---|---|
| indices_packed | (M, N/2) | uint8 | Two 4-bit codebook indices per byte |
| weight_norms | (M,) or (M, G) | float32 | Row or group norms for rescaling |
| codebook | (2ᵇ,) | float32 | Lloyd-Max centroids (shared globally) |
| seed | scalar | int | Rotation seed for reproducibility |

Implementation Entry Points

- quantize.py: turboquant_quantize_packed() (standalone)
- model.py: quantize_model() (full model)