🎯

Block-wise Calibration

Fine-tune per-group norms through each transformer block to minimize end-to-end reconstruction error — recovering ~28% of the quantization gap.

Why Per-Layer Calibration Fails

After quantization, per-row norms are computed analytically. While optimal in a per-layer MSE sense, they don't account for error propagation: the output error of block $l$ becomes the input error for block $l+1$.

Per-layer calibration result

On Qwen3.5-0.8B-Base, per-layer calibration degraded end-to-end quality: PPL rose by 0.0656 and KLD by 0.005, even though per-layer MSE improved for every layer. Locally optimal norms can amplify errors when composed through the network.

Block-wise Objective

For each transformer block $l$, given the quantized model's actual input at that block, optimize all norm vectors jointly against the full-precision block output, using a loss with three terms:

📐

MSE

Matches output magnitude — the primary reconstruction signal.

🧭

Angular

Preserves direction — critical for attention dot products.

📊

KLD

Preserves distribution shape across the feature dimension.
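
The three terms above could be combined along the following lines. This is only a sketch under assumptions: the function name `blockwise_loss`, the weights `w_ang` and `w_kld`, the use of cosine similarity for the angular term, and the softmax over the feature dimension for the KLD term are all hypothetical choices, not taken from the actual implementation.

```python
import numpy as np

def _softmax(x):
    # Numerically stable softmax over the last (feature) axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def blockwise_loss(y_q, y_fp, w_ang=1.0, w_kld=1.0, eps=1e-8):
    """Hypothetical combined loss for block outputs of shape (tokens, features)."""
    # MSE: matches output magnitude.
    mse = np.mean((y_q - y_fp) ** 2)
    # Angular: 1 - cosine similarity per token, averaged over tokens.
    cos = np.sum(y_q * y_fp, axis=-1) / (
        np.linalg.norm(y_q, axis=-1) * np.linalg.norm(y_fp, axis=-1) + eps
    )
    angular = np.mean(1.0 - cos)
    # KLD: KL(softmax(y_fp) || softmax(y_q)) across the feature dimension.
    p, q = _softmax(y_fp), _softmax(y_q)
    kld = np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))
    # The weights w_ang and w_kld are assumed; the source does not state them.
    return mse + w_ang * angular + w_kld * kld, (mse, angular, kld)
```

Returning the individual terms alongside the total makes it easy to see which signal dominates during optimization.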

Per-Group Exponential Parameterization

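Instead of optimizing the norms directly, we learn a per-group multiplicative correction applied to each analytical norm $n_{m,g}$ (row $m$, group $g$):

$$\hat{n}_{m,g} = n_{m,g} \cdot \exp(\delta_{m,g})$$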

Identity at init

$\exp(0) = 1$, so the initial solution is the analytical norm.

Per-group granularity

Each group's norm is adjusted independently, which roughly doubles the PPL improvement over per-row correction (−0.514 vs. −0.259).

Always positive

$\exp(\delta) > 0$ for all $\delta$, ensuring valid norms.
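
The three properties above can be checked on a small example. This is an illustrative sketch: the sizes `M` and `G`, the random stand-in for the analytical norms, and the helper name `corrected_norms` are all assumptions made for the demonstration.

```python
import numpy as np

M, G = 4, 8                      # rows and groups per row (illustrative sizes)
rng = np.random.default_rng(0)
analytic_norms = rng.uniform(0.5, 2.0, size=(M, G))  # stand-in for analytical norms

# Per-group: one learnable log-scale per (row, group); zeros give the identity.
delta_group = np.zeros((M, G))
# Per-row alternative: one log-scale per row, broadcast across that row's groups.
delta_row = np.zeros((M, 1))

def corrected_norms(norms, delta):
    # exp keeps the multiplier strictly positive for any real-valued delta.
    return norms * np.exp(delta)

init = corrected_norms(analytic_norms, delta_group)          # equals analytic_norms
perturbed = corrected_norms(analytic_norms, rng.standard_normal((M, G)))
```

Here `delta_group` holds M×G parameters and `delta_row` holds M, which is the difference between the per-group and per-row variants compared in the results.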

Algorithm

Pre-capture FP targets

Run one full-precision forward pass, capturing all block outputs. Then offload the FP model to CPU to free GPU memory.

Sequential block optimization

For each block $l$: disable fused kernels, create learnable parameters, run AdamW for a fixed number of iterations against the FP target, fold the optimal correction into the norms, then restore fused kernels.

Propagate through calibrated blocks

After calibrating block $l$, run a forward pass through the now-calibrated model to capture the actual input for block $l+1$. This ensures each block is optimized against the errors it will actually see at inference.
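
The three steps above can be sketched end to end on a toy model. Everything here is an assumption made for illustration: the blocks are plain linear layers, `fake_quantize` is a crude rounding quantizer standing in for the real one, plain gradient descent replaces AdamW, the loss is MSE only, and fused-kernel toggling and CPU offload are omitted. All function names are hypothetical.

```python
import numpy as np

def fake_quantize(w, group_size, levels=4):
    """Crude stand-in quantizer: per-group L2 norm plus rounded unit directions."""
    m, k = w.shape
    groups = w.reshape(m, k // group_size, group_size)
    norms = np.linalg.norm(groups, axis=-1)            # (M, G) analytical norms
    codes = np.round(groups / norms[..., None] * levels) / levels
    return codes, norms

def block_forward(x, codes, norms):
    """Toy linear block: dequantize, then y = x @ W^T."""
    w = (codes * norms[..., None]).reshape(codes.shape[0], -1)
    return x @ w.T

def calibrate_block(x_q, y_fp, codes, norms, iters=100, lr=0.05):
    """Learn delta (M, G) so that norms * exp(delta) lowers the block output MSE."""
    n = x_q.shape[0]
    m, g, s = codes.shape
    # partial[n, m, g]: contribution of group g to output m before correction.
    partial = np.einsum("ngs,mgs->nmg", x_q.reshape(n, g, s), codes * norms[..., None])
    delta = np.zeros((m, g))                           # identity at init
    best_delta, best_loss = delta.copy(), np.inf
    for _ in range(iters + 1):
        scale = np.exp(delta)
        resid = np.einsum("nmg,mg->nm", partial, scale) - y_fp
        loss = np.mean(np.sum(resid ** 2, axis=1))
        if loss < best_loss:                           # keep the best iterate seen
            best_loss, best_delta = loss, delta.copy()
        grad_scale = 2.0 / n * np.einsum("nm,nmg->mg", resid, partial)
        delta = delta - lr * grad_scale * scale        # chain rule through exp
    return norms * np.exp(best_delta), best_loss       # correction folded into norms

def blockwise_calibrate(x, fp_weights, group_size=4):
    # 1. Pre-capture FP targets with one full-precision pass.
    targets, h = [], x
    for w in fp_weights:
        h = h @ w.T
        targets.append(h)
    # 2-3. Calibrate blocks in order, feeding each the calibrated model's real input.
    h_q, report = x, []
    for w, y_fp in zip(fp_weights, targets):
        codes, norms = fake_quantize(w, group_size)
        before = np.mean(np.sum((block_forward(h_q, codes, norms) - y_fp) ** 2, axis=1))
        norms, after = calibrate_block(h_q, y_fp, codes, norms)
        report.append((before, after))
        h_q = block_forward(h_q, codes, norms)         # propagate through calibrated block
    return report

rng = np.random.default_rng(0)
dim = 16
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]
report = blockwise_calibrate(rng.standard_normal((64, dim)), weights)
```

Each entry of `report` holds a block's output error before and after calibration, measured on the input that the already-calibrated earlier blocks produce. Only the norms change; the codes stay fixed throughout.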

Results (Qwen3.5-0.8B-Base)

Method	PPL	ΔPPL	KLD	ΔKLD	Time
Analytical norms	13.9564	—	0.1301	—	—
Per-layer cal	14.0220	+0.066	0.1352	+0.005	~35 min
Per-row blockwise (4s/50i)	13.6971	−0.259	0.1170	−0.013	12.9 min
Per-group blockwise (4s/50i) ✨	13.4427	−0.514	0.0959	−0.034	14.0 min

bf16 baseline PPL: 12.1303. The quantization gap is $13.9564 - 12.1303 = 1.8261$. Per-group blockwise calibration recovers $0.5137 / 1.8261 \approx 28.1\%$ of that gap.
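
The gap-recovery figure follows directly from the table. As a quick check, using only the PPL values reported above (the helper name `recovered` is illustrative):

```python
bf16_ppl = 12.1303
analytic_ppl = 13.9564
gap = analytic_ppl - bf16_ppl          # quantization gap in PPL

def recovered(ppl):
    # Fraction of the quantization gap closed relative to analytical norms.
    return (analytic_ppl - ppl) / gap

per_group = recovered(13.4427)         # per-group blockwise, about 28.1%
per_row = recovered(13.6971)           # per-row blockwise, about 14.2%
per_layer = recovered(14.0220)         # negative: per-layer calibration widens the gap
```

By the same measure, per-row blockwise calibration closes about half as much of the gap as per-group.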

Per-group is 2× better than per-row

Different groups contribute unequally to output error. Per-group correction (M×G parameters) captures this structure, while per-row correction (M parameters) applies a single scaling to every group in a row. The extra parameters cost only about 8% more time (14.0 min vs. 12.9 min).

When to Use

✅

4-bit single-pass

Significant PPL and KLD improvement. Adds about 14 min for a 0.8B model with per-group correction. Enabled with --calibrate.

⏭️

4+4 residual

Already near-perfect (). Calibration has zero effect on PPL — skip it and save hours.

Relationship to Other Techniques

📐

Norm Compression

Calibration modifies the norm values; norm compression reduces their storage. Apply calibration first, then compress the calibrated norms.

🧮

Lloyd-Max Codebook

Calibration only adjusts norms — the codebook and quantized indices remain fixed. This makes it a lightweight post-quantization step.
