Block-wise Calibration
Fine-tune per-group norms through each transformer block to minimize end-to-end reconstruction error, recovering ~28% of the quantization gap.
Why Per-Layer Calibration Fails
After quantization, per-row norms are computed analytically. While optimal in a per-layer MSE sense, they don't account for error propagation: the output error of block $l$ becomes the input error for block $l+1$.
Per-layer calibration result
On Qwen3.5-0.8B-Base, per-layer calibration degraded end-to-end quality: PPL increased by 0.0656 and KLD by 0.005, even though per-layer MSE improved for every layer. Locally optimal norms can amplify errors when composed through the network.
Block-wise Objective
For each transformer block $l$, given the quantized model's actual input at that block, optimize all norm vectors jointly against a loss that combines three terms:
MSE
Matches output magnitude; this is the primary reconstruction signal.
Angular
Preserves direction, which is critical for attention dot products.
KLD
Preserves distribution shape across the feature dimension.
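The three terms above can be sketched in plain Python. This is only an illustration under stated assumptions: the source names the terms but not their exact form or weights, so the angular term is taken to be one minus cosine similarity, the KLD term a KL divergence between softmax-normalized feature vectors, and `w_mse`, `w_ang`, `w_kld` are hypothetical weights.

```python
import math

def mse_term(y, t):
    """Mean squared error: penalizes magnitude mismatch."""
    return sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)

def angular_term(y, t, eps=1e-12):
    """1 - cosine similarity: penalizes direction mismatch only (assumed form)."""
    dot = sum(a * b for a, b in zip(y, t))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_t = math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / (norm_y * norm_t + eps)

def log_softmax(v):
    """Numerically stable log-softmax over one feature vector."""
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]

def kld_term(y, t):
    """KL(softmax(t) || softmax(y)) across the feature dimension (assumed form)."""
    log_p, log_q = log_softmax(t), log_softmax(y)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def blockwise_loss(y, t, w_mse=1.0, w_ang=1.0, w_kld=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return w_mse * mse_term(y, t) + w_ang * angular_term(y, t) + w_kld * kld_term(y, t)

if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]      # full-precision block output (toy values)
    output = [0.9, -2.2, 0.4]      # quantized block output (toy values)
    print(blockwise_loss(output, target))
```

Note that a pure rescaling of the output leaves the angular term at zero while the MSE term grows, which is why the terms carry complementary signals.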
Per-Group Exponential Parameterization
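Instead of optimizing the norms directly, we multiply each analytical norm by a per-group correction, the exponential of a learnable parameter.
Identity at init
The learnable parameter is initialized to zero and $\exp(0) = 1$, so the initial solution is the analytical norm.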
Per-group granularity
Each group's norm is adjusted independently, which yields about 2× the PPL improvement of per-row correction.
Always positive
$\exp(x) > 0$ for all real $x$, ensuring valid norms.
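A minimal sketch of this parameterization, assuming M rows with one analytical norm each and G groups per row, so the learnable correction has M×G entries. The names `row_norms`, `delta`, and `corrected_norms` are illustrative and not taken from the source.

```python
import math

def corrected_norms(row_norms, delta):
    """Scale each row's analytical norm by exp(delta) per group.

    row_norms: list of M analytical norms (assumed one per row).
    delta:     M x G nested list of learnable log-scale corrections.
    Returns an M x G nested list of effective per-group norms.
    """
    return [[n * math.exp(d) for d in row] for n, row in zip(row_norms, delta)]

if __name__ == "__main__":
    row_norms = [0.5, 2.0]                      # M = 2
    delta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # G = 3, zero init
    # Zero init reproduces the analytical norms in every group.
    print(corrected_norms(row_norms, delta))
    # Any real-valued delta still gives strictly positive norms.
    print(corrected_norms(row_norms, [[-3.0, 0.1, 2.0], [0.5, -0.5, 0.0]]))
```

Because the optimizer works on `delta` in log space, no clamping or projection is needed to keep the norms positive.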
Algorithm
Pre-capture FP targets
Run one full-precision forward pass, capturing all block outputs. Then offload the FP model to CPU to free GPU memory.
Sequential block optimization
For each block $l$: disable fused kernels, create learnable parameters, run AdamW for a fixed number of iterations against the FP target, fold the optimal correction into the norms, then restore fused kernels.
Propagate through calibrated blocks
After calibrating block $l$, run a forward pass through the now-calibrated model to capture the actual input for block $l+1$. This ensures each block sees the correct error landscape.
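The control flow of the three steps above can be sketched on a toy model. Everything here is a simplifying assumption made for illustration: each "block" is a scalar function `tanh(norm * q * x)`, the loss is plain MSE, and plain gradient descent stands in for AdamW. Fused kernels and CPU offloading are not modeled, and `calibrate_blocks` is a hypothetical name.

```python
import math

def calibrate_blocks(fp_w, q_w, norms, xs, iters=300, lr=0.5):
    """Toy block-wise calibration on scalar blocks y = tanh(norm * q * x)."""
    # Step 1: pre-capture full-precision targets for every block.
    targets, h = [], list(xs)
    for w in fp_w:
        h = [math.tanh(w * x) for x in h]
        targets.append(h)

    norms = list(norms)
    h_q = list(xs)  # actual inputs seen by the quantized model
    for l, q in enumerate(q_w):
        # Step 2: optimize a log-scale correction, starting at identity.
        delta = 0.0
        for _ in range(iters):
            grad = 0.0
            for x, t in zip(h_q, targets[l]):
                s = norms[l] * math.exp(delta)
                y = math.tanh(s * q * x)
                # d/d(delta) of (y - t)^2, using ds/d(delta) = s
                grad += 2.0 * (y - t) * (1.0 - y * y) * q * x * s
            delta -= lr * grad / len(h_q)
        # Fold the learned correction into the stored norm.
        norms[l] *= math.exp(delta)
        # Step 3: propagate through the calibrated block to get the
        # actual inputs for the next block.
        h_q = [math.tanh(norms[l] * q * x) for x in h_q]
    return norms

if __name__ == "__main__":
    fp_w = [0.9, -1.3, 0.7]    # full-precision weights (toy values)
    q_w = [1.0, -1.0, 1.0]     # sign-only "quantized" weights
    init = [0.8, 1.2, 0.8]     # deliberately imperfect starting norms
    print(calibrate_blocks(fp_w, q_w, init, xs=[0.5, -1.0, 1.5]))
```

In this scalar toy a single norm can absorb all of the quantization error, so the calibrated norms approach the weight magnitudes. Real blocks cannot be matched exactly, which is why the propagation step matters: each later block is fitted against the inputs it will actually receive.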
Results (Qwen3.5-0.8B-Base)
| Method | PPL | ΔPPL | KLD | ΔKLD | Time |
|---|---|---|---|---|---|
| Analytical norms | 13.9564 | n/a | 0.1301 | n/a | n/a |
| Per-layer cal | 14.0220 | +0.066 | 0.1352 | +0.005 | ~35 min |
| Per-row blockwise (4s/50i) | 13.6971 | -0.259 | 0.1170 | -0.013 | 12.9 min |
| Per-group blockwise (4s/50i) ✨ | 13.4427 | -0.514 | 0.0959 | -0.034 | 14.0 min |
bf16 baseline PPL: 12.1303. The quantization gap is $13.9564 - 12.1303 = 1.8261$. Per-group blockwise calibration recovers $0.5137 / 1.8261 \approx 28.1\%$ of that gap.
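The recovered fraction follows directly from the table values. A short check, with variable names chosen here for illustration:

```python
bf16_ppl = 12.1303        # full-precision baseline
analytical_ppl = 13.9564  # quantized, analytical norms
calibrated_ppl = 13.4427  # quantized, per-group blockwise calibration

gap = analytical_ppl - bf16_ppl              # total quantization gap
recovered = analytical_ppl - calibrated_ppl  # PPL won back by calibration
fraction = recovered / gap

print(f"gap={gap:.4f} recovered={recovered:.4f} fraction={fraction:.1%}")
# prints: gap=1.8261 recovered=0.5137 fraction=28.1%
```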
Per-group is 2× better than per-row
Different groups contribute unequally to output error. Per-group correction (M×G parameters) captures this structure, while per-row correction (M parameters) applies a single uniform scaling to all groups in a row. The extra parameters cost only about 8% more time (14.0 min vs. 12.9 min).
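Both headline ratios can be reproduced from the results table. The dictionary layout below is only a convenience for this sketch:

```python
# Values copied from the results table (4s/50i runs).
per_row = {"dppl": -0.259, "dkld": -0.013, "minutes": 12.9}
per_group = {"dppl": -0.514, "dkld": -0.034, "minutes": 14.0}

ppl_ratio = per_group["dppl"] / per_row["dppl"]              # PPL gain ratio
kld_ratio = per_group["dkld"] / per_row["dkld"]              # KLD gain ratio
time_overhead = per_group["minutes"] / per_row["minutes"] - 1.0

print(f"PPL gain x{ppl_ratio:.2f}, KLD gain x{kld_ratio:.2f}, "
      f"time +{time_overhead:.1%}")
```

The PPL ratio comes out just under 2, and the time overhead is about 8.5%.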
When to Use
4-bit single-pass
Significant PPL and KLD improvement. Adds ~13 min for 0.8B models. Enabled with --calibrate.
4+4 residual
Already near-perfect. Calibration has zero effect on PPL, so skip it and save hours.
Relationship to Other Techniques
Norm Compression
Calibration modifies the norm values; norm compression reduces their storage. Apply calibration first, then compress the calibrated norms.
Lloyd-Max Codebook
Calibration only adjusts norms; the codebook and quantized indices remain fixed. This makes it a lightweight post-quantization step.