🎯

Block-wise Calibration

Fine-tune per-group norms through each transformer block to minimize end-to-end reconstruction error, recovering ~28% of the quantization gap.

Why Per-Layer Calibration Fails

After quantization, per-row norms are computed analytically. While optimal in a per-layer MSE sense, they don't account for error propagation: the output error of block $b$ becomes the input error for block $b+1$.

Per-layer calibration result

On Qwen3.5-0.8B-Base, per-layer calibration degraded end-to-end quality: PPL increased by 0.0656 and KLD by 0.005, even though per-layer MSE improved for every layer. Locally optimal norms can amplify errors when composed through the network.

Block-wise Objective

For each transformer block $b$, given the quantized model's actual input at that block, optimize all norm vectors jointly against the full-precision output of the same block. The loss combines three terms:


MSE

Matches output magnitude; this is the primary reconstruction signal.

🧭

Angular

Preserves direction, which is critical for attention dot products.

📊

KLD

Preserves distribution shape across the feature dimension.
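As a rough illustration of how the three terms could be combined for one block output, here is a minimal sketch. The exact forms are assumptions, not taken from the source: the angular term is written as one minus cosine similarity, the KLD term compares softmax distributions over the feature dimension, and the weights `w_ang` and `w_kld` are hypothetical.

```python
import math

def mse_loss(y, t):
    """Mean squared error between quantized output y and FP target t."""
    return sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)

def angular_loss(y, t, eps=1e-12):
    """One minus cosine similarity (assumed form of the angular term)."""
    dot = sum(a * b for a, b in zip(y, t))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_t = math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / (norm_y * norm_t + eps)

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def kld_loss(y, t, eps=1e-12):
    """KL(softmax(t) || softmax(y)) over the feature dimension (assumed form)."""
    p, q = softmax(t), softmax(y)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def block_loss(y, t, w_ang=1.0, w_kld=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return mse_loss(y, t) + w_ang * angular_loss(y, t) + w_kld * kld_loss(y, t)
```

Note that a pure rescaling of the output (for example `y = 2 * t`) leaves the angular term near zero while the MSE term is large, which is why the terms carry complementary signal.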

Per-Group Exponential Parameterization

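Instead of optimizing the norms directly, we use a per-group multiplicative correction: each analytical norm is multiplied by the exponential of a learnable per-group parameter.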

Identity at init

The learnable parameters start at zero and $\exp(0) = 1$, so the initial solution is the analytical norm.

Per-group granularity

Each group's norm is adjusted independently, which yields about 2× the PPL improvement of per-row correction.

Always positive

The exponential is positive for every real-valued parameter, ensuring valid norms.
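The properties above can be checked with a small sketch. The names `norms`, `delta`, and `apply_correction` are illustrative, and the layout (one row per output row, one entry per group) is an assumption about how the norms are stored.

```python
import math

def apply_correction(norms, delta):
    """Scale each analytical norm by exp(delta) for its (row, group) entry.

    norms and delta are nested lists of identical shape: rows x groups.
    """
    return [
        [n * math.exp(d) for n, d in zip(norm_row, delta_row)]
        for norm_row, delta_row in zip(norms, delta)
    ]

norms = [[0.5, 1.5], [2.0, 0.25]]          # analytical norms (toy values)
zeros = [[0.0, 0.0], [0.0, 0.0]]           # identity at init
shrunk = apply_correction(norms, [[-5.0, 0.3], [0.0, -0.1]])
```

With `zeros` the corrected norms equal the analytical ones, and even a strongly negative parameter such as `-5.0` only shrinks a norm toward zero without changing its sign.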

Algorithm

1. Pre-capture FP targets

Run one full-precision forward pass, capturing all block outputs. Then offload the FP model to CPU to free GPU memory.

2. Sequential block optimization

For each block $b$: disable fused kernels, create the learnable correction parameters, run AdamW for a fixed number of iterations against the FP target, fold the learned correction into the norms, then restore fused kernels.

3. Propagate through calibrated blocks

After calibrating block $b$, run a forward pass through the now-calibrated model to capture the actual input for block $b+1$. This ensures each block is optimized against the input it will really receive, including the residual error left by earlier blocks.
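The three steps can be sketched on a toy model. Everything below is a simplified stand-in rather than the actual implementation: blocks are plain linear maps, `sign_quantize` is a hypothetical toy quantizer, the loss is MSE only, and gradient descent with step halving replaces AdamW. The fused-kernel toggling and CPU offload are GPU details and are omitted.

```python
import math

def fp_forward(W, x):
    """Full-precision block: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sign_quantize(W, group_size):
    """Toy quantizer (assumption): codes are +/-1, norm is the mean |w| per group."""
    codes = [[1.0 if w >= 0 else -1.0 for w in row] for row in W]
    norms = [[sum(abs(w) for w in row[i:i + group_size]) / group_size
              for i in range(0, len(row), group_size)] for row in W]
    return {"codes": codes, "norms": norms, "group_size": group_size}

def q_forward(block, delta, x):
    """Quantized block with correction exp(delta[r][g]) applied to each norm."""
    gs, out = block["group_size"], []
    for r, row in enumerate(block["codes"]):
        out.append(sum(block["norms"][r][c // gs] * math.exp(delta[r][c // gs]) * code * x[c]
                       for c, code in enumerate(row)))
    return out

def block_mse(block, delta, inputs, targets):
    total = 0.0
    for x, t in zip(inputs, targets):
        y = q_forward(block, delta, x)
        total += sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)
    return total / len(inputs)

def grad_delta(block, delta, inputs, targets):
    """Analytic gradient of block_mse with respect to delta."""
    gs = block["group_size"]
    grad = [[0.0] * len(row) for row in delta]
    for x, t in zip(inputs, targets):
        y = q_forward(block, delta, x)
        for r, row in enumerate(block["codes"]):
            err = 2.0 * (y[r] - t[r]) / (len(y) * len(inputs))
            for c, code in enumerate(row):
                g = c // gs
                grad[r][g] += err * block["norms"][r][g] * math.exp(delta[r][g]) * code * x[c]
    return grad

def calibrate_block(block, inputs, targets, iters=50, lr=0.5):
    """Step 2: optimize delta from zero, then fold exp(delta) into the norms."""
    delta = [[0.0] * len(row) for row in block["norms"]]  # identity at init
    before = loss = block_mse(block, delta, inputs, targets)
    for _ in range(iters):
        grad = grad_delta(block, delta, inputs, targets)
        trial = [[d - lr * g for d, g in zip(dr, gr)] for dr, gr in zip(delta, grad)]
        trial_loss = block_mse(block, trial, inputs, targets)
        if trial_loss < loss:          # accept only steps that reduce the loss
            delta, loss = trial, trial_loss
        else:
            lr *= 0.5
    block["norms"] = [[n * math.exp(d) for n, d in zip(nr, dr)]
                      for nr, dr in zip(block["norms"], delta)]
    return before, loss

def calibrate_model(fp_blocks, q_blocks, samples, iters=50):
    # Step 1: capture every FP block output once.
    targets, acts = [], samples
    for W in fp_blocks:
        acts = [fp_forward(W, x) for x in acts]
        targets.append(acts)
    # Steps 2 and 3: calibrate each block, then propagate through it.
    history, inputs = [], samples
    for block, tgt in zip(q_blocks, targets):
        history.append(calibrate_block(block, inputs, tgt, iters))
        zero = [[0.0] * len(row) for row in block["norms"]]
        inputs = [q_forward(block, zero, x) for x in inputs]
    return history
```

`calibrate_model` returns one `(loss_before, loss_after)` pair per block. The key design point is that `inputs` for each block come from the already-calibrated quantized blocks, while `targets` always come from the full-precision pass.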

Results (Qwen3.5-0.8B-Base)

| Method | PPL | ΔPPL | KLD | ΔKLD | Time |
|---|---|---|---|---|---|
| Analytical norms | 13.9564 | n/a | 0.1301 | n/a | n/a |
| Per-layer cal | 14.0220 | +0.066 | 0.1352 | +0.005 | ~35 min |
| Per-row blockwise (4s/50i) | 13.6971 | −0.259 | 0.1170 | −0.013 | 12.9 min |
| Per-group blockwise (4s/50i) ✨ | 13.4427 | −0.514 | 0.0959 | −0.034 | 14.0 min |

bf16 baseline PPL: 12.1303. The quantization gap is $13.9564 - 12.1303 = 1.8261$. Per-group blockwise calibration recovers 28.1% of that gap ($0.5137 / 1.8261$).
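The recovery figure can be reproduced from the table values with a few lines of arithmetic. The function name `gap_recovered` is illustrative.

```python
def gap_recovered(ppl_baseline, ppl_quantized, ppl_calibrated):
    """Fraction of the quantization gap closed by calibration."""
    gap = ppl_quantized - ppl_baseline
    return (ppl_quantized - ppl_calibrated) / gap

per_group = gap_recovered(12.1303, 13.9564, 13.4427)  # about 0.281
per_row = gap_recovered(12.1303, 13.9564, 13.6971)    # about 0.142
```

By the same measure, per-row blockwise calibration closes roughly half as much of the gap as per-group.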

Per-group is 2× better than per-row

Different groups contribute unequally to output error. Per-group correction (M×G parameters, one per row and group) captures this structure, while per-row correction (M parameters) applies one uniform scale to every group in a row. The extra parameters cost only about 8% more time (14.0 min vs. 12.9 min).
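A toy least-squares fit illustrates why a single per-row scale can miss structure that per-group scales capture. The data are invented for illustration: one row with two groups whose true scales are 1.2 and 0.8, where `p1` and `p2` are each group's contribution to the row output.

```python
def fit_per_row(p1, p2, t):
    """Best single scale s minimizing sum (t - s * (p1 + p2))^2."""
    y = [a + b for a, b in zip(p1, p2)]
    return sum(u * v for u, v in zip(y, t)) / sum(u * u for u in y)

def fit_per_group(p1, p2, t):
    """Best (s1, s2) minimizing sum (t - s1*p1 - s2*p2)^2 via 2x2 normal equations."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * v for a, v in zip(p1, t))
    b2 = sum(b * v for b, v in zip(p2, t))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def sse(pred, t):
    return sum((u - v) ** 2 for u, v in zip(pred, t))

p1 = [1.0, 2.0, -1.0, 0.5]                       # group 1 contribution per sample
p2 = [0.5, -1.0, 2.0, 1.0]                       # group 2 contribution per sample
t = [1.2 * a + 0.8 * b for a, b in zip(p1, p2)]  # target with unequal group scales

s = fit_per_row(p1, p2, t)
s1, s2 = fit_per_group(p1, p2, t)
row_err = sse([s * (a + b) for a, b in zip(p1, p2)], t)
group_err = sse([s1 * a + s2 * b for a, b in zip(p1, p2)], t)
```

On this data the per-group fit recovers both scales and drives the error to numerical zero, while the best per-row scale leaves a clearly nonzero error because the two group errors pull in opposite directions.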

When to Use

✅

4-bit single-pass

Significant PPL and KLD improvement. Adds roughly 13 to 14 minutes for a 0.8B model. Enabled with --calibrate.

โญ๏ธ

4+4 residual

Already near-perfect. Calibration has zero effect on PPL, so skip it and save hours.

Relationship to Other Techniques


Norm Compression

Calibration modifies the norm values; norm compression reduces their storage. Apply calibration first, then compress the calibrated norms.

🧮

Lloyd-Max Codebook

Calibration only adjusts norms; the codebook and quantized indices remain fixed. This makes it a lightweight post-quantization step.
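To make the division of labor concrete, here is a minimal dequantization sketch. It assumes weights are reconstructed as a group norm times a codebook entry; the 2-bit codebook values and the function name `dequantize_row` are hypothetical.

```python
def dequantize_row(indices, codebook, group_norms, group_size):
    """Reconstruct one weight row as group_norm * codebook[index]."""
    return [group_norms[i // group_size] * codebook[idx]
            for i, idx in enumerate(indices)]

codebook = [-1.5, -0.5, 0.5, 1.5]   # fixed (hypothetical 2-bit levels)
indices = [0, 3, 1, 2]              # fixed quantized indices
analytical = [2.0, 4.0]             # one norm per group of 2 weights
calibrated = [2.2, 3.6]             # only these values change after calibration

w_before = dequantize_row(indices, codebook, analytical, 2)
w_after = dequantize_row(indices, codebook, calibrated, 2)
```

Calibration changes the reconstructed weights only through `group_norms`, so the stored indices and codebook do not need to be rewritten.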