📐

Norm Compression

Rank-1 SVD factorization with int8 residual reduces norm storage by ~4× — targeting the second-largest BPW term.

Where It Fits: The Norm Term

In the quantization formulation, the norm tensor is the second-largest storage component after quantized indices:

At with float32 norms, each pass contributes BPW. With two residual passes, this becomes 0.50 BPW — a significant fraction of the total budget that norm compression directly reduces.

Rank-1 Factorization

The norm tensor has strong low-rank structure: rows of the same layer tend to have similar magnitude patterns across groups. This motivates a rank-1 SVD approximation with a small int8 correction:

β_m

Row scale (float16)

Per-row magnitude

γ_g

Group scale (float16)

Per-group pattern

ε_m,g

Residual (int8)

Fractional correction

How It Works

SVD of the norm matrix

Compute and take the first singular vector: , .

Compute fractional residual

Quantize residual to int8

Symmetric quantization: , then . Typically of the norm value.

Storage Comparison

Method	Components	BPW (d=128)
float32 (baseline)	bits	0.250
float16	bits	0.125
Factored int8 ✨		~0.063

The factored representation achieves ~4× compression vs float32 norms, saving ~0.19 BPW per pass.

Reconstruction

The reconstruction error is bounded by the int8 quantization granularity: , which is typically less than 0.5% of the norm value.

Relationship to Other Techniques

🔄

Row Normalization (Step 1)

The norm tensor is produced during the normalization step of the quantization pipeline. This codec compresses that output.

🎯

Residual Quantization

Each residual pass produces its own norm tensor. Factorization is especially beneficial here since residual norms are highly structured.

🗜️

Entropy Coding

Entropy coding compresses the index tensor; norm factorization compresses the norm tensor. Together they address both major storage components.

📐

BPW Budget

At , switching from float32 to factored int8 saves ~0.19 BPW per pass — directly reducing the norm overhead in the formulation.

Implementation

norm_compression.py → factorize_norms() (rank-1 SVD + int8 residual)

norm_compression.py → reconstruct_norms() (reconstruct from factored form)

norm_compression.py → norm_bpw() (compute BPW overhead for any method)

← Entropy Coding Block-wise Calibration →