🗜️

Entropy Coding (rANS)

Compress quantized indices below their nominal bit-width by exploiting non-uniform Gaussian bin probabilities.

Where It Fits: The Index Term

In the quantization formulation, the dominant term in the storage budget is the index tensor, stored at b bits per weight.

Entropy coding targets this index term. Because Lloyd-Max quantization of a Gaussian source produces non-uniform bin probabilities (inner levels are more probable than outer ones), the Shannon entropy H of the indices is strictly less than b.

Entropy Gap

b (bits)	Levels	H (bits/sym)	H − b (bits/sym)
2	4	1.911	−0.089
3	8	2.832	−0.168
4	16	3.764	−0.236
5	32	4.755	−0.245

At 4 bits, entropy coding saves ~0.24 BPW — bringing the index cost from 4.0 to ~3.76 bits per weight.
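The entropies in the table can be approximated with a short script. This is a minimal sketch, assuming a unit-variance Gaussian source and a plain Lloyd iteration started from evenly spaced levels. The names `lloyd_max_bin_probs` and `shannon_entropy` are illustrative and are not the functions in entropy_codec.py.

```python
import math

def pdf(x):
    # Standard normal density; evaluates to 0.0 at +/- infinity.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    # Standard normal cumulative distribution via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bin_edges(levels):
    # Decision thresholds sit halfway between neighbouring levels.
    mids = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
    return [-math.inf] + mids + [math.inf]

def lloyd_max_bin_probs(bits, iters=500):
    """Bin probabilities of a Lloyd-Max quantizer for N(0, 1) (assumed source)."""
    n = 1 << bits
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]  # assumed start
    for _ in range(iters):
        edges = bin_edges(levels)
        # Each level moves to the centroid (conditional mean) of its bin.
        levels = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
                  for lo, hi in zip(edges, edges[1:])]
    edges = bin_edges(levels)
    return [cdf(hi) - cdf(lo) for lo, hi in zip(edges, edges[1:])]

def shannon_entropy(probs):
    # Entropy in bits per symbol.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

for b in (2, 3, 4, 5):
    h = shannon_entropy(lloyd_max_bin_probs(b))
    print(f"b={b}  H={h:.3f}  H-b={h - b:+.3f}")
```

The printed values should land close to the table rows; the last digits depend on how far the iteration is run and on the starting levels.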

How rANS Works

Asymmetric Numeral Systems (Duda 2009) achieve near-entropy-optimal compression with a simple, GPU-friendly decode loop. Symbols are split into fixed-size blocks for independent parallel decoding.

Encode (sequential per block)

Process symbols in reverse. For symbol s with frequency f_s and cumulative frequency c_s:

state = ⌊state / f_s⌋ × 2^P + (state mod f_s) + c_s

where frequencies are quantized to sum to 2^P.
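One way to obtain integer frequencies that sum to exactly 2^P is sketched below. This is an assumption about the normalization, not a description of build_ans_table(): the rounding policy, the helper names `quantize_freqs` and `cumulative`, and the choice P = 12 are all illustrative.

```python
def quantize_freqs(probs, P):
    """Scale probabilities to integer frequencies that sum to exactly 2**P."""
    total = 1 << P
    # Round each bin, but never let a symbol drop to zero frequency,
    # otherwise it could not be encoded at all.
    freqs = [max(1, round(p * total)) for p in probs]
    # Rounding leaves a small surplus or deficit; absorb it in the most
    # frequent bin, where the relative distortion is smallest.
    largest = max(range(len(freqs)), key=freqs.__getitem__)
    freqs[largest] += total - sum(freqs)
    if freqs[largest] < 1:
        raise ValueError("P is too small for this many symbols")
    return freqs

def cumulative(freqs):
    """Exclusive prefix sums: cum[s] is where symbol s starts in [0, 2**P)."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    return cum

# Approximate 2-bit Gaussian bin probabilities (illustrative values).
freqs = quantize_freqs([0.1631, 0.3369, 0.3369, 0.1631], P=12)
cum = cumulative(freqs)
print(freqs, cum)
```

Keeping every frequency at 1 or more costs a little compression on very rare bins but guarantees that any index value remains encodable.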

Decode (GPU-parallel per block)

Each block starts from a known 4-byte state. Per symbol:

1. slot = state & (2^P − 1)

2. symbol = LUT[slot] // O(1) table lookup

3. state = f_s × (state ≫ P) + slot − c_s

4. renormalize: read bytes while state < 2^16
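The four decode steps, together with a matching encoder, can be sketched for a single block in plain Python. This is a CPU illustration under stated assumptions, not the rANSCodec implementation: P = 12, byte-wise renormalization with the 2^16 lower bound from step 4, and the names `build_tables`, `encode_block` and `decode_block` are hypothetical.

```python
P = 12            # assumed precision: frequencies sum to 2**P
LOWER = 1 << 16   # renormalization bound from step 4

def build_tables(freqs):
    """Cumulative table plus the slot -> symbol lookup used in step 2."""
    assert sum(freqs) == 1 << P
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    lut = [s for s, f in enumerate(freqs) for _ in range(f)]
    return cum, lut

def encode_block(symbols, freqs, cum):
    state, out = LOWER, bytearray()
    for s in reversed(symbols):            # encode in reverse order
        f, c = freqs[s], cum[s]
        limit = ((LOWER >> P) << 8) * f    # keep the state below LOWER << 8
        while state >= limit:
            out.append(state & 0xFF)       # spill low bytes first
            state >>= 8
        state = ((state // f) << P) + (state % f) + c
    out.reverse()                          # decoder consumes bytes front to back
    return state, bytes(out)

def decode_block(state, payload, count, freqs, cum, lut):
    mask, pos, symbols = (1 << P) - 1, 0, []
    for _ in range(count):
        slot = state & mask                                # step 1
        s = lut[slot]                                      # step 2
        state = freqs[s] * (state >> P) + slot - cum[s]    # step 3
        while state < LOWER:                               # step 4
            state = (state << 8) | payload[pos]
            pos += 1
        symbols.append(s)
    return symbols

freqs = [600, 1448, 1448, 600]             # illustrative 2-bit table
cum, lut = build_tables(freqs)
data = [1, 2, 1, 0, 3, 2, 2, 1] * 50
state, payload = encode_block(data, freqs, cum)
assert decode_block(state, payload, len(data), freqs, cum, lut) == data
```

The final encoder state is what a decoder would read as the block's starting state. On a GPU each block would run this loop in its own thread or warp, which is why blocks must not share state.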

Decode Table Size

The entire decode table fits comfortably in GPU shared memory or registers:

Frequency table

2^b × 2 bytes (uint16). At 4-bit: 32 bytes.

Cumulative table

(2^b + 1) × 4 bytes (uint32). At 4-bit: 68 bytes.

Total: ~100 bytes for 4-bit — derived from the known Gaussian bin probabilities, no training data needed.
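The byte counts can be checked by packing both tables with the standard `struct` module. This is a small sketch with a placeholder frequency table, assuming little-endian uint16 frequencies and uint32 cumulative entries as stated above; `pack_tables` is a hypothetical helper.

```python
import struct

def pack_tables(freqs):
    """Serialize frequency (uint16) and cumulative (uint32) tables."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)            # one extra entry for the total
    freq_bytes = struct.pack(f"<{len(freqs)}H", *freqs)
    cum_bytes = struct.pack(f"<{len(cum)}I", *cum)
    return freq_bytes, cum_bytes

# Placeholder 4-bit table: 16 symbols with equal frequency, summing to 2**12.
# A real table would use the Gaussian bin probabilities instead.
freq_bytes, cum_bytes = pack_tables([256] * 16)
print(len(freq_bytes), len(cum_bytes), len(freq_bytes) + len(cum_bytes))
```

For 16 symbols this gives 32 and 68 bytes, 100 in total. A slot-to-symbol lookup table, if materialized, would need 2^P further entries and is not included in this count.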

Relationship to Other Techniques

📊

Lloyd-Max

Entropy coding exploits the non-uniform bin probabilities from optimal Gaussian quantization. A quantizer with equiprobable bins would have H = b (no saving).

📦

4-bit Packing

Packing reduces storage by fitting two indices per byte. Entropy coding goes further by exploiting statistical redundancy within those indices.

🎯

Residual Quantization

Each residual pass produces its own index tensor — entropy coding applies independently to each pass.

📐

Norm Compression

Entropy coding compresses the index tensor; norm factorization compresses the norm tensor. Together they address both major storage components.

Implementation

entropy_codec.py → gaussian_bin_probs() (compute bin probabilities from Lloyd-Max)

entropy_codec.py → compute_entropy() (theoretical entropy lower bound)

entropy_codec.py → build_ans_table() (frequency + cumulative tables)

entropy_codec.py → rANSCodec.encode() / decode() (block-parallel rANS)
