
Quantized Johnson-Lindenstrauss (QJL)

A 1-bit random projection technique for unbiased inner product estimation: elegant for KV-cache attention, but not the right tool for offline weight compression.

The Johnson-Lindenstrauss Lemma

The JL lemma (1984) states that any set of n points in high-dimensional space can be embedded into O(ε⁻² log n) dimensions while preserving all pairwise distances within a factor of (1 ± ε).

The projection is a random linear map: a matrix with i.i.d. Gaussian or sub-Gaussian entries, scaled by 1/√m (where m is the target dimension) so that squared norms are preserved in expectation. This is the theoretical foundation behind QJL.
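As a quick sanity check on the lemma, a plain Gaussian random projection (a minimal illustrative sketch, not code from any of the cited papers; the dimensions and seed are arbitrary) already preserves pairwise distances to within a few percent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 1000, 400             # points, original dim, target dim

X = rng.normal(size=(n, d))         # n points in R^d

# JL map: i.i.d. Gaussian entries scaled by 1/sqrt(m),
# so E[||P x||^2] = ||x||^2 for every x.
P = rng.normal(size=(d, m)) / np.sqrt(m)
Y = X @ P

def pdist(A):
    """All pairwise Euclidean distances of the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

mask = ~np.eye(n, dtype=bool)       # ignore the zero diagonal
ratios = pdist(Y)[mask] / pdist(X)[mask]
print(ratios.min(), ratios.max())   # all ratios concentrate around 1
```

With m = 400 target dimensions, every one of the ~1,200 pairwise distances typically lands within roughly ±15% of its original value, exactly the (1 ± ε) behavior the lemma promises.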

How QJL Works

QJL (Zandieh et al., 2024) takes the JL idea further: instead of storing the full projected coordinates, it keeps only the sign, just 1 bit per projection. Given m random Gaussian directions r₁, …, r_m, the inner product estimator is:

⟨q, k⟩ ≈ √(π/2) · (‖k‖ / m) · Σᵢ sign(⟨rᵢ, k⟩) ⟨rᵢ, q⟩
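A minimal sketch of this estimator, assuming standard Gaussian directions and the √(π/2)·‖k‖/m scaling that makes the sign estimate unbiased (variable names and sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 100_000                  # dimension, number of random projections

q = rng.normal(size=d)              # query vector
k = rng.normal(size=d)              # key vector to be quantized

R = rng.normal(size=(m, d))         # rows r_i: i.i.d. standard Gaussian

# Quantized key: 1 bit per projection (kept as +/-1 here for clarity;
# in practice these would be packed into a bitstring).
bits = np.sign(R @ k)

# Unbiased estimate of <q, k>:
#   sqrt(pi/2) * (||k|| / m) * sum_i sign(<r_i, k>) * <r_i, q>
est = np.sqrt(np.pi / 2) * np.linalg.norm(k) / m * (bits @ (R @ q))

true = q @ k
print(true, est)
```

Only `bits` (m bits) and the scalar ‖k‖ need to be stored per key; the variance of the estimate shrinks as 1/m.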

Key Properties

- 🎯 Unbiased: the estimate equals ⟨q, k⟩ in expectation
- 💾 1 bit per projection: store only sign(⟨rᵢ, v⟩)
- ⚡ Zero decode overhead: sign comparisons via bitwise XOR + popcount
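The "zero decode overhead" point can be made concrete: when both operands are sign codes packed into bitstrings, Σᵢ sᵢtᵢ = m − 2·popcount(s XOR t), since each disagreeing sign pair flips exactly one bit in the XOR. A toy pure-Python illustration (not the paper's actual kernel):

```python
def pack_signs(signs):
    """Pack a +/-1 sign vector into an int, one bit per entry (bit set for +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def sign_dot(a_bits, b_bits, m):
    """sum_i a[i]*b[i] for +/-1 vectors via XOR + popcount:
    each disagreement contributes -1, each agreement +1,
    so the dot product is m - 2 * (number of disagreements)."""
    return m - 2 * bin(a_bits ^ b_bits).count("1")

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, +1, -1, -1, +1, +1, +1]

fast = sign_dot(pack_signs(a), pack_signs(b), len(a))
direct = sum(x * y for x, y in zip(a, b))
print(fast, direct)   # identical results
```

On real hardware the same idea runs over 64-bit words with a native popcount instruction, which is why the decode cost is effectively zero.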

QJL in the TurboQuant Paper

The paper defines TurboQuant_prod, which combines standard TurboQuant with a QJL correction for an unbiased inner product estimator:

1. Quantize the vector v using TurboQuant (rotation + Lloyd-Max) → v̂
2. Compute the residual e = v − v̂
3. Apply 1-bit QJL to e for an unbiased correction: ⟨x, v⟩ ≈ ⟨x, v̂⟩ + (QJL estimate of ⟨x, e⟩)

This makes the overall estimator unbiased, which is critical for KV-cache attention where you quantize keys once and query with many different vectors over the sequence lifetime.
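The steps above can be sketched end to end. This is a toy version of the two-pass scheme: a simple 3-bit uniform grid stands in for the paper's rotation + Lloyd-Max pass, and the projection count, seed, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 100_000

v = rng.normal(size=d)              # vector to quantize (e.g. a key)
x = rng.normal(size=d)              # query arriving later

# Pass 1: coarse quantization. A 3-bit uniform grid stands in for
# the paper's rotation + Lloyd-Max codebook.
scale = np.abs(v).max() / 4
v_hat = np.round(v / scale) * scale

# Pass 2: 1-bit QJL on the residual e = v - v_hat.
e = v - v_hat
R = rng.normal(size=(m, d))
bits = np.sign(R @ e)               # stored 1-bit code

# Query time: <x, v> ~= <x, v_hat> + unbiased QJL estimate of <x, e>.
qjl_term = np.sqrt(np.pi / 2) * np.linalg.norm(e) / m * (bits @ (R @ x))
est = x @ v_hat + qjl_term

print(x @ v, est)
```

The coarse pass carries most of the signal deterministically; the QJL term removes the remaining bias in expectation, at the cost of per-query variance.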

Why This Project Doesn't Use QJL

QJL is designed for a fundamentally different use case. Here are the four reasons we chose multi-pass residual quantization instead.

1. Different Problem: Online vs Offline

QJL is designed for online inner product estimation: quantize once, query many times with different vectors. Weight quantization is offline: we compress once and compute repeatedly. We want minimum reconstruction error ‖W − Ŵ‖², not an unbiased dot-product estimator.
2. Unbiasedness Is Unnecessary for Weights

A small deterministic bias from MSE-optimal quantization is absorbed by layer norms, residual connections, and softmax normalization. An unbiased but high-variance estimator (QJL at 1 bit) introduces stochastic noise that changes every forward pass, which is worse for stable inference.
3. Residual Quantization Strictly Dominates

QJL uses 1 bit (random sign projection) for the residual correction. Our residual pass uses 4 bits with a full Lloyd-Max codebook plus an independent rotation, capturing far more residual information.

QJL correction: 1 bit per weight, random sign only
Residual TQ: 4 bits per weight, full Lloyd-Max codebook

At 4+4 total bits, residual TurboQuant achieves KL divergence of only 0.002 nats (practically lossless). A 1-bit QJL correction cannot compete.
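The gap can be illustrated on a synthetic Gaussian residual: reconstruct it from a sign-only code versus a 16-level codebook fitted with a few Lloyd iterations (a simple stand-in for a trained Lloyd-Max quantizer, not this project's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
e = rng.normal(size=100_000)          # toy residual, roughly Gaussian

# 1 bit per weight: keep only the sign, rescaled by the MSE-optimal
# magnitude E[|e|] (for N(0,1) this is sqrt(2/pi) ~ 0.8).
e_sign = np.sign(e) * np.abs(e).mean()

# 4 bits per weight: 16-level codebook, refined by Lloyd iterations
# (assign each sample to its nearest level, then recenter each level).
levels = np.linspace(e.min(), e.max(), 16)
for _ in range(20):
    idx = np.abs(e[:, None] - levels[None, :]).argmin(axis=1)
    for j in range(16):
        if np.any(idx == j):
            levels[j] = e[idx == j].mean()
idx = np.abs(e[:, None] - levels[None, :]).argmin(axis=1)
e_16 = levels[idx]

mse_sign = np.mean((e - e_sign) ** 2)
mse_16 = np.mean((e - e_16) ** 2)
print(mse_sign, mse_16)               # 16 levels are far more accurate
```

For a unit Gaussian the sign code cannot do better than MSE = 1 − 2/π ≈ 0.36, while a 16-level Lloyd-Max codebook lands around 0.01: over an order of magnitude less residual error at the same decode simplicity.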

4. QJL Requires the Query at Runtime

The QJL correction term depends on the input activation x, making it incompatible with offline weight compression. You'd need to recompute corrections on every forward pass, defeating the purpose of weight-only quantization.

Visual Comparison

TurboQuant_prod (Paper)

Pass 1: Lloyd-Max quantize (b₁ bits)
+
Pass 2: QJL 1-bit sign projection on residual
↓
Unbiased inner product estimator. Needs query x at runtime.

This Project (Residual TQ)

Pass 1: Full TQ: rotate + Lloyd-Max (4 bits)
+
Pass 2: Full TQ on residual (4 bits, new codebook)
↓
Near-lossless weight compression. Offline, no runtime dependency.

Summary

QJL is an elegant technique rooted in the JL lemma, perfect for streaming / KV-cache inner product preservation with 1-bit sign projections. For offline weight compression, multi-pass residual quantization with optimal scalar codebooks is the natural and superior choice, achieving practically lossless results at 4+4 bits with no runtime overhead.

References

QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead," 2024.

Johnson-Lindenstrauss: W. Johnson & J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, 1984.

TurboQuant: Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate," arXiv:2504.19874, 2025.