Inference Dequantization Pipeline

Key insight: rotate the input, not the weight. Pre-rotating the activation avoids materializing the full weight matrix.

The Key Insight

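Naively, dequantization would reconstruct the full weight matrix W and compute y = x · Wᵀ. Instead, we pre-rotate the activation, group by group:

y = Σ_g ((x_g · Π_gᵀ) · W_gᵀ) × α_g / √d

Here W_g = codebook[idx_g] is the unrotated codebook lookup for group g and α_g holds its per-row norms, so the inverse rotation is never applied to the weights.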

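The rotation is applied to x once per group per layer. Rotating the (B, d) activation slice costs about B·d² multiply-adds, versus M·d² for the inverse rotation of the (M, d) weight slice, and at inference time B is typically much smaller than M.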
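The equivalence behind this trick can be checked numerically for a single group. The sketch below is illustrative only: the shapes, the random orthogonal matrix `Pi`, and the stand-in lookup matrix `W_q` are assumptions, not values from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, d = 4, 64, 16                      # batch, output rows, group size (illustrative)

x_g = rng.standard_normal((B, d))        # activation slice for one group
W_q = rng.standard_normal((M, d))        # stand-in for codebook[idx_g]
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation

# Weight side: apply the rotation to every weight row, then multiply.
y_weight_side = x_g @ (W_q @ Pi).T

# Input side: rotate the activation once, multiply against the unrotated lookup.
y_input_side = (x_g @ Pi.T) @ W_q.T

# Multiply-add counts for the rotation alone.
rotation_cost_weight = M * d * d
rotation_cost_input = B * d * d

print(np.allclose(y_weight_side, y_input_side))
print(rotation_cost_weight, rotation_cost_input)
```

Both products agree because x · (W_q · Π)ᵀ = (x · Πᵀ) · W_qᵀ, while the rotation cost scales with B instead of M.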

Pipeline Overview

📥 Input x: (B, N) activation

🔄 Rotate x: x · Πᵀ (cheap: B×d)

📖 Unpack + Lookup: uint8 → codebook[idx]

✖️ Matmul: x_rot @ W_q.T

⚖️ Rescale: × α / √d

Forward Pass Algorithm

output = zeros(B, M)

for each group g in [0, n_groups):
    x_g   = x[:, g*d : (g+1)*d]           # (B, d)
    x_rot = x_g @ Pi_g.T                  # (B, d)  rotate input
    idx_g = unpack_4bit(packed[..., g])    # (M, d)  unpack
    W_g   = codebook[idx_g]               # (M, d)  lookup
    out_g = x_rot @ W_g.T                 # (B, M)  matmul
    out_g = out_g * (norms_g / sqrt(d))   # (B, M)  rescale
    output += out_g
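The loop above can be turned into a small NumPy reference for experimentation. This is a sketch under stated assumptions rather than the library's implementation: the packed layout `(n_groups, M, d // 2)`, the low-nibble-first ordering in `unpack_4bit`, and all tensor values are hypothetical.

```python
import numpy as np

def unpack_4bit(packed):
    """Split each uint8 into two 4-bit indices (low nibble first, an assumed layout)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def forward_pass(x, packed, codebook, norms, rotations, d):
    """x: (B, N), packed: (n_groups, M, d//2) uint8,
    norms: (n_groups, M), rotations: (n_groups, d, d)."""
    n_groups, M, _ = packed.shape
    output = np.zeros((x.shape[0], M), dtype=np.float32)
    for g in range(n_groups):
        x_g = x[:, g * d:(g + 1) * d]              # (B, d)
        x_rot = x_g @ rotations[g].T               # (B, d) rotate input
        idx_g = unpack_4bit(packed[g])             # (M, d) unpack
        W_g = codebook[idx_g]                      # (M, d) lookup
        out_g = x_rot @ W_g.T                      # (B, M) matmul
        output += out_g * (norms[g] / np.float32(np.sqrt(d)))  # rescale per output row
    return output

# Tiny random problem to exercise the function.
rng = np.random.default_rng(0)
B, M, d, n_groups = 2, 8, 4, 3
codebook = np.linspace(-1.5, 1.5, 16).astype(np.float32)   # 16 entries for 4-bit indices
packed = rng.integers(0, 256, size=(n_groups, M, d // 2), dtype=np.uint8)
norms = rng.uniform(0.5, 2.0, size=(n_groups, M)).astype(np.float32)
rotations = np.stack(
    [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(n_groups)]
).astype(np.float32)
x = rng.standard_normal((B, n_groups * d)).astype(np.float32)

y = forward_pass(x, packed, codebook, norms, rotations, d)
print(y.shape)
```

A useful sanity check is to build the dense weight slice `(norms / sqrt(d)) * (codebook[idx] @ Pi)` for each group and confirm that `x @ W_dense.T` matches `y`.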

Kernel Fusion

Steps 2–5 (unpack → lookup → matmul → rescale) are fused into a single GPU kernel to avoid intermediate tensor materialization.

Naive Pipeline (4 kernel launches)

📦 Unpack uint8 → int64 (Global Memory)

↓ write + read

📖 Codebook lookup → float32 (Global Memory)

↓ write + read

✖️ Matrix multiply (Global Memory)

↓ write + read

⚖️ Rescale (Global Memory)

Fused Kernel (1 kernel launch)

📦 Load packed uint8 (Registers)

↓ in-register

🔓 Unpack nibbles, bitwise (Registers)

↓ in-register

📖 Codebook, 64B in L1 (Shared Mem)

↓ in-register

✖️ Tensor Core MMA + Rescale (Registers)

↓ in-register

💾 Store final result (1× Global Write)
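The bitwise unpack step is simple enough to show in plain Python. The sketch below assumes a low-nibble-first layout and uses a placeholder codebook; neither is confirmed by this page. The 64B codebook size is consistent with 16 entries (one per 4-bit index) stored as 4-byte floats.

```python
def pack_nibbles(indices):
    """Pack pairs of 4-bit indices into bytes, low nibble first (assumed layout)."""
    assert len(indices) % 2 == 0 and all(0 <= i < 16 for i in indices)
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[0::2], indices[1::2]))

def unpack_nibbles(packed):
    """Recover two 4-bit indices from every byte using a mask and a shift."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)   # low nibble
        out.append(byte >> 4)     # high nibble
    return out

indices = [3, 15, 0, 9]
packed = pack_nibbles(indices)            # 2 bytes hold 4 indices
restored = unpack_nibbles(packed)

codebook = [i / 15.0 for i in range(16)]  # placeholder values, 16 entries
values = [codebook[i] for i in restored]  # lookup after unpack
codebook_bytes = len(codebook) * 4        # 16 entries at 4 bytes each

print(packed.hex(), restored, codebook_bytes)
```

In a fused kernel the same mask and shift would run per thread on register values, so the unpacked indices never need to be written out as a separate tensor.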

Execution Paths

CuTile (CUDA 13.1+, Ampere+: sm80/sm89/sm100+)

NVIDIA cuda.tile_experimental API. Shared-memory codebook, FP16/BF16 tensor cores, tile-based prefetching.

Triton (Triton ≥ 3.0)

Portable alternative. Autotuned block sizes per problem shape, software pipelining, TF32 tensor cores.

PyTorch fallback (no special dependencies)

Explicit operations: unpack → codebook[indices] → matmul → rescale. Materializes dequantized weight slice.

Residual Pass Handling

When a layer has residual quantization, the forward method runs _forward_pass twice with different packed data and sums the results:

output  = _forward_pass(x, pass1_data)
output += _forward_pass(x, pass2_data)  # if residual
output += bias                           # if present
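Summing two passes works because the layer is linear in its weights. The toy example below assumes that residual quantization means pass 2 encodes the error left by pass 1. The uniform rounding quantizer `toy_quantize` and the dense `toy_forward_pass` are hypothetical stand-ins, not the library's codebook scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, N = 4, 16, 32
W = rng.standard_normal((M, N))
x = rng.standard_normal((B, N))
bias = rng.standard_normal(M)

def toy_quantize(w, step):
    """Hypothetical stand-in quantizer: round to a uniform grid."""
    return np.round(w / step) * step

def toy_forward_pass(x, W_hat):
    """Stand-in for one pass: a plain matmul against the dequantized weights."""
    return x @ W_hat.T

W1 = toy_quantize(W, step=0.5)          # pass 1: coarse approximation of W
W2 = toy_quantize(W - W1, step=0.05)    # pass 2: quantize what pass 1 missed

y_exact = x @ W.T + bias
y_one = toy_forward_pass(x, W1) + bias
y_two = toy_forward_pass(x, W1) + toy_forward_pass(x, W2) + bias

err_one = np.abs(y_one - y_exact).max()
err_two = np.abs(y_two - y_exact).max()
print(err_one, err_two)
```

The two-pass output equals a single matmul against `W1 + W2`, which is why the passes can be computed independently and added.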

CPU Offload for Pass 2

When GPU VRAM is limited, pass 2 (residual) data can be offloaded to CPU while pass 1 stays on GPU. This halves the on-GPU quantized weight footprint with ~10% latency overhead from pipelined Host-to-Device copies.

Architecture

GPU (always resident)

• Pass 1 data (indices, norms, codebook)
• SharedScratchPool (2 double-buffered slots)
• Embedding (bf16 or INT4/INT8)
• Activations / KV cache

CPU (pinned memory)

• Pass 2 data (indices, norms, codebook)
• Async H2D via copy_stream per layer

Prefetch Chain (per layer)

1. Fence: Record a CUDA event on the default stream and make copy_stream wait on it.

2. Async H2D: Copy the next layer's pass2 data to the alternate scratch slot via copy_stream.

3. Pass 1 compute: Runs on the default stream (overlaps with the H2D copy).

4. Wait: The default stream waits for this layer's pass2 copy (started by the previous layer).

5. Pass 2 compute: Uses the scratch slot now populated with pass2 data.
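The ordering of the chain can be traced without a GPU. The following plain-Python sketch only simulates the schedule and issues no CUDA calls. The function name, the event tuples, and the prologue copy for layer 0 are assumptions made for illustration.

```python
def prefetch_schedule(n_layers, n_slots=2):
    """Return the ordered (kind, layer, slot) events of a double-buffered prefetch."""
    events = []
    # Assumed prologue: layer 0 has no previous layer to start its copy.
    events.append(("h2d", 0, 0))
    for layer in range(n_layers):
        slot = layer % n_slots
        nxt = layer + 1
        if nxt < n_layers:
            nxt_slot = nxt % n_slots
            events.append(("fence", layer, nxt_slot))  # slot must be free before overwrite
            events.append(("h2d", nxt, nxt_slot))      # async copy of next layer's pass2 data
        events.append(("pass1", layer, None))          # overlaps with the copy above
        events.append(("wait", layer, slot))           # wait for this layer's own copy
        events.append(("pass2", layer, slot))          # consume the populated slot
    return events

events = prefetch_schedule(4)
for event in events:
    print(event)
```

With two slots, the copy for layer L+1 targets the slot last used by layer L-1, which is why the fence precedes it. In PyTorch these steps would presumably map onto `torch.cuda.Event.record`, `torch.cuda.Stream.wait_event`, and `Tensor.copy_(..., non_blocking=True)` from pinned memory.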

# Enable at quantization time
config = TurboQuantConfig(..., cpu_offload_pass2=True)
model  = quantize_model(model, config)

# Or override at load time
model = load_quantized(model_name, path, cpu_offload_pass2=True)

Memory Profile

The pipeline never materializes the full M×N weight matrix. Peak additional memory:

Component	Size	Notes
x_rot	B × d × 4 bytes	Per group, reused
W slice (PyTorch only)	M × d × 4 bytes	Per group, reused
Output acc.	B × M × 4 bytes	Persistent
Rotation matrix	d × d × 4 bytes	Cached

With fused kernels, the dequantized weight slice only exists in registers/shared memory within the kernel — never written to global memory.
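To get a feel for the magnitudes, the table can be evaluated for a concrete shape. The layer dimensions below (B = 8, M = N = 4096, d = 128) are hypothetical, and 4-byte elements are assumed throughout.

```python
BYTES_PER_ELEMENT = 4  # 4-byte elements, as in the table

def peak_extra_bytes(B, M, d, fused):
    """Per-component peak additional memory, following the table above."""
    sizes = {
        "x_rot": B * d * BYTES_PER_ELEMENT,
        "output_acc": B * M * BYTES_PER_ELEMENT,
        "rotation": d * d * BYTES_PER_ELEMENT,
    }
    if not fused:
        # Only the PyTorch fallback materializes the dequantized weight slice.
        sizes["w_slice"] = M * d * BYTES_PER_ELEMENT
    return sizes

B, M, N, d = 8, 4096, 4096, 128            # hypothetical layer shape
fallback = peak_extra_bytes(B, M, d, fused=False)
fused = peak_extra_bytes(B, M, d, fused=True)
full_weight = M * N * BYTES_PER_ELEMENT    # what a full dequantized matrix would cost

print(fallback)
print(sum(fallback.values()), sum(fused.values()), full_weight)
```

For this shape the fallback path peaks at roughly 2.2 MiB of extra memory, dominated by the 2 MiB weight slice, against 64 MiB for a fully materialized weight matrix.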

Implementation

module.py → TurboQuantLinear._forward_pass(), TurboQuantLinear.forward()
