CPU Offload (Pass 2)

Pipelined H2D streaming roughly halves the VRAM cost of 4+4 residual quantization: pass 2 lives on the CPU and is streamed into a shared, double-buffered scratch pool via CUDA streams.

Why Offload?

In 4+4 residual quantization, each layer stores two full sets of packed indices, norms, and codebook, one per pass. Pass 2 therefore roughly doubles the VRAM footprint compared to single-pass 4-bit.

For small-batch inference (batch 1), the GPU is compute-bound and PCIe bandwidth sits mostly idle. CPU offload exploits this by streaming pass 2 data from host memory to the GPU on demand.

For 24 equal-sized layers, two shared scratch slots (described below) replace 24 resident pass 2 copies: 1 − 2/24 ≈ 92% reduction in pass 2 VRAM.

Shared Double-Buffered Scratch

Instead of per-layer GPU scratch (which costs the same as just keeping pass 2 on GPU), a single SharedScratchPool holds 2 scratch slots sized to the largest offloaded layer. Layers are assigned alternating slots (ping-pong):

Slot 0: even-indexed layers. Consumed while Slot 1 receives H2D.
Slot 1: odd-indexed layers. Consumed while Slot 0 receives H2D.

Total GPU cost: 2 × S_max, where S_max is the pass 2 size of the largest offloaded layer. This is constant regardless of layer count. The rest lives in pinned CPU memory, which allows DMA transfers at full PCIe bandwidth.
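
The slot assignment and pool sizing described above can be sketched in a few lines. This is a minimal illustration rather than the library's SharedScratchPool: `plan_scratch` and its return values are hypothetical names, and the layer sizes are made-up numbers.

```python
def plan_scratch(pass2_bytes):
    """Assign each offloaded layer a ping-pong slot and size the shared pool.

    pass2_bytes: pass 2 sizes in bytes, one entry per offloaded layer.
    Returns (slot_for_layer, slot_bytes, total_gpu_bytes).
    """
    # Even-indexed layers use slot 0, odd-indexed layers use slot 1.
    slot_for_layer = [i % 2 for i in range(len(pass2_bytes))]
    # Each slot must fit the largest layer that can be copied into it.
    slot_bytes = max(pass2_bytes)
    # Two slots in total, no matter how many layers share them.
    total_gpu_bytes = 2 * slot_bytes
    return slot_for_layer, slot_bytes, total_gpu_bytes


slots, slot_bytes, total = plan_scratch([1_000, 4_000, 2_000, 4_000, 3_000])
print(slots)  # [0, 1, 0, 1, 0]
print(total)  # 8000
```

Sizing both slots to the largest layer keeps the pool layout fixed, at the price of some unused space when layers differ in size.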

Data Layout

| Buffer | Location | Scope | Purpose |
| --- | --- | --- | --- |
| Pass 1 (indices, norms, codebook) | GPU | Per-layer | Permanent, read by kernels |
| Pass 2 (pinned copies) | CPU | Per-layer | Source for async H2D |
| SharedScratchPool ✨ | GPU | Global (shared) | 2 ping-pong scratch slots |

Execution Timeline

Two CUDA streams operate concurrently. The copy stream runs in parallel with compute on the default stream:

Single Layer (No Prefetch)

Copy stream:  ╠══ H2D pass2 ═══╣
Default:      ╠══ pass1 rot ═══╬══ pass1 kernel ═══╬═ wait ═╬══ pass2 kernel ═══╣
                                                      ↑
                                                 wait_event()

With Next-Layer Prefetch

Copy stream:  ╠═ H2D L₀ pass2 ═╬═ H2D L₁ pass2 ═╬═ H2D L₂ pass2 ═╣
Default:      ╠═ L₀ p1 ═╬ wait ╬═ L₀ p2 ═╬═ L₁ p1 ═╬ wait ╬═ L₁ p2 ═╣
                           ↑    prefetch L₁→          ↑
                      wait event₀                wait event₁
                      (free)                     (usually free)

With Dual-Pass Fused Kernel

Copy stream:  ╠═ H2D pass2 ═══════╣
Default:      ╠═ rotations (both) ═╬═ wait ═╬═ dual_fused_kernel ═══╣
                                      ↑
                                 wait_event()

Latency Impact

For a typical layer, the pass 2 H2D transfer takes ~0.08 ms on PCIe 4.0 x16. At batch 1 the pass 1 kernel takes 0.2–0.5 ms, so the copy is fully hidden behind it.

| Batch | Pass 1 Time | H2D Time | Overhead |
| --- | --- | --- | --- |
| 1 | 0.3 ms | 0.08 ms | 0% (hidden) |
| 8 | 0.8 ms | 0.08 ms | 0% (hidden) |
| 32 | 2.5 ms | 0.08 ms | 0% (hidden) |
| 128 | 9 ms | 0.08 ms | 0% (hidden) |

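H2D time does not depend on batch size, because the transfer moves quantized weights rather than activations. Pass 1 only gets slower as the batch grows, so the copy stays hidden at all practical workloads.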
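
The figures above follow from a simple overlap model: the copy runs alongside the pass 1 kernel, so the default stream stalls only for the part of the copy that outlasts pass 1. A sketch of that arithmetic, using the timings from the table (the function name `exposed_wait_ms` is illustrative, and the model ignores launch overhead and PCIe contention):

```python
def exposed_wait_ms(pass1_ms, h2d_ms):
    """Time the default stream stalls waiting for the pass 2 copy.

    The copy overlaps with pass 1, so only the part of the copy that
    outlasts pass 1 shows up as added latency.
    """
    return max(0.0, h2d_ms - pass1_ms)


H2D_MS = 0.08  # pass 2 transfer time from the table, independent of batch size
pass1_ms_by_batch = {1: 0.3, 8: 0.8, 32: 2.5, 128: 9.0}

waits = {batch: exposed_wait_ms(t, H2D_MS) for batch, t in pass1_ms_by_batch.items()}
for batch, wait in waits.items():
    print(f"batch {batch:>3}: exposed wait {wait:.2f} ms")
```

Under this model the copy would only become visible if pass 1 for a layer finished in less than 0.08 ms.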

Memory Budget

For N equal-sized layers, each with pass 2 size S:

| Mode | GPU (pass 2) | CPU |
| --- | --- | --- |
| Non-offloaded | N × S | 0 |
| CPU offload ✨ | 2 × S (constant) | N × S (pinned) |

Qwen3.5-0.8B Example (24 layers)

| Mode | VRAM (weights) | CPU (pinned) |
| --- | --- | --- |
| bf16 baseline | ~1.6 GB | - |
| 4-bit single | ~0.4 GB | - |
| 4+4 residual | ~0.8 GB | - |
| 4+4 CPU offload ✨ | ~0.43 GB | ~0.4 GB |
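
The offload row can be checked by hand: pass 1 stays resident, and pass 2 shrinks from 24 per-layer copies to 2 shared slots. A short sketch of that arithmetic, assuming equal-sized layers and roughly 0.4 GB each for pass 1 and pass 2 (the function name is illustrative):

```python
def offload_budget_gb(pass1_gb, pass2_gb, num_layers, num_slots=2):
    """Return (vram_gb, pinned_cpu_gb) with pass 2 offloaded.

    Assumes equal-sized layers, so one scratch slot holds pass2_gb / num_layers.
    """
    slot_gb = pass2_gb / num_layers
    vram_gb = pass1_gb + num_slots * slot_gb  # pass 1 plus the shared scratch pool
    pinned_cpu_gb = pass2_gb                  # every layer's pass 2 lives in host memory
    return vram_gb, pinned_cpu_gb


vram, pinned = offload_budget_gb(pass1_gb=0.4, pass2_gb=0.4, num_layers=24)
saved = 1 - 2 / 24  # fraction of pass 2 VRAM that is no longer resident
print(f"VRAM ~{vram:.2f} GB, pinned CPU ~{pinned:.1f} GB, pass 2 VRAM saved ~{saved:.0%}")
# VRAM ~0.43 GB, pinned CPU ~0.4 GB, pass 2 VRAM saved ~92%
```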

Synchronization: CUDA Events

The pipeline uses CUDA events for device-side synchronization, with no CPU blocking:

1. Copy stream records event. Once the H2D copy is enqueued on the copy stream, record_event() marks that position in the stream; the event completes when the copy has finished.

2. Default stream waits on event. wait_event() does not block the CPU; only the GPU pauses, and only if the copy isn't done yet.

3. Prefetch the next layer. At the end of each layer's forward, enqueue the H2D copy for the next layer on the copy stream. By the time the data is needed, the copy has usually completed.

Using events instead of stream.synchronize() is critical: stream.synchronize() blocks the CPU, preventing it from submitting the next kernel, whereas wait_event() keeps the CPU free.
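
The three steps can be sketched with PyTorch's stream and event API. This is an illustration under stated assumptions, not TurboQuant's implementation: `OffloadedPass2`, `prefetch`, and `wait_ready` are hypothetical names, the pass 2 payload is reduced to a single uint8 tensor, a sum stands in for the pass 2 kernel, and a CUDA device is required for the demo to run.

```python
import torch


class OffloadedPass2:
    """Hypothetical holder for one layer's pass 2 data (not the TurboQuant API)."""

    def __init__(self, packed, scratch_slot, copy_stream):
        self.host = packed.pin_memory()  # pinned source allows async DMA
        self.slot = scratch_slot         # shared GPU scratch slot (ping-pong)
        self.copy_stream = copy_stream
        self.ready = torch.cuda.Event()

    def prefetch(self):
        """Enqueue the H2D copy on the copy stream, then record an event."""
        # Conservative choice: do not overwrite the slot until work already
        # queued on the compute stream (its previous consumer) has finished.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            self.slot[: self.host.numel()].copy_(self.host, non_blocking=True)
        self.copy_stream.record_event(self.ready)

    def wait_ready(self):
        """Make the compute stream wait on the GPU; the CPU does not block."""
        torch.cuda.current_stream().wait_event(self.ready)
        return self.slot[: self.host.numel()]


if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    slots = [torch.empty(1024, dtype=torch.uint8, device="cuda") for _ in range(2)]
    layers = [
        OffloadedPass2(torch.full((1024,), i, dtype=torch.uint8), slots[i % 2], copy_stream)
        for i in range(4)
    ]
    layers[0].prefetch()
    sums = []
    for i, layer in enumerate(layers):
        # The pass 1 kernel for layer i would be launched here.
        data = layer.wait_ready()
        sums.append(data.sum(dtype=torch.int64))  # stand-in for the pass 2 kernel
        if i + 1 < len(layers):
            layers[i + 1].prefetch()  # start the next layer's copy early
    torch.cuda.synchronize()  # only for reading results back in this demo
    print([int(s) for s in sums])
```

The `wait_stream` call in `prefetch` trades a little overlap for safety; a real pipeline could instead track a per-slot "consumed" event so the copy waits only on the kernel that last read that slot.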

Usage

CLI

turboquant quantize \
    --model Qwen/Qwen3-0.6B \
    --output ./quantized \
    --residual-bit-width 4 \
    --cpu-offload-pass2

Python API

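# Assumption: the model is a Hugging Face checkpoint, as in the CLI example.
from transformers import AutoModelForCausalLM

from turboquant_model import (
    TurboQuantConfig,
    quantize_model,
    enable_prefetch_chain,
)

# Load the full-precision model to quantize.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

config = TurboQuantConfig(
    bit_width=4,             # pass 1
    residual_bit_width=4,    # pass 2 (residual)
    cpu_offload_pass2=True,  # keep pass 2 in pinned CPU memory
)
model = quantize_model(model, config)
# enable_prefetch_chain() is called automatically, so no explicit call is needed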

Relationship to Other Techniques

🎯

Residual Quantization

CPU offload is specifically designed for 4+4 residual: it offloads the pass 2 data while pass 1 remains on GPU.

⚑

Fused Kernels

The dual-pass fused kernel integrates with the offload pipeline, consuming scratch buffers directly in a single kernel launch.

Norm Compression

Compressed norms reduce the H2D transfer size per layer, making the offload pipeline even cheaper.

Entropy Coding

Entropy-coded indices must be decoded before the kernel can use them. This decoding happens before the H2D copy, so the scratch slot receives already-decoded data.