CPU Offload (Pass 2)

Pipelined H2D streaming roughly halves the VRAM cost of 4+4 residual: pass 2 lives in CPU memory and is streamed to a shared double-buffered scratch pool via CUDA streams.

Why Offload?

In 4+4 residual quantization, each layer stores two full sets of packed indices, norms, and codebook — one per pass. Pass 2 approximately doubles the VRAM footprint compared to single-pass 4-bit.

For small-batch inference (batch 1), the GPU is busy with on-device kernels while PCIe bandwidth sits mostly idle. CPU offload exploits this by streaming pass 2 data from host memory to the GPU on demand.

For 24 equal-sized layers, only 2 of the 24 pass 2 buffers occupy VRAM at any time: a ~92% reduction in pass 2 VRAM.

Shared Double-Buffered Scratch

Instead of per-layer GPU scratch (which costs the same as just keeping pass 2 on GPU), a single SharedScratchPool holds 2 scratch slots sized to the largest offloaded layer. Layers are assigned alternating slots (ping-pong):

Slot	Layers	Overlap
Slot 0	Even-indexed layers	Consumed while Slot 1 receives H2D
Slot 1	Odd-indexed layers	Consumed while Slot 0 receives H2D

Total GPU cost: 2 × the pass 2 size of the largest offloaded layer, constant regardless of layer count. The rest lives in pinned CPU memory, so DMA transfers can run at full PCIe bandwidth.
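The sizing rule and ping-pong assignment can be sketched in a few lines of plain Python. This is an illustrative model only: `plan_shared_scratch` and the byte counts are hypothetical and are not part of the turboquant API.

```python
def plan_shared_scratch(pass2_bytes):
    """Model a two-slot shared scratch pool.

    pass2_bytes: pass 2 size of each offloaded layer, in layer order.
    Returns (slot_bytes, total_gpu_bytes, slot_of_layer).
    """
    slot_bytes = max(pass2_bytes)      # each slot must fit the largest layer
    total_gpu_bytes = 2 * slot_bytes   # two slots, independent of layer count
    # Ping-pong: even-indexed layers use slot 0, odd-indexed layers use slot 1.
    slot_of_layer = [i % 2 for i in range(len(pass2_bytes))]
    return slot_bytes, total_gpu_bytes, slot_of_layer


# Hypothetical per-layer pass 2 sizes in bytes.
sizes = [4_000_000, 6_000_000, 4_000_000, 6_000_000, 5_000_000]
slot_bytes, total, slots = plan_shared_scratch(sizes)
print(slot_bytes, total, slots)  # 6000000 12000000 [0, 1, 0, 1, 0]
```

Adding more layers to `sizes` changes `slots` but not `total`, unless a new layer is larger than the current maximum.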

Data Layout

Buffer	Location	Scope	Purpose
Pass 1 (indices, norms, codebook)	GPU	Per-layer	Permanent, read by kernels
Pass 2 (pinned copies)	CPU	Per-layer	Source for async H2D
SharedScratchPool ✨	GPU	Global (shared)	2 ping-pong scratch slots

Execution Timeline

Two CUDA streams operate concurrently. The copy stream runs in parallel with compute on the default stream:

Single Layer (No Prefetch)

Copy stream:  ╠══ H2D pass2 ═══╣
Default:      ╠══ pass1 rot ═══╬══ pass1 kernel ═══╬═ wait ═╬══ pass2 kernel ═══╣
                                                      ↑
                                                 wait_event()

With Next-Layer Prefetch

Copy stream:  ╠═ H2D L₀ pass2 ═╬═ H2D L₁ pass2 ═╬═ H2D L₂ pass2 ═╣
Default:      ╠═ L₀ p1 ═╬ wait ╬═ L₀ p2 ═╬═ L₁ p1 ═╬ wait ╬═ L₁ p2 ═╣
                          ↑     prefetch L₁→           ↑
                     wait event₀                  wait event₁
                     (free)                       (usually free)

With Dual-Pass Fused Kernel

Copy stream:  ╠═ H2D pass2 ═══════╣
Default:      ╠═ rotations (both) ═╬═ wait ═╬═ dual_fused_kernel ═══╣
                                      ↑
                                 wait_event()

Latency Impact

For a typical layer (), the pass 2 H2D transfer takes ~0.08 ms on PCIe 4.0 x16. The pass 1 kernel takes 0.2–0.5 ms, so the copy is fully hidden.

Batch	Pass 1 Time	H2D Time	Overhead
1	0.3 ms	0.08 ms	0% (hidden)
8	0.8 ms	0.08 ms	0% (hidden)
32	2.5 ms	0.08 ms	0% (hidden)
128	9 ms	0.08 ms	0% (hidden)

H2D time is constant regardless of batch size — hidden at all practical workloads.
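As a rough check of the table, the sketch below models the single-layer timeline, assuming the H2D copy is launched at the same moment pass 1 starts. Under that assumption the exposed wait is whatever part of the copy outlasts pass 1. The function name `exposed_wait_ms` is hypothetical, and the timings are the ones quoted in the table.

```python
def exposed_wait_ms(pass1_ms, h2d_ms):
    """Time the default stream stalls waiting for the copy (0 if fully hidden)."""
    return max(0.0, h2d_ms - pass1_ms)


# (batch, pass 1 time in ms, H2D time in ms), taken from the table above.
rows = [(1, 0.3, 0.08), (8, 0.8, 0.08), (32, 2.5, 0.08), (128, 9.0, 0.08)]

for batch, pass1_ms, h2d_ms in rows:
    wait = exposed_wait_ms(pass1_ms, h2d_ms)
    print(f"batch {batch}: exposed wait {wait:.2f} ms")
```

The copy would only become visible if pass 1 finished in less time than the transfer, which none of the listed batch sizes approach.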

Memory Budget

For N equal-sized layers, each with pass 2 size S:

Mode	GPU (pass 2)	CPU
Non-offloaded	N × S	0
CPU offload ✨	2 × S (constant)	N × S (pinned)

Qwen3.5-0.8B Example (24 layers)

Mode	VRAM (weights)	CPU (pinned)
bf16 baseline	~1.6 GB	—
4-bit single	~0.4 GB	—
4+4 residual	~0.8 GB	—
4+4 CPU offload ✨	~0.43 GB	~0.4 GB
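The offload row can be reproduced with a short calculation. It assumes 24 equal-sized layers and a total pass 2 size of about 0.4 GB (the difference between the 4+4 residual and 4-bit single rows). The helper `pass2_budget` is hypothetical.

```python
def pass2_budget(num_layers, pass2_total_gb):
    """Return (pass 2 GB on GPU, pass 2 GB pinned on CPU, fractional VRAM reduction)."""
    per_layer_gb = pass2_total_gb / num_layers
    gpu_gb = 2 * per_layer_gb              # two shared scratch slots
    reduction = 1 - gpu_gb / pass2_total_gb
    return gpu_gb, pass2_total_gb, reduction


pass1_gb = 0.4  # pass 1 stays on the GPU (same as the 4-bit single row)
gpu_p2, cpu_p2, reduction = pass2_budget(num_layers=24, pass2_total_gb=0.4)
print(f"VRAM: {pass1_gb + gpu_p2:.2f} GB, pinned CPU: {cpu_p2:.1f} GB, "
      f"pass 2 VRAM reduction: {reduction:.0%}")
# VRAM: 0.43 GB, pinned CPU: 0.4 GB, pass 2 VRAM reduction: 92%
```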

Synchronization: CUDA Events

The pipeline uses CUDA events for device-side synchronization — no CPU blocking:

Copy stream records event

Immediately after the H2D copy is enqueued, record_event() marks that position in the copy stream. The event completes once the copy has finished.

Default stream waits on event

wait_event() is non-blocking to the CPU — only the GPU pauses if the copy isn't done yet.

Prefetch the next layer

At the end of each layer's forward, start H2D for the next layer onto the copy stream. By the time it's needed, the copy is likely complete.

Using events here is critical: stream.synchronize() would block the CPU and prevent it from submitting the next kernel, whereas wait_event() keeps the CPU free.
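The record/wait pattern can be sketched with standard PyTorch stream and event calls. This is a minimal single-layer illustration, not the library's implementation: `forward_with_offloaded_pass2`, `pass1_fn`, `pass2_fn`, and the plain matmuls in `demo` are hypothetical stand-ins for the real kernels, and slot reuse across layers is not modeled.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def forward_with_offloaded_pass2(x, pass1_fn, pass2_fn, pass2_cpu, scratch, copy_stream):
    """Run pass 1 while pass 2 data streams into `scratch`, then run pass 2."""
    copy_done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # Async H2D copy; the source must be pinned to overlap with compute.
        scratch.copy_(pass2_cpu, non_blocking=True)
        # Enqueued right after the copy, so the event completes when the copy does.
        copy_done.record(copy_stream)

    y = pass1_fn(x)  # runs on the current (default) stream, overlapping the copy

    # Device-side wait: returns to the CPU immediately, only the stream stalls.
    torch.cuda.current_stream().wait_event(copy_done)
    return y + pass2_fn(x, scratch)


def demo():
    """Return True if the overlapped result matches a reference, None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    dev = torch.device("cuda")
    w1 = torch.randn(64, 64, device=dev)
    w2_cpu = torch.randn(64, 64).pin_memory()  # pinned host copy of "pass 2"
    scratch = torch.empty(w2_cpu.shape, dtype=w2_cpu.dtype, device=dev)
    x = torch.randn(4, 64, device=dev)
    y = forward_with_offloaded_pass2(
        x, lambda t: t @ w1, lambda t, w: t @ w, w2_cpu, scratch, torch.cuda.Stream()
    )
    torch.cuda.synchronize()  # only for the check below, not part of the pipeline
    return torch.allclose(y, x @ w1 + x @ w2_cpu.to(dev), atol=1e-4)


print(demo())
```

The only host-side synchronization is the final `torch.cuda.synchronize()` in `demo`, which exists purely to validate the result.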

Usage

CLI

turboquant quantize \
    --model Qwen/Qwen3-0.6B \
    --output ./quantized \
    --residual-bit-width 4 \
    --cpu-offload-pass2

Python API

from turboquant_model import (
    TurboQuantConfig,
    quantize_model,
    enable_prefetch_chain,
)

config = TurboQuantConfig(
    bit_width=4,
    residual_bit_width=4,
    cpu_offload_pass2=True,
)
model = quantize_model(model, config)
# enable_prefetch_chain() is called automatically

Relationship to Other Techniques
