CPU Offload (Pass 2)
Pipelined H2D streaming halves the VRAM cost of 4+4 residual β pass 2 lives on CPU, streamed to a shared double-buffered scratch pool via CUDA streams.
Why Offload?
In 4+4 residual quantization, each layer stores two full sets of packed indices, norms, and codebook β one per pass. Pass 2 approximately doubles the VRAM footprint compared to single-pass 4-bit.
For small-batch inference (batch 1), the GPU is compute-bound β PCIe bandwidth sits mostly idle. CPU offload exploits this by streaming pass 2 data from host memory to GPU on-demand.
For 24 equal-sized layers: ~92% reduction in pass 2 VRAM
Shared Double-Buffered Scratch
Instead of per-layer GPU scratch (which costs the same as just keeping pass 2 on GPU), a single SharedScratchPool holds 2 scratch slots sized to the largest offloaded layer. Layers are assigned alternating slots (ping-pong):
Total GPU cost: β constant regardless of layer count. The rest lives in pinned CPU memory for DMA transfers at full PCIe bandwidth.
Data Layout
| Buffer | Location | Scope | Purpose |
|---|---|---|---|
| Pass 1 (indices, norms, codebook) | GPU | Per-layer | Permanent, read by kernels |
| Pass 2 (pinned copies) | CPU | Per-layer | Source for async H2D |
| SharedScratchPool β¨ | GPU | Global (shared) | 2 ping-pong scratch slots |
Execution Timeline
Two CUDA streams operate concurrently. The copy stream runs in parallel with compute on the default stream:
Single Layer (No Prefetch)
Copy stream: β ββ H2D pass2 ββββ£
Default: β ββ pass1 rot ββββ¬ββ pass1 kernel ββββ¬β wait ββ¬ββ pass2 kernel ββββ£
β
wait_event()With Next-Layer Prefetch
Copy stream: β β H2D Lβ pass2 ββ¬β H2D Lβ pass2 ββ¬β H2D Lβ pass2 ββ£
Default: β β Lβ p1 ββ¬ wait β¬β Lβ p2 ββ¬β Lβ p1 ββ¬ wait β¬β Lβ p2 ββ£
β prefetch Lββ β
wait eventβ wait eventβ
(free) (usually free)With Dual-Pass Fused Kernel
Copy stream: β β H2D pass2 ββββββββ£
Default: β β rotations (both) ββ¬β wait ββ¬β dual_fused_kernel ββββ£
β
wait_event()Latency Impact
For a typical layer (), the pass 2 H2D transfer takes ~0.08 ms on PCIe 4.0 x16. The pass 1 kernel takes 0.2β0.5 ms, so the copy is fully hidden.
| Batch | Pass 1 Time | H2D Time | Overhead |
|---|---|---|---|
| 1 | 0.3 ms | 0.08 ms | 0% (hidden) |
| 8 | 0.8 ms | 0.08 ms | 0% (hidden) |
| 32 | 2.5 ms | 0.08 ms | 0% (hidden) |
| 128 | 9 ms | 0.08 ms | 0% (hidden) |
H2D time is constant regardless of batch size β hidden at all practical workloads.
Memory Budget
For equal-sized layers with pass 2 size :
| Mode | GPU (pass 2) | CPU |
|---|---|---|
| Non-offloaded | 0 | |
| CPU offload β¨ | (constant) | (pinned) |
Qwen3.5-0.8B Example (24 layers)
| Mode | VRAM (weights) | CPU (pinned) |
|---|---|---|
| bf16 baseline | ~1.6 GB | β |
| 4-bit single | ~0.4 GB | β |
| 4+4 residual | ~0.8 GB | β |
| 4+4 CPU offload β¨ | ~0.43 GB | ~0.4 GB |
Synchronization: CUDA Events
The pipeline uses CUDA events for device-side synchronization β no CPU blocking:
Copy stream records event
After H2D copy completes, record_event() marks the position in the copy stream.
Default stream waits on event
wait_event() is non-blocking to the CPU β only the GPU pauses if the copy isn't done yet.
Prefetch the next layer
At the end of each layer's forward, start H2D for the next layer onto the copy stream. By the time it's needed, the copy is likely complete.
This is critical: stream.synchronize() blocks the CPU, preventing it from submitting the next kernel. wait_event() keeps the CPU free.
Usage
CLI
turboquant quantize \
--model Qwen/Qwen3-0.6B \
--output ./quantized \
--residual-bit-width 4 \
--cpu-offload-pass2Python API
from turboquant_model import (
TurboQuantConfig,
quantize_model,
enable_prefetch_chain,
)
config = TurboQuantConfig(
bit_width=4,
residual_bit_width=4,
cpu_offload_pass2=True,
)
model = quantize_model(model, config)
# enable_prefetch_chain() is called automaticallyRelationship to Other Techniques
Residual Quantization
CPU offload is specifically designed for 4+4 residual β it offloads the pass 2 data while pass 1 remains on GPU.
Fused Kernels
The dual-pass fused kernel integrates with the offload pipeline, consuming scratch buffers directly in a single kernel launch.
Norm Compression
Compressed norms reduce the H2D transfer size per layer, making the offload pipeline even cheaper.
Entropy Coding
Entropy-coded indices must be decoded before the kernel can use them β this decoding happens before the H2D copy.