Fused GPU Kernels
CuTile and Triton kernels fuse unpack → lookup → matmul → rescale into a single launch. The 64-byte codebook lives in registers.
The Problem: Intermediate Materialization
The naive dequantization pipeline creates multiple intermediate tensors, each requiring a separate kernel launch with a global memory round-trip. For large models, this intermediate materialization dominates both latency and memory.
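To make the cost concrete, here is a CPU-side NumPy sketch of the naive pipeline (shapes, array names, and the codebook layout are illustrative assumptions, not the project's API). Each step materializes a full intermediate array; on the GPU, each of those arrays becomes a kernel launch plus a global-memory write and read:

```python
import numpy as np

rng = np.random.default_rng(0)

# Packed 4-bit weights: two codebook indices per uint8 byte.
K, N = 8, 4                                              # toy tile shape
packed = rng.integers(0, 256, size=(K, N // 2), dtype=np.uint8)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # 16 x 4 B = 64 B
x = rng.standard_normal((2, K)).astype(np.float32)
scales = rng.standard_normal(N).astype(np.float32)

# Step 1: unpack uint8 -> int64 indices (intermediate #1)
idx = np.empty((K, N), dtype=np.int64)
idx[:, 0::2] = packed & 0x0F          # low nibble
idx[:, 1::2] = packed >> 4            # high nibble

# Step 2: codebook lookup -> float32 weights (intermediate #2)
w = codebook[idx]

# Step 3: matrix multiply (intermediate #3)
y = x @ w

# Step 4: rescale (final output)
out = y * scales
```

Four arrays (`idx`, `w`, `y`, `out`) hit memory; the fused kernel keeps the first three in registers.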
Naive vs Fused: Side by Side
Naive Pipeline (4 kernel launches)
1. Unpack uint8 → int64 (Global Memory)
   ↓ write + read
2. Codebook lookup → float32 (Global Memory)
   ↓ write + read
3. Matrix multiply (Global Memory)
   ↓ write + read
4. Rescale (Global Memory)
Fused Kernel (1 kernel launch)
1. Load packed uint8 (Registers)
   ↓ in-register
2. Unpack nibbles, bitwise (Registers)
   ↓ in-register
3. Codebook lookup, 64 B in L1 (Shared Memory)
   ↓ in-register
4. Tensor Core MMA + Rescale (Registers)
   ↓ in-register
5. Store final result (1× Global Write)
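The fused path computes exactly the same result as the naive one; only the data movement changes. A NumPy reference for what the single launch produces (the fusion here is conceptual, since NumPy still materializes temporaries; the function name and argument layout are illustrative, not the project's API):

```python
import numpy as np

def fused_matmul_reference(x, packed, codebook, scales):
    """Reference for the fused kernel: unpack -> lookup -> matmul -> rescale.

    The GPU kernel performs these same steps but keeps every intermediate
    in registers, writing only the final result to global memory.
    """
    K, half = packed.shape
    idx = np.empty((K, 2 * half), dtype=np.intp)
    idx[:, 0::2] = packed & 0x0F   # low nibble -> even output columns
    idx[:, 1::2] = packed >> 4     # high nibble -> odd output columns
    return (x @ codebook[idx]) * scales

rng = np.random.default_rng(1)
packed = rng.integers(0, 256, size=(8, 2), dtype=np.uint8)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
x = rng.standard_normal((2, 8)).astype(np.float32)
scales = np.full(4, 0.5, dtype=np.float32)
out = fused_matmul_reference(x, packed, codebook, scales)
```

Any fused implementation can be validated against a reference like this before benchmarking.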
Kernel Algorithm
Each thread block computes a tile of the output. The codebook (16 × 4 bytes = 64 bytes) fits entirely in registers or L1 cache, making the lookup essentially free:
1. Load packed uint8 bytes from global memory (Global → Registers)
2. Unpack nibbles via bitwise ops (& 0x0F, >> 4) (In Registers)
3. Lookup codebook values (64 bytes in shared memory) (Shared Memory)
4. MMA tensor core multiply-accumulate (TF32/FP16) (Tensor Cores)
5. Rescale by pre-computed norms / √d (In Registers)
6. Store final result to global memory (Registers → Global)
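The unpack in step 2 is two bitwise instructions per byte, and the lookup in step 3 is a 16-entry gather against the 64-byte codebook. A minimal sketch of both, with values chosen purely for illustration:

```python
import numpy as np

# One packed byte carries two 4-bit codebook indices.
byte = np.uint8(0xB7)
low = byte & 0x0F    # 0x07 -> index for the first element
high = byte >> 4     # 0x0B -> index for the second element

# The lookup is a gather into a 16-entry table (16 x 4 B = 64 B),
# small enough to live in L1 or registers on the GPU.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
pair = codebook[[low, high]]
```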
Execution Paths
- CuTile: fastest (CUDA 13.1+, Ampere+)
- Triton: portable (Triton 3.0+)
- PyTorch: fallback (no extra dependencies)
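One common way to wire up such a tiered backend list is to probe for each dependency at import time and fall back down the chain. This is a hedged sketch of that pattern only; the module names probed and the selection logic are assumptions, not the project's actual dispatch code:

```python
def pick_backend() -> str:
    """Return the fastest available backend, probing in priority order.

    Illustrative only: real dispatch would also check GPU architecture
    (e.g. Ampere+ for CuTile) and CUDA/Triton versions, not just imports.
    """
    try:
        import cuda.tile  # hypothetical CuTile binding; name is an assumption
        return "cutile"
    except ImportError:
        pass
    try:
        import triton  # real package, may simply not be installed
        return "triton"
    except ImportError:
        return "pytorch"

backend = pick_backend()
```

Probing at import time keeps the PyTorch path dependency-free while letting faster kernels take over automatically when available.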
Performance Impact
- 3.98× CuTile speedup vs PyTorch (Qwen3.5-4B)
- 5.7× memory reduction, CuTile vs PyTorch
- 64-byte codebook in cache (fits in L1)
Implementation
cutile_kernels.py → cutile_fused_matmul()
triton_kernels.py → triton_fused_matmul()