🚀 Fused GPU Kernels

CuTile and Triton kernels fuse unpack → lookup → matmul → rescale into a single launch. The 64-byte codebook stays on-chip for the entire computation.

The Problem: Intermediate Materialization

The naive dequantization pipeline creates multiple intermediate tensors, each requiring a separate kernel launch with a global memory round-trip. For large models, this intermediate materialization dominates both latency and memory.
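As a concrete illustration, here is a minimal NumPy sketch of the four-stage naive pipeline. The function name, the low-nibble-first packing order, and the per-row scales are illustrative assumptions; on a GPU, each commented stage would be a separate kernel launch that writes its intermediate to global memory and reads it back:

```python
import numpy as np

def naive_dequant_matmul(packed, codebook, scales, x):
    """Naive pipeline sketch: every stage materializes a full intermediate."""
    # 1. Unpack: two 4-bit codes per uint8 byte -> int64 index tensor
    lo = (packed & 0x0F).astype(np.int64)
    hi = (packed >> 4).astype(np.int64)
    idx = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    # 2. Codebook lookup: 16-entry float32 table -> full float32 weight matrix
    w = codebook[idx]
    # 3. Matrix multiply
    y = x @ w.T
    # 4. Rescale by per-output-row scales
    return y * scales

rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # 16 x 4 B = 64 B
packed = rng.integers(0, 256, size=(8, 16), dtype=np.uint8)  # 8 rows, 32 cols packed
scales = rng.random(8, dtype=np.float32)
x = rng.random((4, 32), dtype=np.float32)
y = naive_dequant_matmul(packed, codebook, scales, x)
print(y.shape)  # (4, 8)
```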

Naive vs Fused: Side by Side

Naive Pipeline (4 kernel launches)

📦 Unpack uint8 → int64 (Global Memory)
↓ write + read
📖 Codebook lookup → float32 (Global Memory)
↓ write + read
✖️ Matrix multiply (Global Memory)
↓ write + read
⚖️ Rescale (Global Memory)

Fused Kernel (1 kernel launch)

📦 Load packed uint8 (Registers)
↓ in-register
🔓 Unpack nibbles (bitwise) (Registers)
↓ in-register
📖 Codebook (64 B in L1) (Shared Mem)
↓ in-register
✖️ Tensor Core MMA + Rescale (Registers)
↓ in-register
💾 Store final result (1× Global Write)
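A back-of-envelope traffic count shows why the round-trips dominate. The numbers below are illustrative assumptions (a single 4096×4096 weight, int64 unpack indices, float32 intermediates) and count only the weight path, ignoring activations and the final output write, which both pipelines share:

```python
# Approximate global-memory traffic for one 4-bit-quantized 4096x4096 weight.
M = K = 4096
n = M * K                 # number of weight elements

packed = n // 2           # uint8 source: two 4-bit codes per byte
indices = n * 8           # int64 index tensor (naive stage 1 output)
dequant = n * 4           # float32 weight tensor (naive stage 2 output)

# Naive: each intermediate makes a full round-trip (write, then read back).
naive_bytes = packed + 2 * indices + 2 * dequant
# Fused: the packed bytes are read once; everything else stays in registers.
fused_bytes = packed

print(f"naive ≈ {naive_bytes / 2**20:.0f} MiB, fused ≈ {fused_bytes / 2**20:.0f} MiB")
print(f"weight-path traffic reduction ≈ {naive_bytes / fused_bytes:.1f}x")
```

The int64 index tensor alone is 16× larger than the packed weights it was unpacked from, which is why skipping its materialization matters so much.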

Kernel Algorithm

Each thread block computes a tile of the output. The codebook (16 entries × 4 bytes = 64 bytes) fits entirely in shared memory, and therefore in L1, making each lookup essentially free:

1. Load packed uint8 bytes from global memory (Global → Registers)
2. Unpack nibbles via bitwise ops (& 0x0F, >> 4) (In Registers)
3. Look up codebook values (64 bytes in shared memory) (Shared Memory)
4. Tensor-core multiply-accumulate, MMA (TF32/FP16) (Tensor Cores)
5. Rescale by pre-computed norms / √d (In Registers)
6. Store the final result to global memory (Registers → Global)
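The steps above can be sketched at tile granularity in NumPy. This is a hedged illustration, not the CUDA kernel: the function name, tile size, and packing order are assumptions, and the real kernel keeps each tile in registers rather than in Python arrays. The point it demonstrates is that the full float32 weight matrix is never materialized:

```python
import numpy as np

def fused_tile_matmul(packed, codebook, scales, x, tile=8):
    """One pass over the packed weights, dequantizing a tile at a time."""
    n_out, n_packed = packed.shape
    y = np.zeros((x.shape[0], n_out), dtype=np.float32)
    for j0 in range(0, n_packed, tile):
        blk = packed[:, j0:j0 + tile]                 # step 1: load packed bytes
        lo, hi = blk & 0x0F, blk >> 4                 # step 2: bitwise unpack
        idx = np.stack([lo, hi], axis=-1).reshape(n_out, -1)
        w_tile = codebook[idx]                        # step 3: 64-byte codebook lookup
        y += x[:, 2 * j0:2 * (j0 + tile)] @ w_tile.T  # step 4: multiply-accumulate
    return y * scales                                 # step 5: rescale

rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
packed = rng.integers(0, 256, size=(8, 16), dtype=np.uint8)
scales = rng.random(8, dtype=np.float32)
x = rng.random((4, 32), dtype=np.float32)
y = fused_tile_matmul(packed, codebook, scales, x)
print(y.shape)  # (4, 8)
```

Only the per-tile `w_tile` ever exists in dequantized form, mirroring how the kernel holds just one tile's worth of weights in registers at a time.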

Execution Paths

CuTile: fastest (CUDA 13.1+, Ampere+)
→ Triton: portable (Triton 3.0+)
→ PyTorch: fallback (no deps)
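A minimal dispatch sketch for these paths, assuming the modules listed under Implementation export the fused entry points; the dependency-free eager path is stubbed as `None` here since it is ordinary PyTorch tensor code rather than a separate kernel:

```python
def pick_backend():
    """Return (name, fn) for the fastest available execution path (sketch)."""
    try:
        # Fastest path: requires CUDA 13.1+ and an Ampere-or-newer GPU.
        from cutile_kernels import cutile_fused_matmul
        return "cutile", cutile_fused_matmul
    except ImportError:
        pass
    try:
        # Portable path: requires Triton 3.0+.
        from triton_kernels import triton_fused_matmul
        return "triton", triton_fused_matmul
    except ImportError:
        pass
    # Dependency-free fallback: plain eager PyTorch ops (stub placeholder).
    return "pytorch", None

backend, fn = pick_backend()
print(backend)
```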

Performance Impact

3.98× CuTile speedup vs PyTorch (Qwen3.5-4B)
5.7× memory reduction, CuTile vs PyTorch
64-byte codebook (fits entirely in L1)

Implementation

cutile_kernels.py → cutile_fused_matmul()
triton_kernels.py → triton_fused_matmul()