Fused GPU Kernels
CuTile and Triton kernels fuse unpack → lookup → matmul → rescale into a single launch. The 64-byte codebook lives in registers.
The Problem: Intermediate Materialization
The naive dequantization pipeline creates multiple intermediate tensors, each requiring a separate kernel launch with a global memory round-trip. For large models, this intermediate materialization dominates both latency and memory.
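To make the cost concrete, here is a CPU-side NumPy sketch of the naive pipeline (shapes, array names, and the codebook layout are illustrative assumptions, not the project's API). Each step materializes a full intermediate array; on the GPU, each of those arrays becomes a kernel launch plus a global-memory write and read:

```python
import numpy as np

rng = np.random.default_rng(0)

# Packed 4-bit weights: two codebook indices per uint8 byte.
K, N = 8, 4                                              # toy tile shape
packed = rng.integers(0, 256, size=(K, N // 2), dtype=np.uint8)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # 16 x 4 B = 64 B
x = rng.standard_normal((2, K)).astype(np.float32)
scales = rng.standard_normal(N).astype(np.float32)

# Step 1: unpack uint8 -> int64 indices (intermediate #1)
idx = np.empty((K, N), dtype=np.int64)
idx[:, 0::2] = packed & 0x0F          # low nibble
idx[:, 1::2] = packed >> 4            # high nibble

# Step 2: codebook lookup -> float32 weights (intermediate #2)
w = codebook[idx]

# Step 3: matrix multiply (intermediate #3)
y = x @ w

# Step 4: rescale (final output)
out = y * scales
```

Four arrays (`idx`, `w`, `y`, `out`) hit memory; the fused kernel keeps the first three in registers.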
Naive vs Fused: Side by Side
Naive Pipeline (4 kernel launches)
1. Unpack uint8 → int64 (Global Memory)
   ↓ write + read
2. Codebook lookup → float32 (Global Memory)
   ↓ write + read
3. Matrix multiply (Global Memory)
   ↓ write + read
4. Rescale (Global Memory)
Fused Kernel (1 kernel launch)
1. Load packed uint8 (Registers)
   ↓ in-register
2. Unpack nibbles, bitwise (Registers)
   ↓ in-register
3. Codebook lookup, 64 B in L1 (Shared Memory)
   ↓ in-register
4. Tensor Core MMA + Rescale (Registers)
   ↓ in-register
5. Store final result (1× Global Write)
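The fused path computes exactly the same result as the naive one; only the data movement changes. A NumPy reference for what the single launch produces (the fusion here is conceptual, since NumPy still materializes temporaries; the function name and argument layout are illustrative, not the project's API):

```python
import numpy as np

def fused_matmul_reference(x, packed, codebook, scales):
    """Reference for the fused kernel: unpack -> lookup -> matmul -> rescale.

    The GPU kernel performs these same steps but keeps every intermediate
    in registers, writing only the final result to global memory.
    """
    K, half = packed.shape
    idx = np.empty((K, 2 * half), dtype=np.intp)
    idx[:, 0::2] = packed & 0x0F   # low nibble -> even output columns
    idx[:, 1::2] = packed >> 4     # high nibble -> odd output columns
    return (x @ codebook[idx]) * scales

rng = np.random.default_rng(1)
packed = rng.integers(0, 256, size=(8, 2), dtype=np.uint8)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
x = rng.standard_normal((2, 8)).astype(np.float32)
scales = np.full(4, 0.5, dtype=np.float32)
out = fused_matmul_reference(x, packed, codebook, scales)
```

Any fused implementation can be validated against a reference like this before benchmarking.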
Kernel Algorithm
Each thread block computes a tile of the output. The codebook (16 × 4 bytes = 64 bytes) fits entirely in registers or L1 cache, making the lookup essentially free:
1. Load packed uint8 bytes from global memory (Global → Registers)
2. Unpack nibbles via bitwise ops (& 0x0F, >> 4) (In Registers)
3. Lookup codebook values (64 bytes in shared memory) (Shared Memory)
4. MMA tensor core multiply-accumulate (TF32/FP16) (Tensor Cores)
5. Rescale by pre-computed norms / √d (In Registers)
6. Store final result to global memory (Registers → Global)
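The unpack in step 2 is two bitwise instructions per byte, and the lookup in step 3 is a 16-entry gather against the 64-byte codebook. A minimal sketch of both, with values chosen purely for illustration:

```python
import numpy as np

# One packed byte carries two 4-bit codebook indices.
byte = np.uint8(0xB7)
low = byte & 0x0F    # 0x07 -> index for the first element
high = byte >> 4     # 0x0B -> index for the second element

# The lookup is a gather into a 16-entry table (16 x 4 B = 64 B),
# small enough to live in L1 or registers on the GPU.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
pair = codebook[[low, high]]
```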
Execution Paths
- CuTile: fastest (CUDA 13.1+, Ampere+)
- Triton: portable (Triton 3.0+)
- PyTorch: fallback (no extra dependencies)
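One common way to wire up such a tiered backend list is to probe for each dependency at import time and fall back down the chain. This is a hedged sketch of that pattern only; the module names probed and the selection logic are assumptions, not the project's actual dispatch code:

```python
def pick_backend() -> str:
    """Return the fastest available backend, probing in priority order.

    Illustrative only: real dispatch would also check GPU architecture
    (e.g. Ampere+ for CuTile) and CUDA/Triton versions, not just imports.
    """
    try:
        import cuda.tile  # hypothetical CuTile binding; name is an assumption
        return "cutile"
    except ImportError:
        pass
    try:
        import triton  # real package, may simply not be installed
        return "triton"
    except ImportError:
        return "pytorch"

backend = pick_backend()
```

Probing at import time keeps the PyTorch path dependency-free while letting faster kernels take over automatically when available.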
Performance Impact
- 3.98× CuTile speedup vs PyTorch (Qwen3.5-4B)
- 5.7× memory reduction, CuTile vs PyTorch
- 64-byte codebook in cache (fits in L1)
Implementation
cutile_kernels.py → cutile_fused_matmul()
triton_kernels.py → triton_fused_matmul()