Welcome to KB Nano, your launchpad for engineering the world’s fastest GPU code. We provide a library of state-of-the-art GPU kernels and a rigorous benchmark suite to push the limits of hardware performance.
Why you’ll love KB Nano
- 🚀 Built from production — Kernel specs, signatures, and baseline implementations are derived directly from SOTA production frameworks — every kernel you write is a drop-in replacement, not a toy exercise
- 🧬 Massive Kernel Coverage: Stress-test 54 frontier architectures, from Linear Attention and Gated Delta Net to DeepSeek DSA, World Models, and Diffusion, alongside the battle-tested ones powering production today
- 🔭 End-to-end benchmarking — Test a single kernel (L1) or measure its ripple effect all the way up to full-model inference (L4)
- 🛡️ Hack-proof validation — Property-based unit tests + end-to-end inference checks catch shortcuts that fool other benchmarks
- 🎯 Actionable profiling — Hardware-level feedback via NVIDIA Nsight Compute tells you exactly where to optimize
- 📏 Reproducible by design — Canonical dataset, standardized evaluation, deterministic results
Quickstart
Explore the baselines
Each supported model architecture is implemented in a modular fashion under tasks/baseline/, organized in four levels from individual kernels (L1) up to full models (L4). See Architecture for details on the hierarchy and interface conventions.
kb_nano/tasks/baseline
L1 — Single-kernel ops
L2 — Multi-op blocks
L3 — Decoder layers
L4 — Full models
Write your optimized version
Place your replacement in kb_nano/tasks/candidate/L{level}/{op_name}.py. You can work at any abstraction level: swap a single kernel (L1), an attention block (L2), a decoder layer (L3), or an entire model (L4).
Candidate files can be scaffolded with the correct signatures and TODOs to fill in.
Your replacement must be an nn.Module with the exact same class name and exact same forward signature as the baseline. You can use pure PyTorch, Triton, inline CUDA, or external libraries.
Benchmark and iterate
kb-nano provides hierarchical benchmarking tools. Start by testing individual kernels in isolation, then measure their end-to-end impact on the full model.
A kernel benchmark is PASS if the candidate output passes allclose against the baseline. Speedup > 1.0 means your implementation is faster than the baseline.
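The candidate contract and the PASS/speedup criteria can be illustrated with a small, self-contained sketch. This is not KB Nano's benchmark harness. The RMSNorm op, the helpers `same_forward_signature`, `median_seconds`, and `compare`, the tolerance values, and the definition of speedup as baseline time divided by candidate time are all assumptions made for illustration. It runs on CPU so it works without a GPU.

```python
import inspect
import time

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Hypothetical stand-in for a baseline L1 op (not taken from KB Nano)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(variance + self.eps) * self.weight


# Keep a handle on the baseline before the candidate reuses the class name.
# In a real layout the two classes would live in separate files.
BaselineRMSNorm = RMSNorm


class RMSNorm(nn.Module):  # same class name as the baseline
    """Hypothetical candidate: same math, written slightly differently."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # same signature
        inv_rms = torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
        return (x * inv_rms) * self.weight


CandidateRMSNorm = RMSNorm


def same_forward_signature(a: type, b: type) -> bool:
    """True if both classes declare identical forward() signatures."""
    return inspect.signature(a.forward) == inspect.signature(b.forward)


def median_seconds(module: nn.Module, x: torch.Tensor, repeats: int = 20) -> float:
    """Median wall-clock time of one forward pass (CPU timing only).

    Timing on a GPU would additionally need torch.cuda.synchronize().
    """
    times = []
    with torch.no_grad():
        module(x)  # warm-up call, not timed
        for _ in range(repeats):
            start = time.perf_counter()
            module(x)
            times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]


def compare(baseline: nn.Module, candidate: nn.Module, x: torch.Tensor,
            rtol: float = 1e-5, atol: float = 1e-6):
    """Return (passed, speedup); the tolerances here are assumed values."""
    with torch.no_grad():
        passed = torch.allclose(candidate(x), baseline(x), rtol=rtol, atol=atol)
    # Assumed definition: speedup > 1.0 means the candidate is faster.
    speedup = median_seconds(baseline, x) / median_seconds(candidate, x)
    return passed, speedup


torch.manual_seed(0)
x = torch.randn(4, 128, 256)
baseline = BaselineRMSNorm(256)
candidate = CandidateRMSNorm(256)
candidate.load_state_dict(baseline.state_dict())  # share identical weights

passed, speedup = compare(baseline, candidate, x)
print("PASS" if passed else "FAIL", f"speedup={speedup:.2f}x")
```

Because both classes do the same arithmetic, the speedup printed here is only timing noise. The point of the sketch is the shape of the check: matching class name, matching forward signature, then correctness before speed.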