

Welcome to KB Nano, your launchpad for engineering the world’s fastest GPU code. We provide a library of state-of-the-art GPU kernels and a rigorous benchmark suite to push the limits of hardware performance.

Why you’ll love KB Nano

  • 🚀 Built from production — Kernel specs, signatures, and baseline implementations are derived directly from SOTA production frameworks — every kernel you write is a drop-in replacement, not a toy exercise
  • 🧬 Massive kernel coverage: Stress-test 54 frontier architectures, including Linear Attention, Gated Delta Net, DeepSeek DSA, World Models, and Diffusion, alongside the battle-tested ones powering production today
  • 🔭 End-to-end benchmarking — Test a single kernel (L1) or measure its ripple effect all the way up to full-model inference (L4)
  • 🛡️ Hack-proof validation — Property-based unit tests + end-to-end inference checks catch shortcuts that fool other benchmarks
  • 🎯 Actionable profiling — Hardware-level feedback via NVIDIA Nsight Compute tells you exactly where to optimize
  • 📏 Reproducible by design — Canonical dataset, standardized evaluation, deterministic results

Quickstart

1. Clone and install

git clone git@github.com:sfc-gh-goliaro/kb_nano.git
cd kb_nano
pip install .

2. Explore the baselines

Each supported model architecture is implemented in a modular fashion under tasks/baseline/, organized in four levels from individual kernels (L1) up to full models (L4). See Architecture for details on the hierarchy and interface conventions.
kb_nano/tasks/baseline

3. Write your optimized version

Place your replacement in kb_nano/tasks/candidate/L{level}/{op_name}.py. You can work at any abstraction level: swap a single kernel (L1), an attention block (L2), a decoder layer (L3), or an entire model (L4).

To scaffold candidate files with the correct signatures and TODOs to fill in:
python agent/create_stubs.py                      # all operators
python agent/create_stubs.py --level 1             # L1 only
python agent/create_stubs.py --architecture llama  # Llama only
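
As a small illustration of the candidate path template, the sketch below resolves kb_nano/tasks/candidate/L{level}/{op_name}.py for a given level and operator. The helper `candidate_path` and the `LEVELS` table are hypothetical names introduced here, not part of KB Nano, and the example assumes rms_norm is an L1 operator.

```python
from pathlib import Path

# Level descriptions taken from the text above; this table is illustrative only.
LEVELS = {1: "kernel", 2: "attention block", 3: "decoder layer", 4: "full model"}


def candidate_path(level: int, op_name: str,
                   root: str = "kb_nano/tasks/candidate") -> Path:
    """Hypothetical helper: build the file path for a candidate implementation."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {sorted(LEVELS)}, got {level}")
    return Path(root) / f"L{level}" / f"{op_name}.py"


# Assumption: rms_norm lives at level 1 (it is benchmarked as a kernel below).
print(candidate_path(1, "rms_norm").as_posix())
# prints: kb_nano/tasks/candidate/L1/rms_norm.py
```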
Your replacement must be an nn.Module with the exact same class name and exact same forward signature as the baseline. You can use pure PyTorch, Triton, inline CUDA, or external libraries.
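
The interface requirement can be checked mechanically before running any benchmark. The following is a minimal sketch using only the standard library; `same_interface` is a hypothetical helper, and the `RMSNorm` classes with their `forward(hidden_states, eps)` parameters are stand-ins invented for illustration (plain classes rather than real nn.Module subclasses, to keep the example dependency-free). The actual baseline names and signatures come from the files under tasks/baseline/.

```python
import inspect


def same_interface(baseline_cls: type, candidate_cls: type) -> bool:
    """Return True if class names and forward() signatures match exactly."""
    if baseline_cls.__name__ != candidate_cls.__name__:
        return False
    # Signature equality covers parameter names, order, kinds, and defaults.
    return (inspect.signature(baseline_cls.forward)
            == inspect.signature(candidate_cls.forward))


# Namespaces standing in for a baseline module and two candidate modules.
class baseline:
    class RMSNorm:
        def forward(self, hidden_states, eps=1e-6):
            return hidden_states


class candidate:
    class RMSNorm:
        def forward(self, hidden_states, eps=1e-6):
            return hidden_states


class renamed_arg:
    class RMSNorm:
        def forward(self, x, eps=1e-6):  # parameter renamed: not a drop-in
            return x


print(same_interface(baseline.RMSNorm, candidate.RMSNorm))    # True
print(same_interface(baseline.RMSNorm, renamed_arg.RMSNorm))  # False
```

Comparing full signatures rather than only argument counts catches renamed keyword arguments, which would break callers that pass arguments by name.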

4. Benchmark and iterate

KB Nano provides hierarchical benchmarking tools. Start by testing individual kernels in isolation, then measure their end-to-end impact on the full model:
# Kernel-level: correctness + speedup for a single operator
kb_nano kernels --target rms_norm

# E2E: throughput with your candidates swapped into the model
kb_nano eval \
    --model meta-llama/Llama-3.1-8B-Instruct
A kernel benchmark reports PASS if the candidate output matches the baseline under an allclose check. A speedup greater than 1.0 means your implementation is faster than the baseline.

See More benchmarking options for further details.
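
To make the pass criterion and the speedup figure concrete, here is a dependency-free sketch. It assumes the elementwise rule that torch.allclose uses, |candidate - baseline| <= atol + rtol * |baseline|, with that function's default tolerances; the tolerances KB Nano actually applies are not stated here. It also assumes speedup is defined as baseline time divided by candidate time, which is consistent with values above 1.0 meaning the candidate is faster. Both function names are hypothetical.

```python
def allclose(candidate, baseline, rtol=1e-5, atol=1e-8):
    """Elementwise |c - b| <= atol + rtol * |b| over two flat sequences.

    Tolerances are the torch.allclose defaults, used here as an assumption.
    """
    if len(candidate) != len(baseline):
        return False
    return all(abs(c - b) <= atol + rtol * abs(b)
               for c, b in zip(candidate, baseline))


def speedup(baseline_seconds, candidate_seconds):
    """Assumed definition: values above 1.0 mean the candidate is faster."""
    return baseline_seconds / candidate_seconds


baseline_out = [1.0, 2.0, 3.0]
candidate_out = [1.0, 2.000001, 3.0]   # tiny numerical drift, within tolerance
wrong_out = [1.0, 2.1, 3.0]            # clearly different result

print("PASS" if allclose(candidate_out, baseline_out) else "FAIL")  # PASS
print("PASS" if allclose(wrong_out, baseline_out) else "FAIL")      # FAIL
print(speedup(baseline_seconds=2.0, candidate_seconds=1.0))         # 2.0
```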