Intro to KB Nano

Welcome to KB Nano, your launchpad for engineering the world’s fastest GPU code. We provide a library of state-of-the-art GPU kernels and a rigorous benchmark suite to push the limits of hardware performance.

Why you’ll love KB Nano

🚀 Built from production — Kernel specs, signatures, and baseline implementations are derived directly from SOTA production frameworks — every kernel you write is a drop-in replacement, not a toy exercise
🧬 Massive Kernel Coverage: Stress-test 54 frontier architectures, from Linear Attention and Gated Delta Net to DeepSeek DSA, World Models, and Diffusion, alongside the battle-tested ones powering production today
🔭 End-to-end benchmarking — Test a single kernel (L1) or measure its ripple effect all the way up to full-model inference (L4)
🛡️ Hack-proof validation — Property-based unit tests + end-to-end inference checks catch shortcuts that fool other benchmarks
🎯 Actionable profiling — Hardware-level feedback via NVIDIA Nsight Compute tells you exactly where to optimize
📏 Reproducible by design — Canonical dataset, standardized evaluation, deterministic results

Quickstart

Clone and install

git clone git@github.com:sfc-gh-goliaro/kb_nano.git
cd kb_nano
pip install .

Explore the baselines

Each supported model architecture is implemented in a modular fashion under tasks/baseline/, organized in four levels from individual kernels (L1) up to full models (L4). See Architecture for details on the hierarchy and interface conventions.

kb_nano/tasks/baseline
    L1: Single-kernel ops
    L2: Multi-op blocks
    L3: Decoder layers
    L4: Full models

Write your optimized version

Place your replacement in kb_nano/tasks/candidate/L{level}/{op_name}.py. You can work at any abstraction level: swap a single kernel (L1), an attention block (L2), a decoder layer (L3), or an entire model (L4).

To scaffold candidate files with the correct signatures and TODOs to fill in:

python agent/create_stubs.py                      # all operators
python agent/create_stubs.py --level 1             # L1 only
python agent/create_stubs.py --architecture llama  # Llama only
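The candidate path convention can also be expressed in a few lines of Python. This is only an illustrative sketch: `candidate_path` is a hypothetical helper, not part of KB Nano, and it assumes levels are numbered 1 through 4 as described above.

```python
from pathlib import Path


def candidate_path(level: int, op_name: str,
                   root: str = "kb_nano/tasks/candidate") -> Path:
    """Build the expected location of a candidate file (hypothetical helper)."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"level must be between 1 and 4, got {level}")
    # Follows the documented pattern: L{level}/{op_name}.py
    return Path(root) / f"L{level}" / f"{op_name}.py"


print(candidate_path(1, "rms_norm").as_posix())
# kb_nano/tasks/candidate/L1/rms_norm.py
```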

Your replacement must be an nn.Module with the exact same class name and exact same forward signature as the baseline. You can use pure PyTorch, Triton, inline CUDA, or external libraries.

Example: replacing RMSNorm

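# tasks/candidate/L1/rms_norm.py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x, residual=None):
        input_dtype = x.dtype  # remember the caller's dtype before upcasting
        if residual is not None:
            # Fused residual add; the sum is also returned as the new residual
            x = x + residual
            residual = x
        # Compute the statistics in float32 for numerical stability
        x = x.float()
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        # Scale, then cast back so the output dtype matches the input
        result = (self.weight * x).to(input_dtype)
        if residual is not None:
            return result, residual
        return result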
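The same-name, same-signature requirement can be sanity-checked locally before running any benchmark. The sketch below is an assumption about how such a check might look, not KB Nano's own validation: `same_interface` is a hypothetical helper built on the standard `inspect` module, and the stand-in classes are plain Python classes so the example runs without PyTorch.

```python
import inspect


def same_interface(baseline_cls: type, candidate_cls: type) -> list:
    """Return a list of mismatches; an empty list means the interfaces agree."""
    problems = []
    if baseline_cls.__name__ != candidate_cls.__name__:
        problems.append(
            f"class name: {baseline_cls.__name__!r} vs {candidate_cls.__name__!r}"
        )
    base_sig = inspect.signature(baseline_cls.forward)
    cand_sig = inspect.signature(candidate_cls.forward)
    # Signature equality compares names, kinds, defaults, and annotations
    if base_sig != cand_sig:
        problems.append(f"forward signature: {base_sig} vs {cand_sig}")
    return problems


# Stand-ins for a baseline module and two candidates (hypothetical)
class Baseline:
    class RMSNorm:
        def forward(self, x, residual=None): ...


class Candidate:
    class RMSNorm:
        def forward(self, x, residual=None): ...


class Broken:
    class RMSNorm:
        def forward(self, x): ...  # dropped the residual argument


print(same_interface(Baseline.RMSNorm, Candidate.RMSNorm))  # []
print(same_interface(Baseline.RMSNorm, Broken.RMSNorm))     # one mismatch
```

Because `inspect.Signature` equality also compares annotations, this check is stricter than call compatibility: a candidate that adds type hints the baseline lacks would be flagged too.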

Benchmark and iterate

KB Nano provides hierarchical benchmarking tools. Start by testing individual kernels in isolation, then measure their end-to-end impact on the full model:

# Kernel-level: correctness + speedup for a single operator
kb_nano kernels --target rms_norm

# E2E: throughput with your candidates swapped into the model
kb_nano eval \
    --model meta-llama/Llama-3.1-8B-Instruct

A kernel benchmark reports PASS if the candidate output matches the baseline output under an allclose check. A speedup greater than 1.0 means your implementation is faster than the baseline.

See More benchmarking options for details.
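To make the PASS and speedup criteria concrete, here is a minimal pure-Python sketch. It is not KB Nano's harness: the tolerances `rtol` and `atol`, the helper names, and the median-of-repeats timing are all assumptions chosen for illustration. The comparison rule itself, |c - b| <= atol + rtol * |b|, is the standard allclose definition.

```python
import time


def allclose(candidate, baseline, rtol=1e-3, atol=1e-3):
    """Elementwise check |c - b| <= atol + rtol * |b| (tolerances are assumed)."""
    if len(candidate) != len(baseline):
        return False
    return all(abs(c - b) <= atol + rtol * abs(b)
               for c, b in zip(candidate, baseline))


def median_seconds(fn, inputs, repeats=5):
    """Median wall-clock time of fn(inputs) over several runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(inputs)
        times.append(time.perf_counter() - start)
    times.sort()
    return max(times[len(times) // 2], 1e-12)  # guard against a zero reading


def evaluate(baseline_fn, candidate_fn, inputs):
    """Return ("PASS" or "FAIL", speedup), where speedup = baseline / candidate time."""
    passed = allclose(candidate_fn(inputs), baseline_fn(inputs))
    speedup = median_seconds(baseline_fn, inputs) / median_seconds(candidate_fn, inputs)
    return ("PASS" if passed else "FAIL"), speedup


def baseline_square(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out


def candidate_square(xs):
    return [x * x for x in xs]


status, speedup = evaluate(baseline_square, candidate_square,
                           [0.5 * i for i in range(10_000)])
print(status, f"{speedup:.2f}x")
```

On a GPU, wall-clock timing like this would also need device synchronization before each clock read, since kernel launches are asynchronous.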
