Benchmarking

kb-nano uses a three-tier benchmarking system. Each tier adds scope: isolated kernels → full-model inference → multi-model evaluation sweeps.

| Tier | Scope | CLI |
| --- | --- | --- |
| 1 (Kernel) | Single operator `forward()` in isolation | `kb_nano kernels` |
| 2 (E2E) | Full-model throughput, latency, serving | `kb_nano e2e` |
| 3 (Eval) | Standardized multi-model sweep | `kb_nano eval` |

Tier 1: Kernel Benchmark

Tests a single operator replacement: the benchmark instantiates the baseline and candidate nn.Module side by side, copies the baseline weights into the candidate, and compares forward() outputs and timing across a registry of input shapes.

# Run all operators that have candidates in tasks/candidate/
kb_nano kernels

# Single operator
kb_nano kernels --target rms_norm

# Filter by model and TP (use --category to filter by category)
kb_nano kernels --target rms_norm --model llama31 --tp 1

# List available targets
kb_nano kernels --list
kb_nano kernels --list --level 1

# Show model-to-operator mapping
kb_nano kernels --map

Key flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--target` | all | Operator name (e.g. `rms_norm`, `attention`) |
| `--model` | all | Filter scenarios by model prefix |
| `--tp` | all | Filter by TP degree(s) |
| `--category` | all | Filter by category (e.g. `llm`) |
| `--num-warmup` | 10 | Warmup iterations |
| `--num-runs` | 100 | Timed iterations for median |
| `--output-json` | `bench/results/kernels.json` | Results path |

A scenario passes if the maximum absolute error between the candidate and baseline outputs is below 0.01 (the bfloat16 tolerance). A speedup above 1.0 means the candidate is faster than the baseline.
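The pass criterion and speedup figure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, outputs are flattened to lists of floats, and speedup is assumed to be the median baseline time divided by the median candidate time. Only the 0.01 threshold and the use of a median come from the text above.

```python
from statistics import median

# Threshold taken from the text; everything else here is a hypothetical helper.
MAX_ABS_ERROR_TOL = 0.01


def max_abs_error(baseline, candidate):
    """Largest element-wise absolute difference between two flat outputs."""
    if len(baseline) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(b - c) for b, c in zip(baseline, candidate))


def evaluate_scenario(baseline_out, candidate_out, baseline_ms, candidate_ms):
    """Return (passed, speedup) for one scenario.

    Assumption: speedup = median baseline time / median candidate time,
    so a value above 1.0 means the candidate is faster.
    """
    err = max_abs_error(baseline_out, candidate_out)
    speedup = median(baseline_ms) / median(candidate_ms)
    return err < MAX_ABS_ERROR_TOL, speedup


passed, speedup = evaluate_scenario(
    baseline_out=[0.50, -1.25, 2.00],
    candidate_out=[0.502, -1.251, 1.996],
    baseline_ms=[1.0, 1.2, 1.1],
    candidate_ms=[0.5, 0.6, 0.55],
)
print(passed, round(speedup, 2))
```

Using the median rather than the mean keeps a single slow iteration from skewing the comparison.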

Tier 2: E2E Benchmark

Full-model inference benchmarks. The CLI mirrors vLLM’s interface, so the same flags work in both.

# Throughput
kb_nano e2e throughput \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 512 --random-output-len 128 \
    --num-prompts 200 --tp 4

# Latency
kb_nano e2e latency \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 128 --output-len 128

# Online serving
kb_nano e2e serve [args...]

Tier 3: Eval Sweep

Runs a standardized evaluation across all candidate kernels using fixed workloads. Each job pair (baseline vs. candidate) runs in its own isolated subprocess, so that one job cannot contaminate another's measurements.

# Full sweep (auto-detects models from candidates)
kb_nano eval

# Specific model and TP
kb_nano eval \
    --model meta-llama/Llama-3.1-8B-Instruct --tp 1 4

# Filter by category
kb_nano eval --category llm

# Control scale
kb_nano eval --num-prompts 500 --gpu-pool 4

Key flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--model` | auto | HuggingFace model name(s) |
| `--tp` | `1 4` | TP degree(s) to evaluate |
| `--category` | all | Filter by category |
| `--num-prompts` | 1000 | Prompts per throughput workload |
| `--gpu-pool` | 8 | GPUs available for scheduling |
| `--seed` | 42 | Random seed |
| `--output-json` | `bench/results/eval.json` | Results path |
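The subprocess isolation described for job pairs can be illustrated with the standard library alone. The sketch below is an assumption about the general pattern, not kb-nano's actual implementation: `run_pair_isolated` and the JSON job format are hypothetical, and the child process only echoes its job back instead of running a benchmark.

```python
import json
import subprocess
import sys

# Stand-in child program: a real child would run the baseline and the
# candidate for one job; this one only reports which job it received.
CHILD_SOURCE = r"""
import json, sys
job = json.loads(sys.argv[1])
result = {"workload": job["workload"], "tp": job["tp"],
          "variants": ["baseline", "candidate"]}
print(json.dumps(result))
"""


def run_pair_isolated(job):
    """Run one baseline/candidate job pair in a fresh Python interpreter.

    A new process starts with clean module and allocator state, so nothing
    left over from a previous job can influence this one.
    """
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, json.dumps(job)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


result = run_pair_isolated({"workload": "balanced", "tp": 1})
print(result)
```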

Standardized workloads

Eval uses fixed, non-configurable workloads for reproducibility.

Throughput (per model/TP pair):

prefill-heavy — 1024 input, 512 output tokens
balanced — 512 input, 512 output tokens
decode-heavy — 512 input, 1024 output tokens

Latency (per model/TP pair):

single-request — batch 1, 128 in/out tokens
fixed-batch-32 — batch 32, 128 in/out tokens
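As a rough illustration, the fixed workloads above can be written down as immutable data and expanded into one job per model/TP pair. The class and function names are hypothetical; only the workload names and token counts come from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputWorkload:
    name: str
    input_len: int   # input tokens per prompt
    output_len: int  # output tokens per prompt


@dataclass(frozen=True)
class LatencyWorkload:
    name: str
    batch_size: int
    input_len: int
    output_len: int


# Values copied from the workload lists; frozen so they cannot be changed.
THROUGHPUT_WORKLOADS = (
    ThroughputWorkload("prefill-heavy", 1024, 512),
    ThroughputWorkload("balanced", 512, 512),
    ThroughputWorkload("decode-heavy", 512, 1024),
)
LATENCY_WORKLOADS = (
    LatencyWorkload("single-request", 1, 128, 128),
    LatencyWorkload("fixed-batch-32", 32, 128, 128),
)


def jobs_for(model, tp_degrees):
    """Enumerate (model, tp, workload name) for every fixed workload."""
    return [
        (model, tp, workload.name)
        for tp in tp_degrees
        for workload in THROUGHPUT_WORKLOADS + LATENCY_WORKLOADS
    ]


jobs = jobs_for("meta-llama/Llama-3.1-8B-Instruct", [1, 4])
print(len(jobs))
```

With the default TP degrees of 1 and 4, each model expands to five workloads per TP degree.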

Input Registry

Kernel benchmarks (Tier 1) draw input shapes from a YAML manifest at bench/utils/inputs/llm.yaml. Each entry specifies tensor shapes, dtypes, and init args for a given operator × model × TP × sequence-length combination.

# Regenerate the manifest from HuggingFace model configs
kb_nano generate-inputs

# Capture golden data for data-dependent operators
kb_nano capture-golden \
    --model meta-llama/Llama-3.1-8B-Instruct
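To make the operator × model × TP × sequence-length structure concrete, here is a minimal sketch of how such a manifest could be enumerated and then filtered the way `--model` and `--tp` do. The field names (`operator`, `inputs`, `shape`, and so on) are hypothetical and are not the actual schema of llm.yaml; the hidden size of 4096 is the Llama 3.1 8B value, and the `llama31` prefix is borrowed from the Tier 1 example.

```python
from itertools import product


def build_entries(operators, models, tp_degrees, seq_lens, hidden_sizes):
    """One entry per operator x model x TP x sequence-length combination."""
    entries = []
    for op, model, tp, seq_len in product(operators, models, tp_degrees, seq_lens):
        entries.append({
            "operator": op,
            "model": model,
            "tp": tp,
            "seq_len": seq_len,
            # Hypothetical layout: a single (seq_len, hidden_size) input tensor.
            "inputs": [{"shape": [seq_len, hidden_sizes[model]], "dtype": "bfloat16"}],
        })
    return entries


def select(entries, model_prefix=None, tp_degrees=None):
    """Filter entries by model prefix and TP degree; None means no filter."""
    return [
        e for e in entries
        if (model_prefix is None or e["model"].startswith(model_prefix))
        and (tp_degrees is None or e["tp"] in tp_degrees)
    ]


entries = build_entries(
    operators=["rms_norm"],
    models=["llama31"],
    tp_degrees=[1, 4],
    seq_lens=[128, 512],
    hidden_sizes={"llama31": 4096},
)
tp4_only = select(entries, model_prefix="llama", tp_degrees={4})
print(len(entries), len(tp4_only))
```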

Experiment Tracking

All three benchmark tiers automatically log their results to MLflow. After any kb_nano kernels, kb_nano e2e, or kb_nano eval run, you can query the results with kb_nano history or browse them in the MLflow web UI via kb_nano mlflow-ui. See Experiment Tracking for details.

Conflict Resolution

When multiple candidate kernels overlap (e.g. an L2 attention kernel that internally replaces an L1 rms_norm), the kernel swapper detects the subsumption automatically and emits a warning. Candidates are always applied bottom-up, from level L1 to L4.
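The bottom-up ordering and subsumption check might look roughly like the sketch below. The `Candidate` class, its `replaces_internally` field, and `plan_swaps` are hypothetical names for illustration; how the real swapper discovers which operators a kernel replaces internally is not described here, so the sketch assumes that information is declared up front.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    level: int  # 1..4, corresponding to L1..L4
    # Assumption: operators this kernel replaces internally are declared.
    replaces_internally: frozenset = frozenset()


def plan_swaps(candidates):
    """Return (application order, subsumption pairs).

    The order is bottom-up by level. Each pair is (subsumed, subsuming):
    a candidate that another, higher-level candidate already replaces.
    """
    ordered = sorted(candidates, key=lambda c: c.level)
    present = {c.name for c in ordered}
    subsumed = [
        (inner, c.name)
        for c in ordered
        for inner in sorted(c.replaces_internally & present)
    ]
    return [c.name for c in ordered], subsumed


order, subsumed = plan_swaps([
    Candidate("attention", level=2, replaces_internally=frozenset({"rms_norm"})),
    Candidate("rms_norm", level=1),
])
for inner, outer in subsumed:
    print(f"warning: candidate {inner} is subsumed by {outer}")
print(order)
```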
