kb-nano uses a three-tier benchmarking system. Each tier adds scope: isolated kernels → full-model inference → multi-model evaluation sweeps.
| Tier | Scope | CLI |
|---|---|---|
| 1 — Kernel | Single operator forward() in isolation | kb_nano kernels |
| 2 — E2E | Full-model throughput, latency, serving | kb_nano e2e |
| 3 — Eval | Standardized multi-model sweep | kb_nano eval |
Tier 1: Kernel Benchmark
Tier 1 tests a single operator replacement: it instantiates the baseline and candidate nn.Modules side by side, copies weights between them, and compares forward() outputs and timing across a registry of input shapes.
# Run all operators that have candidates in tasks/candidate/
kb_nano kernels
# Single operator
kb_nano kernels --target rms_norm
# Filter by model and TP degree (add --category to filter by category)
kb_nano kernels --target rms_norm --model llama31 --tp 1
# List available targets
kb_nano kernels --list
kb_nano kernels --list --level 1
# Show model-to-operator mapping
kb_nano kernels --map
Key flags:
| Flag | Default | Description |
|---|---|---|
| --target | all | Operator name (e.g. rms_norm, attention) |
| --model | all | Filter scenarios by model prefix |
| --tp | all | Filter by TP degree(s) |
| --category | all | Filter by category (e.g. llm) |
| --num-warmup | 10 | Warmup iterations |
| --num-runs | 100 | Timed iterations for median |
| --output-json | bench/results/kernels.json | Results path |
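A scenario passes if the maximum absolute error between the baseline and candidate outputs is below 0.01 (a tolerance chosen for bfloat16). A speedup above 1.0 means the candidate is faster than the baseline.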
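The pass criterion and the median-based timing described above can be sketched in plain Python. This is only an illustration, not kb-nano's implementation: the function names are hypothetical, outputs are treated as flat sequences of floats rather than tensors, and speedup is assumed to be the baseline median time divided by the candidate median time.

```python
import statistics
import time


def median_time(fn, num_warmup=10, num_runs=100):
    """Median wall-clock seconds of fn() after untimed warmup iterations."""
    for _ in range(num_warmup):
        fn()
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def max_abs_error(baseline_out, candidate_out):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(baseline_out) != len(candidate_out):
        raise ValueError("output shapes differ")
    return max((abs(b - c) for b, c in zip(baseline_out, candidate_out)), default=0.0)


def evaluate_scenario(baseline_out, candidate_out, baseline_s, candidate_s, tol=0.01):
    """Apply the pass criterion; speedup is ASSUMED to be baseline / candidate."""
    err = max_abs_error(baseline_out, candidate_out)
    return {
        "max_abs_error": err,
        "passed": err < tol,
        "speedup": baseline_s / candidate_s,
    }


# Example: a small numerical difference that stays inside the tolerance,
# with a candidate that takes half as long as the baseline.
result = evaluate_scenario(
    [1.0, 2.0, 3.0], [1.0, 2.005, 3.0], baseline_s=4.0, candidate_s=2.0
)
print(result["passed"], result["speedup"])
```

On a GPU, asynchronous kernel launches mean a wall-clock timer like this would also need a device synchronization before each clock read; the sketch omits that to stay dependency-free.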
Tier 2: E2E Benchmark
Full-model inference benchmarks. The CLI mirrors vLLM’s interface, so the same flags work in both.
# Throughput
kb_nano e2e throughput \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --random-input-len 512 --random-output-len 128 \
--num-prompts 200 --tp 4
# Latency
kb_nano e2e latency \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 128 --output-len 128
# Online serving
kb_nano e2e serve [args...]
Tier 3: Eval Sweep
Runs a standardized evaluation across all candidate kernels using fixed workloads. Each job pair (baseline vs. candidate) runs in its own isolated subprocess, so one pair's state cannot contaminate another pair's measurements.
# Full sweep (auto-detects models from candidates)
kb_nano eval
# Specific model and TP
kb_nano eval \
--model meta-llama/Llama-3.1-8B-Instruct --tp 1 4
# Filter by category
kb_nano eval --category llm
# Control scale
kb_nano eval --num-prompts 500 --gpu-pool 4
Key flags:
| Flag | Default | Description |
|---|---|---|
| --model | auto | HuggingFace model name(s) |
| --tp | 1 4 | TP degree(s) to evaluate |
| --category | all | Filter by category |
| --num-prompts | 1000 | Prompts per throughput workload |
| --gpu-pool | 8 | GPUs available for scheduling |
| --seed | 42 | Random seed |
| --output-json | bench/results/eval.json | Results path |
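The subprocess isolation used by the eval sweep can be illustrated with the standard library. This is a generic sketch of the technique, not kb-nano's scheduler: the child script, the `HYPOTHETICAL_PATCHED` variable, and `run_job_pair` are all made up for illustration, and the child only echoes its arguments instead of running a benchmark.

```python
import json
import os
import subprocess
import sys

# Hypothetical child job: it mutates process-local state and reports back as JSON.
CHILD = """
import json, os, sys
model, tp = sys.argv[1], int(sys.argv[2])
os.environ["HYPOTHETICAL_PATCHED"] = "1"  # state changed only inside the child
print(json.dumps({"model": model, "tp": tp, "pid": os.getpid()}))
"""


def run_job_pair(model, tp, timeout=60):
    """Run one (model, TP) job pair in a fresh Python interpreter."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, model, str(tp)],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


report = run_job_pair("demo-model", 4)
# The child ran in a different process, and its environment change did not
# propagate back to the parent.
print(report["pid"] != os.getpid(), "HYPOTHETICAL_PATCHED" in os.environ)
```

Because each pair starts from a fresh interpreter, anything the job patches or caches disappears when the child exits, which is the property the sweep relies on.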
Standardized workloads
Eval uses fixed, non-configurable workloads for reproducibility:
Throughput (per model/TP pair):
prefill-heavy — 1024 input, 512 output tokens
balanced — 512 input, 512 output tokens
decode-heavy — 512 input, 1024 output tokens
Latency (per model/TP pair):
single-request — batch 1, 128 in/out tokens
fixed-batch-32 — batch 32, 128 in/out tokens
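To get a feel for the size of the throughput workloads listed above, the token counts can be tallied with a few lines of Python. The dictionary layout and the `total_tokens` helper are hypothetical, and the sketch assumes every prompt uses exactly the listed input and output lengths.

```python
# Throughput workloads as listed above (token counts per prompt).
THROUGHPUT_WORKLOADS = {
    "prefill-heavy": {"input_len": 1024, "output_len": 512},
    "balanced": {"input_len": 512, "output_len": 512},
    "decode-heavy": {"input_len": 512, "output_len": 1024},
}


def total_tokens(workload, num_prompts=1000):
    """Total input and output tokens for one workload (1000 is the --num-prompts default)."""
    spec = THROUGHPUT_WORKLOADS[workload]
    return {
        "input": spec["input_len"] * num_prompts,
        "output": spec["output_len"] * num_prompts,
    }


for name in THROUGHPUT_WORKLOADS:
    print(name, total_tokens(name))
```

Under these assumptions, the default 1000 prompts give the prefill-heavy workload 1,024,000 input tokens and 512,000 output tokens per model/TP pair.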
Kernel benchmarks (Tier 1) draw input shapes from a YAML manifest at bench/utils/inputs/llm.yaml. Each entry specifies tensor shapes, dtypes, and init args for a given operator × model × TP × sequence-length combination.
# Regenerate the manifest from HuggingFace model configs
kb_nano generate-inputs
# Capture golden data for data-dependent operators
kb_nano capture-golden \
--model meta-llama/Llama-3.1-8B-Instruct
Experiment Tracking
All three benchmark tiers automatically log their results to MLflow. After any kb_nano kernels, kb_nano e2e, or kb_nano eval run, you can query the results with kb_nano history or browse them in the MLflow web UI via kb_nano mlflow-ui. See Experiment Tracking for details.
Conflict Resolution
When multiple candidate kernels overlap (e.g. an L2 attention kernel internally replaces an L1 rms_norm), the kernel swapper automatically detects the subsumption and emits a warning. Candidates are always applied bottom-up (L1 → L4).
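The ordering and subsumption check can be sketched as follows. This is a minimal illustration under stated assumptions, not the kernel swapper's actual logic: the `Candidate` class, its `replaces_internally` field, and `plan_application` are hypothetical names, and subsumption is modeled simply as a higher-level candidate declaring which operators it replaces internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    level: int  # 1 (lowest) to 4 (highest)
    replaces_internally: frozenset = frozenset()  # hypothetical metadata


def plan_application(candidates):
    """Return the bottom-up application order plus any subsumption warnings."""
    ordered = sorted(candidates, key=lambda c: (c.level, c.name))
    present = {c.name for c in candidates}
    warnings = []
    for cand in ordered:
        # Warn when this candidate internally replaces another active candidate.
        for inner in sorted(cand.replaces_internally & present):
            warnings.append(f"{cand.name} (L{cand.level}) subsumes {inner}")
    return [c.name for c in ordered], warnings


order, warnings = plan_application([
    Candidate("attention", 2, frozenset({"rms_norm"})),
    Candidate("rms_norm", 1),
])
print(order)     # ['rms_norm', 'attention']
print(warnings)  # ['attention (L2) subsumes rms_norm']
```

Sorting by level before applying keeps the order deterministic regardless of how the candidates were discovered.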