kb-nano uses a three-tier benchmarking system. Each tier adds scope: isolated kernels → full-model inference → multi-model evaluation sweeps.
| Tier | Scope | CLI |
| --- | --- | --- |
| 1: Kernel | Single operator forward() in isolation | kb_nano kernels |
| 2: E2E | Full-model throughput, latency, serving | kb_nano e2e |
| 3: Eval | Standardized multi-model sweep | kb_nano eval |

Tier 1: Kernel Benchmark

Tests a single operator replacement by instantiating baseline and candidate nn.Modules side-by-side, copying weights, and comparing forward() outputs and timing across a registry of input shapes.
# Run all operators that have candidates in tasks/candidate/
kb_nano kernels

# Single operator
kb_nano kernels --target rms_norm

# Filter by model and TP degree (add --category to filter by category as well)
kb_nano kernels --target rms_norm --model llama31 --tp 1

# List available targets
kb_nano kernels --list
kb_nano kernels --list --level 1

# Show model-to-operator mapping
kb_nano kernels --map
Key flags:
| Flag | Default | Description |
| --- | --- | --- |
| --target | all | Operator name (e.g. rms_norm, attention) |
| --model | all | Filter scenarios by model prefix |
| --tp | all | Filter by TP degree(s) |
| --category | all | Filter by category (e.g. llm) |
| --num-warmup | 10 | Warmup iterations |
| --num-runs | 100 | Timed iterations for median |
| --output-json | bench/results/kernels.json | Results path |
A scenario passes if the maximum absolute error between the candidate and baseline outputs is below 0.01 (the bfloat16 tolerance). A speedup above 1.0 means the candidate is faster than the baseline.
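The pass criterion and speedup figure can be illustrated with a small stand-alone sketch. This is not kb-nano's implementation: it uses plain Python lists instead of nn.Modules, the function names are hypothetical, and it assumes speedup is the baseline median time divided by the candidate median time. Only the 0.01 tolerance and the warmup and run counts come from the text and table above.

```python
import math
import statistics
import time


def rms_norm_baseline(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector (plain-Python stand-in for a module)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def rms_norm_candidate(x, weight, eps=1e-6):
    """Candidate: same math, one reciprocal instead of a division per element."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def median_seconds(fn, args, num_warmup=10, num_runs=100):
    """Median wall-clock time of fn(*args), measured after warmup iterations."""
    for _ in range(num_warmup):
        fn(*args)
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline, candidate, args, tolerance=0.01):
    """Return (passed, max_abs_error, speedup) for one scenario."""
    ref, out = baseline(*args), candidate(*args)
    max_abs_error = max(abs(r - o) for r, o in zip(ref, out))
    # Assumed definition: values above 1.0 mean the candidate is faster.
    speedup = median_seconds(baseline, args) / median_seconds(candidate, args)
    return max_abs_error < tolerance, max_abs_error, speedup


x = [0.01 * i - 1.0 for i in range(256)]
weight = [1.0] * 256
passed, err, speedup = compare(rms_norm_baseline, rms_norm_candidate, (x, weight))
print(f"passed={passed} max_abs_error={err:.2e} speedup={speedup:.2f}x")
```

Taking the median over many timed runs, rather than the mean, keeps a few slow outlier iterations from skewing the reported speedup.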

Tier 2: E2E Benchmark

Full-model inference benchmarks. The CLI mirrors vLLM’s interface, so the same flags work in both.
# Throughput
kb_nano e2e throughput \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 512 --random-output-len 128 \
    --num-prompts 200 --tp 4

# Latency
kb_nano e2e latency \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 128 --output-len 128

# Online serving
kb_nano e2e serve [args...]

Tier 3: Eval Sweep

Runs a standardized evaluation across all candidate kernels using fixed workloads. Each job pair (baseline vs candidate) runs in an isolated subprocess to prevent contamination.
# Full sweep (auto-detects models from candidates)
kb_nano eval

# Specific model and TP
kb_nano eval \
    --model meta-llama/Llama-3.1-8B-Instruct --tp 1 4

# Filter by category
kb_nano eval --category llm

# Control scale
kb_nano eval --num-prompts 500 --gpu-pool 4
Key flags:
| Flag | Default | Description |
| --- | --- | --- |
| --model | auto | HuggingFace model name(s) |
| --tp | 1 4 | TP degree(s) to evaluate |
| --category | all | Filter by category |
| --num-prompts | 1000 | Prompts per throughput workload |
| --gpu-pool | 8 | GPUs available for scheduling |
| --seed | 42 | Random seed |
| --output-json | bench/results/eval.json | Results path |
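The subprocess isolation mentioned for eval job pairs can be sketched with the Python standard library. This is a minimal illustration and not kb-nano's scheduler: the worker script, the pair names, and the JSON result format are all hypothetical, and the worker only reports its process ID where a real job would load a model and time both variants.

```python
import json
import os
import subprocess
import sys

# Hypothetical worker: handles one baseline/candidate pair and reports as JSON.
WORKER = """
import json, os, sys
pair = sys.argv[1]
# Stand-in for loading the model and timing both variants.
print(json.dumps({"pair": pair, "pid": os.getpid()}))
"""


def run_pair_isolated(pair_name):
    """Run one job pair in a fresh interpreter so no state leaks between pairs."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, pair_name],
        capture_output=True,
        text=True,
        check=True,  # raise if the worker exits with a non-zero status
    )
    return json.loads(proc.stdout)


results = [run_pair_isolated(name) for name in ("rms_norm/tp1", "rms_norm/tp4")]
for r in results:
    print(r["pair"], "ran in pid", r["pid"])
```

Because each pair starts in a new process, patched modules, cached allocations, and other global state from one pair cannot affect the measurements of the next.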

Standardized workloads

Eval uses fixed, non-configurable workloads for reproducibility:

Throughput (per model/TP pair):
  • prefill-heavy — 1024 input, 512 output tokens
  • balanced — 512 input, 512 output tokens
  • decode-heavy — 512 input, 1024 output tokens
Latency (per model/TP pair):
  • single-request — batch 1, 128 in/out tokens
  • fixed-batch-32 — batch 32, 128 in/out tokens
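To make the size of a sweep concrete, the fixed workloads above can be written as data and expanded per model/TP pair. The `Workload` class and `expand` helper are hypothetical names for this sketch, not part of kb-nano; the token counts and batch sizes are taken from the list above.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Workload:
    kind: str                         # "throughput" or "latency"
    name: str
    input_len: int                    # input tokens per request
    output_len: int                   # output tokens per request
    batch_size: Optional[int] = None  # only set for latency workloads


WORKLOADS = (
    Workload("throughput", "prefill-heavy", 1024, 512),
    Workload("throughput", "balanced", 512, 512),
    Workload("throughput", "decode-heavy", 512, 1024),
    Workload("latency", "single-request", 128, 128, batch_size=1),
    Workload("latency", "fixed-batch-32", 128, 128, batch_size=32),
)


def expand(models, tp_degrees):
    """Every (model, tp, workload) combination a sweep would cover."""
    return list(product(models, tp_degrees, WORKLOADS))


jobs = expand(["meta-llama/Llama-3.1-8B-Instruct"], [1, 4])
print(len(jobs))  # 1 model x 2 TP degrees x 5 workloads = 10
```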

Input Registry

Kernel benchmarks (Tier 1) draw input shapes from a YAML manifest at bench/utils/inputs/llm.yaml. Each entry specifies tensor shapes, dtypes, and init args for a given operator × model × TP × sequence-length combination.
# Regenerate the manifest from HuggingFace model configs
kb_nano generate-inputs

# Capture golden data for data-dependent operators
kb_nano capture-golden \
    --model meta-llama/Llama-3.1-8B-Instruct
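As a rough picture of what one entry per operator × model × TP × sequence-length combination could look like, the sketch below builds entries from a model config. The entry layout, field names, and sequence lengths are assumptions made for illustration and are not the actual schema of llm.yaml. The config values are those of Llama-3.1-8B (hidden size 4096, 32 attention heads, head dimension 128); mapping the `llama31` prefix to that model is also an assumption.

```python
from itertools import product

# Config values for Llama-3.1-8B; the "llama31" key is an assumed mapping.
MODEL_CONFIGS = {"llama31": {"hidden_size": 4096, "num_heads": 32, "head_dim": 128}}
TP_DEGREES = [1, 4]
SEQ_LENS = [128, 2048]  # hypothetical sequence lengths


def rms_norm_entry(model, cfg, tp, seq_len):
    """RMSNorm acts on the full hidden dimension, so its shape ignores TP."""
    return {
        "operator": "rms_norm", "model": model, "tp": tp, "seq_len": seq_len,
        "inputs": {"x": {"shape": [seq_len, cfg["hidden_size"]], "dtype": "bfloat16"}},
        "init_args": {"hidden_size": cfg["hidden_size"]},
    }


def attention_entry(model, cfg, tp, seq_len):
    """Attention heads are split across TP ranks, so each rank sees num_heads // tp."""
    heads = cfg["num_heads"] // tp
    return {
        "operator": "attention", "model": model, "tp": tp, "seq_len": seq_len,
        "inputs": {"q": {"shape": [seq_len, heads, cfg["head_dim"]], "dtype": "bfloat16"}},
        "init_args": {"num_heads": heads, "head_dim": cfg["head_dim"]},
    }


entries = [
    build(model, cfg, tp, seq_len)
    for build in (rms_norm_entry, attention_entry)
    for (model, cfg), tp, seq_len in product(MODEL_CONFIGS.items(), TP_DEGREES, SEQ_LENS)
]
print(len(entries))  # 2 operators x 1 model x 2 TP degrees x 2 sequence lengths = 8
```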

Experiment Tracking

All three benchmark tiers automatically log their results to MLflow. After any kb_nano kernels, kb_nano e2e, or kb_nano eval run, you can query the results with kb_nano history or browse them in the MLflow web UI via kb_nano mlflow-ui. See Experiment Tracking for details.

Conflict Resolution

When multiple candidate kernels overlap (e.g. an L2 attention kernel internally replaces an L1 rms_norm), the kernel swapper detects the subsumption automatically and emits a warning. Candidates are always applied bottom-up (L1 → L4).
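The ordering and subsumption check described above can be sketched as follows. This is an illustrative model rather than the kernel swapper's actual code: the `Candidate` class, its `replaces` field, and the warning text are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    level: int                                   # 1 (lowest) to 4 (highest)
    replaces: set = field(default_factory=set)   # operators replaced internally


def plan(candidates):
    """Return (application order, warnings) for a set of candidate kernels."""
    ordered = sorted(candidates, key=lambda c: c.level)  # bottom-up: L1 first
    warnings = []
    for lower in ordered:
        for higher in ordered:
            # A higher-level candidate that internally replaces a lower-level
            # operator subsumes the standalone candidate for that operator.
            if higher.level > lower.level and lower.name in higher.replaces:
                warnings.append(
                    f"{higher.name} (L{higher.level}) subsumes "
                    f"{lower.name} (L{lower.level})"
                )
    return [c.name for c in ordered], warnings


order, warnings = plan([
    Candidate("attention", level=2, replaces={"rms_norm"}),
    Candidate("rms_norm", level=1),
])
print(order)     # ['rms_norm', 'attention']
print(warnings)  # ['attention (L2) subsumes rms_norm (L1)']
```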