fastkernels includes an agent that uses an LLM to generate replacement kernels, validate them, and benchmark them end-to-end, all in a single run. You can use it as-is or as a starting point for building your own agent.
How the Agent Works
The agent follows a four-stage pipeline:
- Discover: Given a model and an operator level (L1–L4), the agent queries the benchmark registry to find all target operators.
- Generate: For each operator, the agent sends the baseline source code to an LLM with a prompt requesting a faster replacement. All operators are generated in parallel.
- Validate: Each generated kernel is compiled and checked for a matching class name and a successful __init__. If validation fails, the error is fed back to the LLM for a retry (up to --max-retries attempts).
- Benchmark: All successful kernels are patched into the model and benchmarked end-to-end, measuring token match rate and wall-clock speedup. If a kernel causes a runtime failure, the agent identifies it, re-generates it, and re-runs the benchmark.
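The generate, validate, and retry stages can be sketched as a small loop. This is a minimal illustration and not the agent's actual code: `generate_with_retries`, `fake_llm`, and `check` are hypothetical names, the LLM call is replaced by a stub so the loop runs offline, and the sketch assumes that `--max-retries` counts retries after the first attempt.

```python
from typing import Callable, Optional, Tuple


def generate_with_retries(
    baseline_source: str,
    llm: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    max_retries: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (validated code or None, number of LLM calls made)."""
    prompt = f"Write a faster replacement for:\n{baseline_source}"
    # Assumption: one initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        code = llm(prompt)
        error = validate(code)
        if error is None:
            return code, attempt
        # Feed the error back so the next attempt can correct it.
        prompt = f"The previous attempt failed with:\n{error}\nPlease fix it:\n{code}"
    return None, max_retries + 1


def fake_llm(prompt: str) -> str:
    # Hypothetical stub: wrong class name first, correct class after feedback.
    if "failed" in prompt:
        return "class FastOp:\n    def __init__(self):\n        pass\n"
    return "class WrongName:\n    pass\n"


def check(code: str) -> Optional[str]:
    # Stand-in validator: execute the code and instantiate the expected class.
    namespace: dict = {}
    try:
        exec(compile(code, "<candidate>", "exec"), namespace)
        namespace["FastOp"]()
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


code, calls = generate_with_retries("class FastOp: ...", fake_llm, check)
print(calls)
```

Here the stub fails validation once, so the loop succeeds on its second LLM call.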
Running the Agent
| Flag | Description |
|---|---|
| --model | HuggingFace model name |
| --level | Operator level: 1 (kernels), 2 (blocks), 3 (decoders), 4 (models) |
| --cuda-only | Force raw CUDA only (no Triton or PyTorch builtins) |
| --max-retries | Max retries per kernel on compilation failure (default: 5) |
| --tp | Tensor parallelism degree (default: 1) |
| --llm-model | LLM model for generation (default: claude-opus-4-6) |
| --skip-unit-tests | Skip per-operator unit tests and go straight to the E2E benchmark |
Generated kernels are written to tasks/candidate/L{level}/{op_name}.py.
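To make the flag defaults and the candidate path template concrete, the sketch below mirrors the table with `argparse`. It is not the agent's real parser. Which flags are required is an assumption, and the model name `example-org/example-model` and operator name `rms_norm` are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented flags; required/optional status is assumed.
    parser = argparse.ArgumentParser(description="Sketch of the agent flags")
    parser.add_argument("--model", required=True, help="HuggingFace model name")
    parser.add_argument("--level", type=int, choices=[1, 2, 3, 4], required=True,
                        help="1 (kernels), 2 (blocks), 3 (decoders), 4 (models)")
    parser.add_argument("--cuda-only", action="store_true")
    parser.add_argument("--max-retries", type=int, default=5)
    parser.add_argument("--tp", type=int, default=1)
    parser.add_argument("--llm-model", default="claude-opus-4-6")
    parser.add_argument("--skip-unit-tests", action="store_true")
    return parser


def candidate_path(level: int, op_name: str) -> str:
    # Path template taken from the documentation.
    return f"tasks/candidate/L{level}/{op_name}.py"


args = build_parser().parse_args(
    ["--model", "example-org/example-model", "--level", "1"]
)
print(args.max_retries, args.tp, args.llm_model)
print(candidate_path(args.level, "rms_norm"))
```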
Building Your Own Agent
The agent in agent/agent.py is structured around a few composable pieces you can reuse or replace:
Operator Discovery
discover_operators returns a list of OperatorSpec objects, each containing the operator’s name, level, class name, source code, and which models use it. This is all the context your agent needs to generate a replacement.
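As a rough picture of the data involved, the sketch below models an operator spec with the five pieces of information listed above and filters a toy registry by model and level. The field names, the `select_operators` helper, and the registry contents are all hypothetical. The real `OperatorSpec` and `discover_operators` may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class OperatorSpec:
    # Field names are assumptions based on the description above.
    name: str
    level: int
    class_name: str
    source_code: str
    models: List[str] = field(default_factory=list)


# Toy registry with placeholder operators and model names.
REGISTRY = [
    OperatorSpec("rms_norm", 1, "RMSNorm", "class RMSNorm: ...", ["example-model"]),
    OperatorSpec("mlp_block", 2, "MLPBlock", "class MLPBlock: ...", ["example-model"]),
    OperatorSpec("rope", 1, "RotaryEmbedding", "class RotaryEmbedding: ...", ["other-model"]),
]


def select_operators(registry: List[OperatorSpec], model: str, level: int) -> List[OperatorSpec]:
    # Keep operators at the requested level that the given model uses.
    return [op for op in registry if op.level == level and model in op.models]


targets = select_operators(REGISTRY, "example-model", 1)
for op in targets:
    print(op.name, op.class_name)
```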
Prompt Construction
build_generation_prompt(op, cuda_only) constructs a detailed prompt that includes the baseline source code, the exact class and signature requirements, and performance guidance. build_retry_prompt(...) takes a failed attempt and its error message to produce a corrective prompt.
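The two prompt types might be assembled along the following lines. This is an illustrative sketch only: `sketch_generation_prompt` and `sketch_retry_prompt` are hypothetical stand-ins, and the wording is invented for the example rather than taken from the prompts that fastkernels actually sends.

```python
def sketch_generation_prompt(class_name: str, source_code: str, cuda_only: bool) -> str:
    # Backend guidance follows the --cuda-only flag description.
    backend = (
        "Use raw CUDA only; do not use Triton or PyTorch builtins."
        if cuda_only
        else "You may use CUDA, Triton, or PyTorch builtins."
    )
    return "\n".join([
        f"Write a faster drop-in replacement for the class {class_name}.",
        "Keep the class name and all method signatures unchanged.",
        backend,
        "Baseline source:",
        source_code,
    ])


def sketch_retry_prompt(base_prompt: str, failed_code: str, error: str) -> str:
    # Append the failed attempt and its error so the LLM can correct it.
    return "\n".join([
        base_prompt,
        "",
        "Your previous attempt was:",
        failed_code,
        "It failed with:",
        error,
        "Return a corrected version.",
    ])


prompt = sketch_generation_prompt("RMSNorm", "class RMSNorm: ...", cuda_only=True)
retry = sketch_retry_prompt(prompt, "class Wrong: ...", "KeyError: 'RMSNorm'")
print(retry)
```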
Validation
validate_kernel(code, expected_class_name) writes code to a temp file, imports it, and checks that the expected class exists and can be instantiated. Use this to gate submissions before running expensive benchmarks.
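The validation steps described above can be approximated with the standard library. The function below is a sketch under stated assumptions, not the real `validate_kernel`: the name `validate_kernel_sketch`, the return convention (`None` on success, an error string otherwise), and the no-argument constructor call are all assumptions.

```python
import importlib.util
import os
import tempfile
from typing import Optional


def validate_kernel_sketch(code: str, expected_class_name: str) -> Optional[str]:
    """Return None on success, or a message describing the failure."""
    with tempfile.TemporaryDirectory() as tmp:
        # Write the candidate to a temporary file so it can be imported.
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(code)
        spec = importlib.util.spec_from_file_location("candidate_kernel", path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)  # surfaces syntax and import errors
        except Exception as exc:
            return f"import failed: {type(exc).__name__}: {exc}"
        cls = getattr(module, expected_class_name, None)
        if not isinstance(cls, type):
            return f"class {expected_class_name!r} not found"
        try:
            cls()  # assumption: the class takes no constructor arguments
        except Exception as exc:
            return f"__init__ failed: {type(exc).__name__}: {exc}"
    return None


print(validate_kernel_sketch("class FastOp:\n    pass\n", "FastOp"))
print(validate_kernel_sketch("class Other:\n    pass\n", "FastOp"))
```

Returning the error as a string makes it easy to pass straight into a retry prompt.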
Benchmarking
Once your agent has produced kernels, place them at tasks/candidate/L{level}/{op_name}.py and use the benchmarking tools to evaluate them, either programmatically via run_benchmark(...) or from the CLI.
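The two headline metrics, token match rate and wall-clock speedup, can be illustrated with a few lines of Python. The definitions here are assumptions for illustration (position-wise token equality, and baseline time divided by candidate time). The helpers are hypothetical, and `run_benchmark(...)` may compute these metrics differently.

```python
import time
from typing import Callable, Sequence


def token_match_rate(baseline: Sequence[int], candidate: Sequence[int]) -> float:
    # Fraction of positions where both outputs agree; a length mismatch counts as a miss.
    if not baseline and not candidate:
        return 1.0
    matches = sum(b == c for b, c in zip(baseline, candidate))
    return matches / max(len(baseline), len(candidate))


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    # Values above 1.0 mean the candidate is faster than the baseline.
    return baseline_seconds / candidate_seconds


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    # Best-of-N wall-clock time, which reduces noise from other processes.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


rate = token_match_rate([1, 2, 3, 4], [1, 2, 9, 4])
elapsed = time_call(lambda: sum(range(1000)))
print(rate, speedup(2.0, 1.0), elapsed > 0)
```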
Experiment Tracking
All fastkernels agent runs are automatically logged to MLflow. You can also use the tracking API in your own agent to log kernel generations, benchmark results, and custom metrics.
Use fastkernels history to query past runs from the CLI, or fastkernels mlflow-ui to launch the web UI. See Experiment Tracking for full details.