vLLM Alignment Test
tests/bench_vllm.py runs fastkernels and vLLM side by side across three workload scenarios (prefill-heavy, balanced, decode-heavy) plus latency benchmarks, and compares throughput and per-token alignment.
Results are written to tests/results/<GPU>/<model>_tp<N>/results.json. The parser auto-discovers these files and generates tables and plots in tests/plots/<GPU>/.
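The directory layout above lends itself to simple glob-based discovery. The following is a minimal sketch of that idea, not the repository's actual parser: the helper name discover_results and the regular expression are assumptions made for illustration, and only the tests/results/<GPU>/<model>_tp<N>/results.json layout comes from this page.

```python
import re
from pathlib import Path

# Hypothetical pattern for run directories named <model>_tp<N>.
_RUN_DIR = re.compile(r"^(?P<model>.+)_tp(?P<tp>\d+)$")


def discover_results(root):
    """Return (gpu, model, tp, path) for each results.json found under root.

    Assumes the layout <root>/<GPU>/<model>_tp<N>/results.json.
    Directories that do not match the <model>_tp<N> naming are skipped.
    """
    found = []
    for path in sorted(Path(root).glob("*/*/results.json")):
        match = _RUN_DIR.match(path.parent.name)
        if match is None:
            continue  # not a <model>_tp<N> directory
        gpu = path.parent.parent.name
        found.append((gpu, match["model"], int(match["tp"]), path))
    return found
```

Calling discover_results("tests/results") would then return one tuple per benchmark run, which a plotting step could group by GPU.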
For a quick correctness check without throughput measurement, pass the --skip-throughput flag to tests/bench_vllm.py.
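To make "per-token alignment" concrete, here is a minimal sketch of one way such a comparison could be computed from two generated token-id sequences. The function token_alignment and its return convention are assumptions for illustration; the metric actually used by tests/bench_vllm.py may differ.

```python
def token_alignment(reference_ids, candidate_ids):
    """Compare two token-id sequences position by position.

    Returns (match_fraction, first_divergence), where match_fraction is
    the share of matching positions over the shorter sequence and
    first_divergence is the index of the first mismatch (None if the
    compared prefix is identical).
    """
    n = min(len(reference_ids), len(candidate_ids))
    if n == 0:
        raise ValueError("both sequences must contain at least one token")
    matches = 0
    first_divergence = None
    for i in range(n):
        if reference_ids[i] == candidate_ids[i]:
            matches += 1
        elif first_divergence is None:
            first_divergence = i
    return matches / n, first_divergence


# The sequences differ only at index 2.
print(token_alignment([1, 2, 3, 4], [1, 2, 9, 4]))  # (0.75, 2)
```

Reporting the first divergence alongside the match fraction is useful because, with greedy decoding, a single early mismatch usually changes every later token.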
Adding a New Model
Follow the L1 → L4 pattern (see Architecture for details on the hierarchy):
- L1: Identify which single-kernel ops are needed. Reuse existing L1 ops where possible (e.g. RMSNorm, SiluAndMul). Write new L1 modules only for ops that don’t exist yet.
- L2: Compose L1 ops into multi-op blocks (attention, MLP). Mirror the corresponding vLLM module’s __init__ and forward signatures.
- L3: Write a decoder layer that combines L2 attention + L2 MLP + L1 normalization.
- L4: Write the full model class (NewModelForCausalLM) with embedding, decoder stack, and LM head.
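The layering above can be sketched in pure Python on plain lists. This is only an illustration of how the four levels nest, under loud assumptions: none of these classes are the fastkernels or vLLM implementations, the constructor signatures are hypothetical, the pre-norm residual layout is assumed, and attention is reduced to the single-token case (where the softmax weight is 1).

```python
import math


def matvec(matrix, x):
    """Dense matrix-vector product on plain lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


# L1: single-op modules (stand-ins for real kernels).
class RMSNorm:
    def __init__(self, weight, eps=1e-6):
        self.weight, self.eps = weight, eps

    def forward(self, x):
        rms = math.sqrt(sum(v * v for v in x) / len(x) + self.eps)
        return [w * v / rms for w, v in zip(self.weight, x)]


class SiluAndMul:
    def forward(self, x):
        half = len(x) // 2  # first half: gate, second half: up projection
        return [g / (1.0 + math.exp(-g)) * u for g, u in zip(x[:half], x[half:])]


# L2: multi-op blocks composed from L1 ops.
class Attention:
    """Single-token placeholder: attention reduces to o_proj(v_proj(x))."""

    def __init__(self, v_proj, o_proj):
        self.v_proj, self.o_proj = v_proj, o_proj

    def forward(self, x):
        return matvec(self.o_proj, matvec(self.v_proj, x))


class MLP:
    def __init__(self, gate_up_proj, down_proj):
        self.gate_up_proj, self.down_proj = gate_up_proj, down_proj
        self.act = SiluAndMul()

    def forward(self, x):
        return matvec(self.down_proj, self.act.forward(matvec(self.gate_up_proj, x)))


# L3: decoder layer = L2 attention + L2 MLP + L1 normalization.
class DecoderLayer:
    def __init__(self, attn, mlp, input_norm, post_attn_norm):
        self.attn, self.mlp = attn, mlp
        self.input_norm, self.post_attn_norm = input_norm, post_attn_norm

    def forward(self, x):
        h = self.attn.forward(self.input_norm.forward(x))
        x = [a + b for a, b in zip(x, h)]  # residual around attention
        h = self.mlp.forward(self.post_attn_norm.forward(x))
        return [a + b for a, b in zip(x, h)]  # residual around MLP


# L4: full model = embedding + decoder stack + LM head.
class NewModelForCausalLM:
    def __init__(self, embedding, layers, lm_head):
        self.embedding, self.layers, self.lm_head = embedding, layers, lm_head

    def forward(self, token_id):
        x = list(self.embedding[token_id])
        for layer in self.layers:
            x = layer.forward(x)
        return matvec(self.lm_head, x)  # one logit per vocabulary entry
```

In a real port each level would be a torch module with weights loaded from a checkpoint; the point of the sketch is that a new model only adds code at the levels where no reusable module exists yet.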