fastkernels takes a top-down, model-driven approach: rather than collecting synthetic operators in isolation, benchmark tasks are dynamically instantiated from the configuration files of real production models. This ensures every operator is tested with realistic shapes, dtypes, and data flows. The layered hierarchy below makes it possible to benchmark a single kernel replacement while measuring its impact all the way up to full-model inference.
The L1-L4 Hierarchy
All model operators live under tasks/baseline/ at four abstraction levels:
- L1: Single-kernel ops (e.g. rms_norm, silu_and_mul, rotary_emb, linear)
- L2: Multi-op blocks (e.g. LlamaAttention, LlamaMLP, MixtralMoE, QKVParallelLinear)
- L3: Decoder layers (e.g. LlamaDecoderLayer, MixtralDecoderLayer)
- L4: Full models (e.g. LlamaForCausalLM, MixtralForCausalLM)
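To make the config-driven idea concrete, here is a minimal sketch of how input shapes for a few of the single-kernel ops could be derived from a model configuration. The function name `l1_input_shapes` and the returned dictionary layout are hypothetical and are not the fastkernels loader; the config keys follow the Hugging Face Llama naming convention, and the numeric values are Llama-style values chosen for illustration.

```python
def l1_input_shapes(config, num_tokens):
    """Derive input shapes for a few single-kernel ops from a Llama-style config.

    Hypothetical helper for illustration only.
    """
    hidden = config["hidden_size"]
    inter = config["intermediate_size"]
    heads = config["num_attention_heads"]
    kv_heads = config["num_key_value_heads"]
    head_dim = hidden // heads
    return {
        # rms_norm normalizes over the hidden dimension.
        "rms_norm": (num_tokens, hidden),
        # silu_and_mul consumes the concatenated gate and up projections.
        "silu_and_mul": (num_tokens, 2 * inter),
        # rotary_emb is applied to the flattened query and key heads.
        "rotary_emb.query": (num_tokens, heads * head_dim),
        "rotary_emb.key": (num_tokens, kv_heads * head_dim),
    }


# Illustrative Llama-style values (assumed, not read from a real file).
config = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
shapes = l1_input_shapes(config, num_tokens=16)
for op_name, shape in shapes.items():
    print(op_name, shape)
```

Because the shapes come from the config rather than being hand-picked, swapping in a different model configuration changes every benchmark shape consistently.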
Interface Mirroring
Every module’s __init__ and forward signatures mirror those of the corresponding vLLM module. This makes it easy to port optimizations between the two codebases and ensures candidate kernels can be validated against a well-known reference.
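Validation against a reference can be sketched as a tolerance comparison between two callables that share a signature. The example below is a pure-Python illustration on flat lists of floats, not the fastkernels harness: `matches_reference`, `candidate_silu_and_mul`, and the tolerance defaults are hypothetical. The reference follows the usual silu_and_mul semantics, where the first half of the input is passed through SiLU and multiplied by the second half.

```python
import math


def reference_silu_and_mul(x):
    """Reference semantics: silu(first half) * second half."""
    d = len(x) // 2
    return [(g / (1.0 + math.exp(-g))) * u for g, u in zip(x[:d], x[d:])]


def candidate_silu_and_mul(x):
    """Hypothetical candidate using the identity sigmoid(g) = 0.5 * (1 + tanh(g / 2))."""
    d = len(x) // 2
    return [g * 0.5 * (1.0 + math.tanh(0.5 * g)) * u for g, u in zip(x[:d], x[d:])]


def matches_reference(candidate, reference, inputs, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if candidate agrees with reference on every input within tolerance."""
    for x in inputs:
        got, want = candidate(x), reference(x)
        if len(got) != len(want):
            return False
        if not all(
            math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
            for g, w in zip(got, want)
        ):
            return False
    return True


inputs = [[0.0, 1.0, 2.0, 3.0], [-1.5, 0.5, 4.0, -2.0]]
ok = matches_reference(candidate_silu_and_mul, reference_silu_and_mul, inputs)
print("candidate matches reference:", ok)
```

Because both callables take the same arguments and return the same structure, the comparison needs no adapter code, which is the practical benefit of keeping signatures aligned.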
| fastkernels module | vLLM equivalent | forward signature |
|---|---|---|
| LlamaAttention | LlamaAttention | (positions, hidden_states) -> Tensor |
| LlamaDecoderLayer | LlamaDecoderLayer | (positions, hidden_states, residual) -> (Tensor, Tensor) |
| QKVParallelLinear | QKVParallelLinear | (x) -> (output, bias) |
| RowParallelLinear | RowParallelLinear | (x) -> (output, bias) |
| MergedColumnParallelLinear | MergedColumnParallelLinear | (x) -> (output, bias) |
| VocabParallelEmbedding | VocabParallelEmbedding | (input_ids) -> Tensor |
| ParallelLMHead | ParallelLMHead | (hidden_states) -> Tensor |
| SiluAndMul | SiluAndMul | (x) -> Tensor |
| RMSNorm | RMSNorm | (x, residual=None) -> Tensor or (Tensor, Tensor) |
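The parallel linear layers (QKVParallelLinear, RowParallelLinear, MergedColumnParallelLinear) return (output, bias) tuples to match vLLM. LlamaAttention takes rotary_emb as an __init__ parameter (not a forward argument), matching vLLM’s design where each LlamaAttention owns its rotary embedding.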
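The calling conventions in the table can be illustrated with a small pure-Python sketch. The classes `TinyLinear` and `TinyRMSNorm` are hypothetical stand-ins that operate on flat lists of floats; they are not the fastkernels or vLLM implementations and only show how the (output, bias) tuple and the optional residual argument shape the call sites.

```python
import math


class TinyLinear:
    """Hypothetical stand-in for a parallel linear layer: forward(x) -> (output, bias)."""

    def __init__(self, weight):
        self.weight = weight  # list of rows, one per output feature

    def forward(self, x):
        output = [sum(w * v for w, v in zip(row, x)) for row in self.weight]
        # This sketch has no bias term, so the second element is None.
        return output, None


class TinyRMSNorm:
    """Hypothetical stand-in following the (x, residual=None) convention."""

    def __init__(self, weight, eps=1e-6):
        self.weight = weight
        self.eps = eps

    def forward(self, x, residual=None):
        if residual is not None:
            # Fused residual add: the sum becomes the new residual stream.
            x = [a + b for a, b in zip(x, residual)]
            residual = x
        mean_sq = sum(v * v for v in x) / len(x)
        scale = 1.0 / math.sqrt(mean_sq + self.eps)
        out = [v * scale * w for v, w in zip(x, self.weight)]
        # One tensor without a residual, a (normed, residual) pair with one.
        return out if residual is None else (out, residual)


norm = TinyRMSNorm(weight=[1.0, 1.0], eps=0.0)
proj = TinyLinear(weight=[[1.0, 0.0], [0.0, 2.0]])

hidden = [3.0, 4.0]
normed = norm.forward(hidden)  # single-tensor form
normed_fused, residual = norm.forward([1.0, 1.0], residual=[2.0, 3.0])  # tuple form
projected, bias = proj.forward(hidden)  # callers unpack (output, bias)
```

A caller that only needs the projection result typically discards the second element, for example `output, _ = proj.forward(x)`, which keeps call sites identical whether or not a bias is returned.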