Stanford CS25: Serving Transformers - Lessons from the Trenches
Executive Summary
Serving transformer models at scale requires a fundamental shift from focusing on model capabilities to optimizing for the "boring infra" of production inference. The core challenge lies in the disparity between the prefill (input processing) and decode (token generation) phases, where the decode phase is typically memory-bandwidth bound, leading to poor hardware utilization. Success in production inference is achieved by defining precise workloads via Service Level Objectives (SLOs), selecting models based on efficiency versus capability bounds, and employing aggressive optimization techniques like speculative decoding and quantization.
Defining Inference Workloads and SLOs
Effective inference engineering begins with a clear definition of the workload, which determines the hardware and software architecture. Workloads generally fall into three archetypes:
- Chatbot Plus: Human-interactive systems (e.g., ChatGPT, Claude) where the primary metric is interactivity, measured as output tokens per second per user.
- Background Agents: Systems that perform complex tasks over seconds or minutes (e.g., coding agents) where the primary metric is time to last token.
- Data Processors: High-volume, bursty workloads (e.g., document indexing) where the primary metric is aggregate throughput, often measured in mega-tokens per dollar.
Key Performance Metrics
To manage these workloads, engineers must track specific metrics per replica to determine how to scale the system:
- Time to First Token (TTFT): The latency from request arrival until the user sees the first token of output.
- Inter-token Latency: The time elapsed between sequential tokens.
- Queries Per Second (QPS): The volume of requests, which often exhibits high seasonality and variability.
- Prefix Reuse: The frequency of overlapping input tokens, which allows for KV caching to reduce computation costs.
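As a minimal sketch of how the latency metrics above could be derived from per-token timestamps, consider the following. The `RequestTrace` record and its field names are assumptions made for illustration; they are not taken from the talk or from any particular inference engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record type)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # when each output token was emitted


def time_to_first_token(trace: RequestTrace) -> float:
    """TTFT: delay between request arrival and the first output token."""
    return trace.token_times[0] - trace.arrival


def inter_token_latencies(trace: RequestTrace) -> List[float]:
    """Gaps between consecutive output tokens."""
    times = trace.token_times
    return [later - earlier for earlier, later in zip(times, times[1:])]


def time_to_last_token(trace: RequestTrace) -> float:
    """End-to-end latency, the main metric for background agents."""
    return trace.token_times[-1] - trace.arrival


def tokens_per_second_per_user(trace: RequestTrace) -> float:
    """Interactivity: decode rate one user observes after the first token."""
    gaps = inter_token_latencies(trace)
    return len(gaps) / sum(gaps)


# Example: first token after 0.5 s, then one token every 0.25 s.
trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(time_to_first_token(trace), tokens_per_second_per_user(trace))
```

Aggregating these values per replica, rather than only fleet-wide, is what lets them inform scaling decisions.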
Model Selection: Efficiency vs. Capability Bounds
Model choice is driven by whether a task is efficiency-bound or capability-bound:
Efficiency-Bound Workloads
In these tasks, the required level of intelligence is relatively low and easily achievable. Cost becomes the primary driver. These workloads are typically served by open-source models (e.g., Mistral, Gemma) and often run on single-GPU replicas to maximize cost-performance.
Capability-Bound Workloads
These tasks require the highest available intelligence, where scaling up the model is currently the only solution. The models that serve them (e.g., frontier proprietary models) are massive, requiring multiple GPUs per replica and often multiple nodes. They are typically used as orchestrators in multi-agent systems, managing smaller, efficiency-bound sub-agents.
The Inference Stack: Engines and Hardware
Inference Engines
An inference engine consists of a server process for tokenization, a scheduler on the CPU, and the GPU execution. The scheduler's primary job is to ensure the GPU is never idle. Current leading open-source engines include:
- vLLM: Widely adopted with open governance.
- SGLang: Highly performance-focused with aggressive optimizations.
- TensorRT-LLM: A compiled C++ runtime from NVIDIA, optimal for small models and small batch sizes.
Hardware Constraints
Production inference is dominated by NVIDIA data center GPUs (SXM form factor) because of High Bandwidth Memory (HBM). The decode phase is severely memory-bound; while Tensor Cores can perform thousands of operations per byte, the decode phase only performs a few. This makes HBM and high-speed interconnects (NVLink) essential to avoid idling the compute units.
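The memory-bound argument can be checked with back-of-the-envelope arithmetic. The sketch below compares the operations-per-byte the hardware can sustain with what a decode step actually performs. The peak throughput and bandwidth constants are approximate, assumed datasheet-style figures for an H100 SXM at BF16, and KV cache traffic is ignored for simplicity; the exact ratio shifts with precision and hardware generation.

```python
# Assumed, approximate figures for an H100 SXM GPU (not from the talk).
PEAK_FLOPS = 989e12      # dense BF16 FLOP/s (assumption)
HBM_BANDWIDTH = 3.35e12  # bytes/s of HBM bandwidth (assumption)


def machine_balance(peak_flops: float, bandwidth: float) -> float:
    """FLOPs the GPU can execute in the time it takes to load one byte."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs per weight byte in one decode step.

    Each parameter is read once per step and contributes one multiply and
    one add for every sequence in the batch. KV cache reads are ignored.
    """
    return 2.0 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int, bytes_per_param: float) -> bool:
    """True when the step does less work per byte than the GPU can absorb."""
    balance = machine_balance(PEAK_FLOPS, HBM_BANDWIDTH)
    return decode_intensity(batch_size, bytes_per_param) < balance


print(machine_balance(PEAK_FLOPS, HBM_BANDWIDTH))  # hardware FLOPs per byte
print(decode_intensity(batch_size=1, bytes_per_param=2))  # decode FLOPs per byte
```

Under these assumptions a single-sequence decode step performs about one operation per byte loaded, so the compute units sit mostly idle unless many sequences are batched together.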
Deployment and Robustness at Scale
Serving thousands of GPUs introduces significant reliability and cost challenges:
- Hardware Failure: Across a fleet of this size, H100 GPUs have a mean time to failure measured in days or weeks. Systems must be built with redundancy so that hardware faults do not become system failures.
- Cold Start and Scaling: To handle variable traffic without wasting money on idle GPUs, systems need fast replica startup. This is achieved through lazy loading of file systems, multi-tier cloud caching, and checkpoint-restore technologies (like CRIU) to avoid the minutes-long JIT compilation and CUDA graph capture times.
Performance Optimization Techniques
Optimization should follow a hierarchy: start with high-impact algorithmic changes, move to host-side engineering, and finish with low-level kernel work.
Speculative Decoding
Speculative decoding addresses the memory-bandwidth bottleneck of the decode phase. A small "draft" model guesses the next few tokens, and the larger target model verifies them in a single forward pass. This can yield speedups of 2x to 8x, especially when using application-specific draft models.
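The draft-then-verify loop can be sketched as follows for the greedy case. This is an illustrative toy, not the implementation of any engine: the two "models" are hypothetical next-token functions, and the verification loop stands in for what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal. A real engine scores all positions
    #    in a single batched forward pass; this loop only imitates that.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            accepted.append(expected)  # correct the first mismatch, then stop
            return accepted
        accepted.append(proposed[i])

    # 3. Every draft token matched, so the same pass yields one bonus token.
    accepted.append(target(context + proposed))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             max_new_tokens: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees except that it guesses wrongly whenever the last token is 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_draft, prompt=[0], max_new_tokens=12))
```

In this greedy variant the output is identical to decoding with the target model alone; the gain comes from emitting several tokens per target pass whenever the draft guesses well, which is why a draft model tuned to the application helps.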
Quantization
Reducing precision (e.g., from FP8 to FP4) provides a roughly linear speedup in both memory-bound and compute-bound scenarios, because it reduces the bytes moved and increases the operations per second the hardware can sustain. However, this requires careful evaluation via "vibe checks" and formal evals, as quantization can degrade output quality on long sequences due to accumulated error.
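For the memory-bound case, the linear relationship follows from simple arithmetic: if every weight must be read once per decode step, the time per token is bounded below by weight bytes divided by memory bandwidth. The sketch below uses a hypothetical 70-billion-parameter model and an assumed bandwidth of 3.35 TB/s; both numbers are illustrative and not from the talk.

```python
import math


def min_decode_seconds_per_token(num_params: float, bits_per_param: int,
                                 bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode time per token when weight loading dominates."""
    weight_bytes = num_params * bits_per_param / 8
    return weight_bytes / bandwidth_bytes_per_s


NUM_PARAMS = 70e9   # hypothetical model size (assumption)
BANDWIDTH = 3.35e12  # bytes/s, assumed HBM bandwidth

fp8_time = min_decode_seconds_per_token(NUM_PARAMS, 8, BANDWIDTH)
fp4_time = min_decode_seconds_per_token(NUM_PARAMS, 4, BANDWIDTH)

print(f"FP8: {fp8_time * 1e3:.1f} ms/token, FP4: {fp4_time * 1e3:.1f} ms/token")
print("speedup:", fp8_time / fp4_time)
```

This bound says nothing about accuracy, which is why the speedup still has to be weighed against eval results.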
Host-Side Optimization
Before writing custom CUDA kernels, engineers should use tools like py-spy to find Python-level bottlenecks. Simple changes, such as caching pointers in a dictionary instead of recreating tensors, can yield significant gains (e.g., 10% in multimodal inference) without requiring GPU-side changes.
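The caching idea can be illustrated with a plain-Python sketch. The `BufferCache` class is hypothetical, and a `bytearray` stands in for a GPU tensor so the example runs without any GPU library; the point is only the pattern of looking up an existing buffer by key instead of rebuilding it on every call.

```python
from typing import Dict, Tuple


class BufferCache:
    """Reuse buffers keyed by shape instead of recreating them per request."""

    def __init__(self) -> None:
        self._buffers: Dict[Tuple[int, ...], bytearray] = {}
        self.allocations = 0  # counts how often a new buffer was created

    def get(self, shape: Tuple[int, ...]) -> bytearray:
        key = tuple(shape)
        buf = self._buffers.get(key)
        if buf is None:
            size = 1
            for dim in key:
                size *= dim
            buf = bytearray(size)  # stand-in for an expensive tensor allocation
            self._buffers[key] = buf
            self.allocations += 1
        return buf


cache = BufferCache()
first = cache.get((2, 3))
second = cache.get((2, 3))  # served from the dictionary, no new allocation
print(first is second, cache.allocations)
```

A reused buffer still holds whatever the previous caller wrote, so this pattern only fits cases where the contents are overwritten before being read.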
Observability and Debugging
Observability is defined as the ability to debug a system solely from logs. Critical practices include:
- Logging Token IDs: Logging only strings is insufficient; token IDs are necessary to debug subtle tokenizer and chat template bugs.
- Monitoring Power Draw: GPU power draw and temperature are "free" metrics that can signal host-side bottlenecks (e.g., if a GPU is drawing 2kW instead of 3kW, the CPU is likely failing to feed it work).
- Evaluating Tails: P95 and P99 latencies are critical because they represent the "stuttering" experience of the user, even if the median (P50) latency is low.
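To see why the median can hide a stuttering experience, here is a small sketch that computes tail percentiles with the nearest-rank method. The latency samples are made up for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up inter-token latencies in ms: most tokens are fast, a few stall.
latencies_ms = [20.0] * 94 + [400.0] * 6

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)
```

Here the median looks healthy while P95 and P99 expose the stalls that a user would actually notice.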
Summary
Charles Frye of Modal discusses the engineering challenges of serving transformer models at scale, focusing on the critical distinction between prefill and decode phases and the optimization of hardware utilization.